Hi
This series introduces the concept of "file sealing". Sealing a file restricts the set of allowed operations on the file in question. Multiple seals are defined and each seal will cause a different set of operations to return EPERM if it is set. The following seals are introduced:
 * SEAL_SHRINK: If set, the inode size cannot be reduced
 * SEAL_GROW: If set, the inode size cannot be increased
 * SEAL_WRITE: If set, the file content cannot be modified
Unlike existing techniques that provide similar protection, sealing allows file-sharing without any trust-relationship. This is enforced by rejecting seal modifications if you don't own an exclusive reference to the given file. So if you own a file-descriptor, you can be sure that no-one besides you can modify the seals on the given file. This allows mapping shared files from untrusted parties without the fear of the file getting truncated or modified by an attacker.
Several use-cases exist that could make great use of sealing:
1) Graphics Compositors
If a graphics client creates a memory-backed render-buffer and passes a file-descriptor for it to the graphics server for display, the server _has_ to set up SIGBUS handlers whenever it maps the given file. Otherwise, the client might run ftruncate() or O_TRUNC on the file in parallel, thus crashing the server. With sealing, a compositor can reject any incoming file-descriptor that does _not_ have SEAL_SHRINK set. This way, any memory-mappings are guaranteed to stay accessible. Furthermore, we still allow clients to increase the buffer-size in case they want to resize the render-buffer for the next frame. We also allow parallel writes so the client can render new frames into the same buffer (the client is responsible for never rendering into a front-buffer if it wants to avoid artifacts).
Real use-case: Wayland wl_shm buffers can be transparently converted
2) General-purpose IPC
IPC mechanisms that do not require a mutual trust-relationship (like dbus) cannot do zero-copy so far. With sealing, zero-copy can be easily done by sharing a file-descriptor that has SEAL_SHRINK | SEAL_GROW | SEAL_WRITE set. This way, the source can store sensitive data in the file, seal the file and then pass it to the destination. The destination verifies that these seals are set and can then parse the message in-line. Note that these files are usually one-shot files. Without any trust-relationship, a destination can notify the source that it released a file again, but a source can never rely on it. So unless the destination releases the file, a source cannot clear the seals for modification again. However, this is inherent to situations without any trust-relationship.
Real use-case: kdbus messages already use a similar interface and can be transparently converted to use these seals
Other similar use-cases exist (eg., audio), but these two I am personally working on. Interest in this interface has been raised from several other camps and I've put respective maintainers into CC. If more information on these use-cases is needed, I think they can give some insights.
The API introduced by this patchset is:
* fcntl() extension: Two new fcntl() commands are added that allow retrieving (SHMEM_GET_SEALS) and setting (SHMEM_SET_SEALS) seals on a file. Only shmfs implements them so far and there is no intention to implement them on other file-systems. All shmfs based files support sealing.
Patch 2/6
* memfd_create() syscall: The new memfd_create() syscall is a public frontend to the shmem_file_new() interface in the kernel. It avoids the need of a local shmfs mount-point (as requested by android people) and acts more like MAP_ANON than O_TMPFILE.
Patch 3/6
The other 4 patches are cleanups, self-tests and docs.
The commit-messages explain the API extensions in detail. Man-page proposals are also provided. Last but not least, the extensive self-tests document the intended behavior, in case it is still not clear.
Technically, sealing and memfd_create() are independent, but the described use-cases would greatly benefit from the combination of both. Hence, I merged them into the same series. Please also note that this series is based on earlier works (ashmem, memfd, shmgetfd, ..) and unifies these attempts.
Comments welcome!
Thanks David
David Herrmann (4):
  fs: fix i_writecount on shmem and friends
  shm: add sealing API
  shm: add memfd_create() syscall
  selftests: add memfd_create() + sealing tests

David Herrmann (2): (man-pages)
  fcntl.2: document SHMEM_SET/GET_SEALS commands
  memfd_create.2: add memfd_create() man-page
 arch/x86/syscalls/syscall_32.tbl           |   1 +
 arch/x86/syscalls/syscall_64.tbl           |   1 +
 fs/fcntl.c                                 |  12 +-
 fs/file_table.c                            |  27 +-
 include/linux/shmem_fs.h                   |  17 +
 include/linux/syscalls.h                   |   1 +
 include/uapi/linux/fcntl.h                 |  13 +
 include/uapi/linux/memfd.h                 |   9 +
 kernel/sys_ni.c                            |   1 +
 mm/shmem.c                                 | 267 +++++++-
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 972 +++++++++++++++++++++++++++++
 14 files changed, 1338 insertions(+), 15 deletions(-)
 create mode 100644 include/uapi/linux/memfd.h
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c
VM_DENYWRITE currently relies on i_writecount. As long as there's an active writable reference to an inode, VM_DENYWRITE is not allowed. Unfortunately, alloc_file() does not increase i_writecount and therefore does not prevent a following VM_DENYWRITE, even though the new file might have been opened with FMODE_WRITE. However, callers of alloc_file() expect the file object to be fully instantiated so they can call fput() on it. We could now either fix all callers to do a get_write_access() if opened with FMODE_WRITE, or simply fix alloc_file() to do that. I chose the latter.
Note that this bug allows some rather subtle misbehavior. The following sequence of calls should work just fine, but currently fails:

    int p[2], orig, ro, rw;
    char buf[128];

    pipe(p);
    sprintf(buf, "/proc/self/fd/%d", p[1]);
    ro = open(buf, O_RDONLY);
    close(p[1]);
    sprintf(buf, "/proc/self/fd/%d", ro);
    rw = open(buf, O_RDWR);
The final open() cannot succeed as close(p[1]) caused an integer underflow on i_writecount, effectively causing VM_DENYWRITE on the inode. The open will fail with -ETXTBSY.
It's a rather odd sequence of calls and given that open() doesn't use alloc_file() (and is thus not affected by this bug), it's rather unlikely that this is a serious issue. But stuff like anon_inode shares a *single* inode across a huge set of interfaces. If any of these is broken like pipe(), it will affect all of these (ranging from dma-buf to epoll).
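The counter behaviour described above can be replayed in a small user-space model. This is only a sketch: `struct inode_model`, the `model_*` helpers and `replay_pipe_sequence()` are hypothetical names invented for illustration, and the model assumes that a negative i_writecount means "writes denied" and that the final fput() of a writable file always drops one write reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the inode's write counter. */
struct inode_model {
	int i_writecount; /* > 0: writers present, < 0: writes denied */
};

/* Models the idea of get_write_access(): refuse while writes are denied. */
static int model_get_write_access(struct inode_model *inode)
{
	if (inode->i_writecount < 0)
		return -ETXTBSY;
	inode->i_writecount++;
	return 0;
}

/* Models the idea of put_write_access(): drop one writer reference. */
static void model_put_write_access(struct inode_model *inode)
{
	inode->i_writecount--;
}

/*
 * Replay pipe(); open(O_RDONLY); close(p[1]); open(O_RDWR) against the
 * model and return the result of the final writable open.
 */
static int replay_pipe_sequence(bool alloc_file_takes_write_access)
{
	struct inode_model inode = { .i_writecount = 0 };
	int err;

	/* pipe(): the write end is created via alloc_file(FMODE_WRITE) */
	if (alloc_file_takes_write_access) {
		err = model_get_write_access(&inode);
		assert(err == 0);
	}

	/* ro = open(..., O_RDONLY): read-only, the counter is untouched */

	/* close(p[1]): releasing the writable file drops write access */
	model_put_write_access(&inode);

	/* rw = open(..., O_RDWR): needs write access again */
	return model_get_write_access(&inode);
}
```

Calling `replay_pipe_sequence(false)` models the current alloc_file() and `replay_pipe_sequence(true)` models the fixed one; only the latter keeps the counter balanced across close().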
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
Hi
This patch is only included for reference. It was submitted to fs-devel separately and is being worked on. However, this bug must be fixed in order to make use of memfd_create(), so I decided to include it here.
David
fs/file_table.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index 5b24008..8059d68 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 		const struct file_operations *fop)
 {
 	struct file *file;
+	int error;
 
 	file = get_empty_filp();
 	if (IS_ERR(file))
@@ -179,15 +180,23 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 	file->f_mode = mode;
 	file->f_op = fop;
 
-	/*
-	 * These mounts don't really matter in practice
-	 * for r/o bind mounts. They aren't userspace-
-	 * visible. We do this for consistency, and so
-	 * that we can do debugging checks at __fput()
-	 */
-	if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
-		file_take_write(file);
-		WARN_ON(mnt_clone_write(path->mnt));
+	if (mode & FMODE_WRITE) {
+		error = get_write_access(path->dentry->d_inode);
+		if (error) {
+			put_filp(file);
+			return ERR_PTR(error);
+		}
+
+		/*
+		 * These mounts don't really matter in practice
+		 * for r/o bind mounts. They aren't userspace-
+		 * visible. We do this for consistency, and so
+		 * that we can do debugging checks at __fput()
+		 */
+		if (!special_file(path->dentry->d_inode->i_mode)) {
+			file_take_write(file);
+			WARN_ON(mnt_clone_write(path->mnt));
+		}
 	}
 	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_inc(path->dentry->d_inode);
If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries
If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (e.g., for general-purpose IPC), sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases:
  1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy anymore. The source may have put sensitive data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy.
While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is disproportionately ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks.
This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked until this seal is removed again. Unlike locks, seals can only be modified if you own an exclusive reference to the file. Hence, if, and only if you hold a reference to a file, you can be sure that no-one else can modify the seals besides you (and you can only modify them, if you are the exclusive holder). This makes sealing useful in situations where no trust-relationship is given.
An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This currently affects only ftruncate().
  - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write().
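The per-operation effect of the three seals can be sketched as plain predicates. This is a user-space model only, not kernel code: the `model_check_*` helpers are hypothetical names, and the seal values are the ones proposed in this patch (they are redefined locally because no installed header is assumed to provide them).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Seal bits, using the values proposed in this patch. */
#define SHMEM_SEAL_SHRINK 0x0001 /* prevent file from shrinking */
#define SHMEM_SEAL_GROW   0x0002 /* prevent file from growing */
#define SHMEM_SEAL_WRITE  0x0004 /* prevent writes */

/* Resizing (ftruncate): SHRINK and GROW apply, WRITE does not. */
static int model_check_resize(uint32_t seals, int64_t oldsize, int64_t newsize)
{
	if (newsize < oldsize && (seals & SHMEM_SEAL_SHRINK))
		return -EPERM;
	if (newsize > oldsize && (seals & SHMEM_SEAL_GROW))
		return -EPERM;
	return 0;
}

/* write(): blocked by WRITE, or by GROW if it would extend the file. */
static int model_check_write(uint32_t seals, int64_t size,
			     int64_t pos, int64_t len)
{
	if (seals & SHMEM_SEAL_WRITE)
		return -EPERM;
	if ((seals & SHMEM_SEAL_GROW) && pos + len > size)
		return -EPERM;
	return 0;
}

/* mmap(): only shared mappings are refused on a WRITE-sealed file. */
static int model_check_mmap(uint32_t seals, int shared)
{
	if ((seals & SHMEM_SEAL_WRITE) && shared)
		return -EPERM;
	return 0;
}
```

Note that a WRITE seal alone leaves resizing untouched, which is why the IPC use-case combines all three seals.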
The described use-cases can easily use these seals to provide safe use without any trust-relationship:
  1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed.
  2) Both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source).
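The receiver-side checks for both use-cases boil down to a mask comparison on the value returned by SHMEM_GET_SEALS. A minimal sketch, assuming the seal values from this patch; `has_required_seals()`, `compositor_accepts()` and `ipc_receiver_accepts()` are hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seal bits, using the values proposed in this patch. */
#define SHMEM_SEAL_SHRINK 0x0001
#define SHMEM_SEAL_GROW   0x0002
#define SHMEM_SEAL_WRITE  0x0004

/* True if every seal in 'required' is present in 'seals'. */
static bool has_required_seals(uint32_t seals, uint32_t required)
{
	return (seals & required) == required;
}

/* Use-case 1: the compositor only needs the buffer not to shrink. */
static bool compositor_accepts(uint32_t seals)
{
	return has_required_seals(seals, SHMEM_SEAL_SHRINK);
}

/* Use-case 2: an IPC receiver wants a fully immutable message. */
static bool ipc_receiver_accepts(uint32_t seals)
{
	return has_required_seals(seals, SHMEM_SEAL_SHRINK |
					 SHMEM_SEAL_GROW |
					 SHMEM_SEAL_WRITE);
}
```

Extra seals beyond the required set are accepted by both checks; only missing ones cause a rejection.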
There is one exception to setting seals: Imagine a library makes use of sealing. While creating a new memory object with an FD, another thread may fork(), retaining a copy of the FD and thus also a reference. Sealing wouldn't be possible anymore, until this process closes the FDs or exec()s. To avoid this race initial seals can be set on non-exclusive FDs. This is safe as both sides can, and always have to, verify that the required set of seals is set. Once they are set, neither side can extend, reduce or modify the set of seals as long as they have no exclusive reference. Note that this exception also allows keeping read-only mmaps() around during initial sealing (mmaps() also own a reference to the file).
The new API is an extension to fcntl(), adding two new commands:
  SHMEM_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing.
  SHMEM_SET_SEALS: Change the seals of a given file. This requires WRITE access to the file. If at least one seal is already set, this also requires an exclusive reference. Note that this call will fail with EPERM if there is any active mapping with MAP_SHARED set.
The fcntl() handler is currently specific to shmem. There is no intention to support this on other file-systems, that's why the bits are prefixed with SHMEM_*. Furthermore, sealing is supported on all shmem-files. Setting seals requires write-access, so this doesn't allow any DoS attacks onto existing shmem users (just like mandatory locking).
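The rules for SHMEM_SET_SEALS described above can be condensed into a small decision function. This is a user-space sketch of the policy only; `struct seal_state` and `model_set_seals()` are hypothetical, and the three booleans stand in for the reference and mapping checks the kernel would perform.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Seal bits, using the values proposed in this patch. */
#define SHMEM_SEAL_SHRINK 0x0001
#define SHMEM_SEAL_GROW   0x0002
#define SHMEM_SEAL_WRITE  0x0004
#define SHMEM_ALL_SEALS   (SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW | SHMEM_SEAL_WRITE)

/* Hypothetical snapshot of what the kernel would check under its locks. */
struct seal_state {
	uint32_t seals;       /* seals currently set on the file */
	bool writable_fd;     /* caller's FD was opened with write access */
	bool shared_mappings; /* at least one MAP_SHARED mapping exists */
	bool other_refs;      /* somebody else holds a reference */
};

static int model_set_seals(struct seal_state *st, uint32_t seals)
{
	/* modifying seals requires write access */
	if (!st->writable_fd)
		return -EPERM;

	/* unknown seal bits are rejected */
	if (seals & ~(uint32_t)SHMEM_ALL_SEALS)
		return -EINVAL;

	/*
	 * Shared mappings always block sealing. Other references only
	 * block it once a seal is set, so initial sealing still works.
	 */
	if (st->shared_mappings || (st->other_refs && st->seals != 0))
		return -EPERM;

	st->seals = seals;
	return 0;
}
```

The asymmetry in the last check is the exception discussed earlier: the first SET_SEALS call may race with a fork(), but every later change needs an exclusive reference.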
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/fcntl.c                 |  12 ++-
 include/linux/shmem_fs.h   |  17 ++++
 include/uapi/linux/fcntl.h |  13 +++
 mm/shmem.c                 | 200 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 236 insertions(+), 6 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index ef68665..eea0b65 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -21,6 +21,7 @@ #include <linux/rcupdate.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> +#include <linux/shmem_fs.h>
#include <asm/poll.h> #include <asm/siginfo.h> @@ -248,9 +249,10 @@ static int f_getowner_uids(struct file *filp, unsigned long arg) #endif
static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, - struct file *filp) + struct fd f) { long err = -EINVAL; + struct file *filp = f.file;
switch (cmd) { case F_DUPFD: @@ -326,6 +328,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, case F_GETPIPE_SZ: err = pipe_fcntl(filp, cmd, arg); break; + case SHMEM_SET_SEALS: + case SHMEM_GET_SEALS: + err = shmem_fcntl(f, cmd, arg); + break; default: break; } @@ -360,7 +366,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
err = security_file_fcntl(f.file, cmd, arg); if (!err) - err = do_fcntl(fd, cmd, arg, f.file); + err = do_fcntl(fd, cmd, arg, f);
out1: fdput(f); @@ -397,7 +403,7 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd, (struct flock64 __user *) arg); break; default: - err = do_fcntl(fd, cmd, arg, f.file); + err = do_fcntl(fd, cmd, arg, f); break; } out1: diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 9d55438..6a3f685 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -1,6 +1,7 @@ #ifndef __SHMEM_FS_H #define __SHMEM_FS_H
+#include <linux/file.h> #include <linux/swap.h> #include <linux/mempolicy.h> #include <linux/pagemap.h> @@ -20,6 +21,7 @@ struct shmem_inode_info { struct shared_policy policy; /* NUMA memory alloc policy */ struct list_head swaplist; /* chain of maybes on swap */ struct simple_xattrs xattrs; /* list of xattrs */ + u32 seals; /* shmem seals */ struct inode vfs_inode; };
@@ -57,6 +59,21 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); extern int shmem_unuse(swp_entry_t entry, struct page *page);
+#ifdef CONFIG_SHMEM + +extern int shmem_set_seals(struct file *file, u32 seals); +extern int shmem_get_seals(struct file *file); +extern long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg); + +#else + +static inline long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg) +{ + return -EINVAL; +} + +#endif + static inline struct page *shmem_read_mapping_page( struct address_space *mapping, pgoff_t index) { diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 074b886..8f31bef 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -28,6 +28,19 @@ #define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)
/* + * Set/Get seals + */ +#define SHMEM_SET_SEALS (F_LINUX_SPECIFIC_BASE + 9) +#define SHMEM_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10) + +/* + * Types of seals + */ +#define SHMEM_SEAL_SHRINK 0x0001 /* prevent file from shrinking */ +#define SHMEM_SEAL_GROW 0x0002 /* prevent file from growing */ +#define SHMEM_SEAL_WRITE 0x0004 /* prevent writes */ + +/* * Types of directory notifications that may be requested. */ #define DN_ACCESS 0x00000001 /* File accessed */ diff --git a/mm/shmem.c b/mm/shmem.c index 1f18c9d..44d7f3b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt; #include <linux/highmem.h> #include <linux/seq_file.h> #include <linux/magic.h> +#include <linux/fcntl.h>
#include <asm/uaccess.h> #include <asm/pgtable.h> @@ -596,16 +597,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range); static int shmem_setattr(struct dentry *dentry, struct iattr *attr) { struct inode *inode = dentry->d_inode; + struct shmem_inode_info *info = SHMEM_I(inode); + loff_t oldsize = inode->i_size; + loff_t newsize = attr->ia_size; int error;
error = inode_change_ok(inode, attr); if (error) return error;
- if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { - loff_t oldsize = inode->i_size; - loff_t newsize = attr->ia_size; + /* protected by i_mutex */ + if (attr->ia_valid & ATTR_SIZE) { + if ((newsize < oldsize && (info->seals & SHMEM_SEAL_SHRINK)) || + (newsize > oldsize && (info->seals & SHMEM_SEAL_GROW))) + return -EPERM; + }
+ if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) { if (newsize != oldsize) { i_size_write(inode, newsize); inode->i_ctime = inode->i_mtime = CURRENT_TIME; @@ -1354,6 +1362,13 @@ out_nomem:
static int shmem_mmap(struct file *file, struct vm_area_struct *vma) { + struct inode *inode = file_inode(file); + struct shmem_inode_info *info = SHMEM_I(inode); + + /* protected by mmap_sem and owns additional file-reference */ + if ((info->seals & SHMEM_SEAL_WRITE) && (vma->vm_flags & VM_SHARED)) + return -EPERM; + file_accessed(file); vma->vm_ops = &shmem_vm_ops; return 0; @@ -1433,7 +1448,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping, struct page **pagep, void **fsdata) { struct inode *inode = mapping->host; + struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_CACHE_SHIFT; + + /* i_mutex is held by caller */ + if (info->seals & SHMEM_SEAL_WRITE) + return -EPERM; + if ((info->seals & SHMEM_SEAL_GROW) && pos + len > inode->i_size) + return -EPERM; + return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL); }
@@ -1802,11 +1825,171 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) return offset; }
+#define SHMEM_ALL_SEALS (SHMEM_SEAL_SHRINK | \ + SHMEM_SEAL_GROW | \ + SHMEM_SEAL_WRITE) + +int shmem_set_seals(struct file *file, u32 seals) +{ + struct dentry *dentry = file->f_path.dentry; + struct inode *inode = dentry->d_inode; + struct shmem_inode_info *info = SHMEM_I(inode); + bool has_writers, has_readers; + int r; + + /* + * SHMEM SEALING + * Sealing allows multiple parties to share a shmem-file but restrict + * access to a specific subset of file operations as long as more than + * one party has access to the inode. This way, mutually untrusted + * parties can share common memory regions with a well-defined policy. + * + * Seals can be set on any shmem-file, but always affect the whole + * underlying inode. Once a seal is set, it may prevent some kinds of + * access to the file. Currently, the following seals are defined: + * SHRINK: Prevent the file from shrinking + * GROW: Prevent the file from growing + * WRITE: Prevent write access to the file + * + * As we don't require any trust relationship between two parties, we + * cannot allow asynchronous sealing. Instead, sealing is only allowed + * if you own an exclusive reference to the shmem-file. Each FD, each + * mmap and any link increase the ref-count. So as long as you have any + * access to the file, you can be sure no-one (besides perhaps you) can + * modify the seals. + * There is one exception: Setting initial seals is allowed even if + * there are multiple references to the file (but no writable mappings + * may exist). Once *any* seal is set, removing or changing it requires + * an exclusive reference, though. + * + * The combination of SHRINK and WRITE also guarantees that any mapped + * region will not get destructed asynchronously. Even if at some point + * revoke() is supported, the region will stay mapped (maybe only + * privately) and accessible. 
+ */ + + if (file->f_op != &shmem_file_operations) + return -EBADF; + + /* require write-access to modify seals */ + if (!(file->f_mode & FMODE_WRITE)) + return -EPERM; + + if (seals & ~(u32)SHMEM_ALL_SEALS) + return -EINVAL; + + /* + * - i_mutex prevents racing write/ftruncate/fallocate/.. + * - mmap_sem prevents racing mmap() calls + * - i_lock prevents racing open() calls and new inode-refs + */ + + mutex_lock(&inode->i_mutex); + down_read(&current->mm->mmap_sem); + spin_lock(&inode->i_lock); + + /* + * Changing seals is only allowed on exclusive references. Exception is + * initial sealing, which allows other readers. We need to test for + * i_mmap_writable to prevent VM_SHARED vmas on our exclusive writer. + * i_writecount is not checked, as we explicitly allow writable FDs + * even if sealed. It's the write-operation that is blocked, not the + * writable FD itself. + * Readers are tested the same way F_SETLEASE does it. One dentry, + * inode and file ref combination is allowed. + * Note that we actually allow 2 file-refs: One is the ref in the + * file-table, the other is from the current context. + * Note: for racing dup() calls see GET_SEALS + */ + has_writers = file->f_mapping->i_mmap_writable > 0; + + has_readers = d_count(dentry) > 1 || atomic_read(&inode->i_count) > 1; + has_readers = has_readers || file_count(file) > 2; + + if (has_writers || (has_readers && info->seals != 0)) { + r = -EPERM; + } else { + info->seals = seals; + r = 0; + } + + spin_unlock(&inode->i_lock); + up_read(&current->mm->mmap_sem); + mutex_unlock(&inode->i_mutex); + + return r; +} +EXPORT_SYMBOL(shmem_set_seals); + +int shmem_get_seals(struct file *file) +{ + struct inode *inode = file_inode(file); + struct shmem_inode_info *info = SHMEM_I(inode); + unsigned long flags; + int r; + + if (file->f_op != &shmem_file_operations) + return -EBADF; + + /* + * Lock i_lock so we don't read seals between file_count() and setting + * the seals in SET_SEALS. 
Racing get_file()s could end up with an + * inconsistent view. + */ + + spin_lock_irqsave(&inode->i_lock, flags); + r = info->seals; + spin_unlock_irqrestore(&inode->i_lock, flags); + + return r; +} +EXPORT_SYMBOL(shmem_get_seals); + +long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg) +{ + long r; + + if (f.file->f_op != &shmem_file_operations) + return -EBADF; + + switch (cmd) { + case SHMEM_SET_SEALS: + /* disallow upper 32bit */ + if (arg >> 32) + return -EINVAL; + + /* + * shmem_set_seals() allows 2 file-refs, one of the owner and + * one of the current context. Make sure we have a real + * owner-ref here, otherwise the fast-path of __fdget_light + * breaks the assumptions in shmem_set_seals(). + */ + + if (!(f.flags & FDPUT_FPUT)) + get_file(f.file); + + r = shmem_set_seals(f.file, arg); + + if (!(f.flags & FDPUT_FPUT)) + fput(f.file); + break; + case SHMEM_GET_SEALS: + r = shmem_get_seals(f.file); + break; + default: + r = -EINVAL; + break; + } + + return r; +} + static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct inode *inode = file_inode(file); struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + struct shmem_inode_info *info = SHMEM_I(inode); struct shmem_falloc shmem_falloc; pgoff_t start, index, end; int error; @@ -1818,6 +2001,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t unmap_start = round_up(offset, PAGE_SIZE); loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
+ /* protected by i_mutex */ + if (info->seals & SHMEM_SEAL_WRITE) { + error = -EPERM; + goto out; + } + if ((u64)unmap_end > (u64)unmap_start) unmap_mapping_range(mapping, unmap_start, 1 + unmap_end - unmap_start, 0); @@ -1832,6 +2021,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, if (error) goto out;
+ if ((info->seals & SHMEM_SEAL_GROW) && offset + len > inode->i_size) { + error = -EPERM; + goto out; + } + start = offset >> PAGE_CACHE_SHIFT; end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; /* Try to avoid a swapstorm if len is impossible to satisfy */
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It explicitly allows sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it.
memfd_create() does not create a front-FD, but instead returns the raw shmem file, so calls like ftruncate() can be used. Also calls like fstat() will return proper information and mark the file as regular file. Sealing is explicitly supported on memfds.
Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to quotas and alike.
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h         |  1 +
 include/uapi/linux/memfd.h       |  9 ++++++
 kernel/sys_ni.c                  |  1 +
 mm/shmem.c                       | 67 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)
 create mode 100644 include/uapi/linux/memfd.h
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index 96bc506..c943b8a 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -359,3 +359,4 @@ 350 i386 finit_module sys_finit_module 351 i386 sched_setattr sys_sched_setattr 352 i386 sched_getattr sys_sched_getattr +353 i386 memfd_create sys_memfd_create diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index a12bddc..e9d56a8 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -322,6 +322,7 @@ 313 common finit_module sys_finit_module 314 common sched_setattr sys_sched_setattr 315 common sched_getattr sys_sched_getattr +316 common memfd_create sys_memfd_create
# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a747a77..124b838 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -791,6 +791,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags, asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr); asmlinkage long sys_eventfd(unsigned int count); asmlinkage long sys_eventfd2(unsigned int count, int flags); +asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags); asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int); asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *, diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h new file mode 100644 index 0000000..d74cc89 --- /dev/null +++ b/include/uapi/linux/memfd.h @@ -0,0 +1,9 @@ +#ifndef _UAPI_LINUX_MEMFD_H +#define _UAPI_LINUX_MEMFD_H + +#include <linux/types.h> + +/* flags for memfd_create(2) */ +#define MFD_CLOEXEC 0x0001 + +#endif /* _UAPI_LINUX_MEMFD_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7078052..53e05af 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -193,6 +193,7 @@ cond_syscall(compat_sys_timerfd_settime); cond_syscall(compat_sys_timerfd_gettime); cond_syscall(sys_eventfd); cond_syscall(sys_eventfd2); +cond_syscall(sys_memfd_create);
/* performance counters: */ cond_syscall(sys_perf_event_open); diff --git a/mm/shmem.c b/mm/shmem.c index 44d7f3b..48feb42 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt; #include <linux/highmem.h> #include <linux/seq_file.h> #include <linux/magic.h> +#include <linux/syscalls.h> #include <linux/fcntl.h> +#include <uapi/linux/memfd.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -3039,6 +3041,71 @@ out4:
 	return error;
 }
 
+/* maximum length of memfd names */
+#define MFD_MAX_NAMELEN 256
+
+SYSCALL_DEFINE3(memfd_create,
+		const char*, uname,
+		u64, size,
+		u64, flags)
+{
+	struct file *shm;
+	char *name;
+	int fd, r;
+	long len;
+
+	if (flags & ~(u64)MFD_CLOEXEC)
+		return -EINVAL;
+	if ((u64)(loff_t)size != size || (loff_t)size < 0)
+		return -EINVAL;
+
+	/* length includes terminating zero */
+	len = strnlen_user(uname, MFD_MAX_NAMELEN);
+	if (len <= 0)
+		return -EFAULT;
+	else if (len > MFD_MAX_NAMELEN)
+		return -EINVAL;
+
+	name = kmalloc(len + 6, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	strcpy(name, "memfd:");
+	if (copy_from_user(&name[6], uname, len)) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	/* terminating-zero may have changed after strnlen_user() returned */
+	if (name[len + 6 - 1]) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+	if (fd < 0) {
+		r = fd;
+		goto err_name;
+	}
+
+	shm = shmem_file_setup(name, size, 0);
+	if (IS_ERR(shm)) {
+		r = PTR_ERR(shm);
+		goto err_fd;
+	}
+	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+
+	fd_install(fd, shm);
+	kfree(name);
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_name:
+	kfree(name);
+	return r;
+}
+
 #else /* !CONFIG_SHMEM */
 
 /*
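The argument validation done by the memfd_create() hunk above (name length limit, "memfd:" prefix, size range) can be sketched in user space as follows. This is an illustration under assumptions: `model_memfd_name()` and `model_memfd_check_size()` are hypothetical helpers, malloc() stands in for kmalloc(), and loff_t is assumed to be a signed 64-bit type.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MFD_MAX_NAMELEN 256 /* includes the terminating zero, as in the patch */
#define MFD_NAME_PREFIX "memfd:"

/*
 * Build the prefixed name. On success, *out points to a malloc()ed
 * string that the caller must free().
 */
static int model_memfd_name(const char *uname, char **out)
{
	size_t prefix_len = strlen(MFD_NAME_PREFIX);
	size_t len = 0;
	char *name;

	/* look for the terminating zero within the allowed length */
	while (len < MFD_MAX_NAMELEN && uname[len] != '\0')
		len++;
	if (len == MFD_MAX_NAMELEN)
		return -EINVAL;
	len++; /* count the terminating zero, like the patch does */

	name = malloc(prefix_len + len);
	if (!name)
		return -ENOMEM;

	memcpy(name, MFD_NAME_PREFIX, prefix_len);
	memcpy(name + prefix_len, uname, len);
	*out = name;
	return 0;
}

/* The size must fit into a non-negative signed 64-bit offset. */
static int model_memfd_check_size(uint64_t size)
{
	return size > (uint64_t)INT64_MAX ? -EINVAL : 0;
}
```

With these limits, a 255-character name is the longest one accepted, and the resulting file name is at most 261 characters plus the terminator.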
On Wed, Mar 19, 2014 at 08:06:48PM +0100, David Herrmann wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It explicitly allows sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it.
>
> memfd_create() does not create a front-FD, but instead returns the raw shmem file, so calls like ftruncate() can be used. Also calls like fstat() will return proper information and mark the file as regular file. Sealing is explicitly supported on memfds.
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to quotas and alike.
If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files feature, Pavel?
On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
> On Wed, Mar 19, 2014 at 08:06:48PM +0100, David Herrmann wrote:
> > memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It explicitly allows sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it.
> >
> > memfd_create() does not create a front-FD, but instead returns the raw shmem file, so calls like ftruncate() can be used. Also calls like fstat() will return proper information and mark the file as regular file. Sealing is explicitly supported on memfds.
> >
> > Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to quotas and alike.
>
> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files feature, Pavel?
Thanks, Cyrill.
It is, but the map_files will work "in the opposite direction" :) In the memfd case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one should first mmap() a region, then open it via /proc/self/map_files.
But I don't know whether this matters.
Thanks, Pavel
Hi
On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov xemul@parallels.com wrote:
On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files feature, Pavel?
It is, but map_files works "in the opposite direction" :) In the memfd case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one should first mmap() a region, then open it via /proc/self/map_files.
But I don't know whether this matters.
Yes, you can replace memfd_create() so far with:

    p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
    sprintf(path, "/proc/self/map_files/%lx-%lx", p, p + size);
    fd = open(path, O_RDWR);
However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the /proc/pid/map_files/ directory is root-only (at least I get EPERM if non-root), it doesn't provide the "name" argument which is very handy for debugging, it doesn't explicitly support sealing (it requires MAP_ANON to be backed by shmem) and it's a very weird API for something this simple.
Thanks David
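A slightly fuller sketch of the map_files replacement discussed above, for illustration only. The protection flags (PROT_READ | PROT_WRITE) are an assumption, since the snippet in the mail elides them; the helper name anon_fd_via_map_files() is hypothetical; and `size` is assumed to be a multiple of the page size so that the entry name matches the mapping. As noted in the mail, the open() may fail depending on kernel configuration and privileges, so callers have to handle -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: mmap() a shared anonymous region first, then open
 * the shmem file behind it through /proc/self/map_files.
 * Returns an fd and stores the mapping in *out, or returns -1 if mmap()
 * or open() fails (e.g. map_files unavailable or insufficient privileges).
 */
static int anon_fd_via_map_files(size_t size, void **out)
{
	char path[64];
	void *p;
	int fd;

	assert(size > 0 && out != NULL);

	/* PROT_READ | PROT_WRITE is an assumption; the mail leaves it open. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Entries are named <start>-<end> in hex, without a 0x prefix. */
	snprintf(path, sizeof(path), "/proc/self/map_files/%lx-%lx",
		 (unsigned long)p, (unsigned long)p + (unsigned long)size);

	fd = open(path, O_RDWR);
	if (fd < 0) {
		munmap(p, size);
		return -1;
	}

	*out = p;
	return fd;
}
```

Compared to the memfd sketch, the order is reversed (mapping first, fd second), and there is no place to pass a name.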
On 03/20/2014 03:29 PM, David Herrmann wrote:
Yes, you can replace memfd_create() so far with:

    p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
    sprintf(path, "/proc/self/map_files/%lx-%lx", p, p + size);
    fd = open(path, O_RDWR);
However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the /proc/pid/map_files/ directory is root-only (at least I get EPERM if non-root),
Yes. But this is something we'd also like to have fixed :) Having two parties wanting the same thing makes it easier for the patch to get accepted.
it doesn't provide the "name" argument which is very handy for debugging,
What if we make mmap's shmem_zero_setup() generate a meaningful name, would it solve the debugging issue?
it doesn't explicitly support sealing (it requires MAP_ANON to be backed by shmem)
Can you elaborate on this? The fd generated by sys_memfd() will be shmem-backed, so will be the file opened via map_files link for the MAP_ANON | MAP_SHARED mapping. So what are the problems to make it support sealing?
and it's a very weird API for something this simple.
:)
Thanks, Pavel
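To illustrate the debugging point about names: with the memfd_create() that was eventually merged, the name shows up in the /proc/<pid>/fd/ symlink target as "/memfd:<name> (deleted)". The sketch below reads that link back. It assumes the later two-argument memfd_create() wrapper (glibc 2.27 or newer); the helper names and the "render-buffer" name used in the usage note are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a named memfd (two-argument, merged API). */
static int named_memfd(const char *name)
{
	assert(name != NULL);
	return memfd_create(name, MFD_CLOEXEC);
}

/*
 * Hypothetical helper: check whether the /proc/self/fd/<fd> link target
 * contains `needle`. Returns 1 if it does, 0 if not, -1 on error.
 */
static int fd_name_contains(int fd, const char *needle)
{
	char path[64];
	char target[256];
	ssize_t n;

	assert(needle != NULL);

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, target, sizeof(target) - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';	/* readlink() does not NUL-terminate */

	return strstr(target, needle) != NULL;
}
```

For example, after `fd = named_memfd("render-buffer")`, `fd_name_contains(fd, "memfd:render-buffer")` is expected to return 1, which is the kind of information that `ls -l /proc/<pid>/fd` would show when debugging.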
On 03/19/2014 12:06 PM, David Herrmann wrote:
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It explicitly allows sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems and can be used like malloc()'ed memory, but with a file-descriptor to it.
memfd_create() does not create a front-FD, but instead returns the raw shmem file, so calls like ftruncate() can be used. Also, calls like fstat() will return proper information and mark the file as a regular file. Sealing is explicitly supported on memfds.
Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to quotas and the like.
This syscall would also be useful to Android, since it would satisfy the requirement for providing atomically unlinked tmpfs fds that ashmem provides (although upstreamed solutions to ashmem's other functionalities are still needed).
My only comment is that I think memfd_* is sort of a new namespace. Since this is providing shmem files, it seems it might be better named something like shmfd_create() or my earlier suggestion of shmget_fd(). Otherwise, when talking about functionality like sealing, which is only available on shmfs, we'll have to say "shmfs/tmpfs/memfd" or risk confusing folks who might not initially grasp that it's all the same underneath.
thanks -john
On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann dh.herrmann@gmail.com wrote:
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It explicitly allows sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems and can be used like malloc()'ed memory, but with a file-descriptor to it.
memfd_create() does not create a front-FD, but instead returns the raw shmem file, so calls like ftruncate() can be used. Also, calls like fstat() will return proper information and mark the file as a regular file. Sealing is explicitly supported on memfds.
Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to quotas and the like.
Instead of adding a new syscall, we can extend the existing openat() a little bit more:
openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)
Hi
On Wed, Apr 2, 2014 at 3:38 PM, Konstantin Khlebnikov koct9i@gmail.com wrote:
Instead of adding a new syscall, we can extend the existing openat() a little bit more:
openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)
O_TMPFILE requires an existing directory as "name". So you have to use: open("/run/", O_TMPFILE | O_RDWR, 0666) instead of open("/run/new_file", O_TMPFILE | O_RDWR, 0666)
We _really_ want to set a name for the inode, though. Otherwise, debug-info via /proc/pid/fd/ is useless.
Furthermore, Linus requested to allow sealing only on files that _explicitly_ allow sealing. So v2 of this series will have MFD_ALLOW_SEALING as memfd_create() flag. I don't think we can do this with linkat() (or is that meant to be implicit for the new AT_FDSHM?). Last but not least, you now need a separate syscall to set the file-size.
I could live with most of these issues, except for the name-thing. Ideas?
Thanks David
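As a rough illustration of the opt-in sealing mentioned above: in the interface that was eventually merged, a memfd must be created with MFD_ALLOW_SEALING, and seals are added with fcntl(F_ADD_SEALS) and read with fcntl(F_GET_SEALS). This differs from the SHMEM_SET_SEALS / SHMEM_GET_SEALS commands of this series (in particular, merged seals can only be added, never cleared), so treat the sketch below as an approximation of the idea rather than the API under discussion. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of `size` bytes that opts in to
 * sealing, then seal it against shrinking, growing and writing.
 * Uses the later-merged API (MFD_ALLOW_SEALING, F_ADD_SEALS), which is
 * not the interface proposed in this series.
 * Returns the fd, or -1 on error.
 */
static int sealed_memfd(const char *name, size_t size)
{
	int fd;

	assert(name != NULL);

	/* Without MFD_ALLOW_SEALING, F_ADD_SEALS is rejected with EPERM. */
	fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)size) < 0 ||
	    fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}

/* What a receiver might check before trusting the file: 1 if all set. */
static int has_all_seals(int fd)
{
	int want = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;
	int seals = fcntl(fd, F_GET_SEALS);

	return seals >= 0 && (seals & want) == want;
}
```

Once all three seals are set, write() and any resizing ftruncate() on the file fail with EPERM, which matches the semantics described in the cover letter.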
On Wed, Apr 2, 2014 at 6:18 PM, David Herrmann dh.herrmann@gmail.com wrote:
O_TMPFILE requires an existing directory as "name". So you have to use: open("/run/", O_TMPFILE | O_RDWR, 0666) instead of open("/run/new_file", O_TMPFILE | O_RDWR, 0666)
We _really_ want to set a name for the inode, though. Otherwise, debug-info via /proc/pid/fd/ is useless.
Furthermore, Linus requested to allow sealing only on files that _explicitly_ allow sealing. So v2 of this series will have MFD_ALLOW_SEALING as memfd_create() flag. I don't think we can do this with linkat() (or is that meant to be implicit for the new AT_FDSHM?). Last but not least, you now need a separate syscall to set the file-size.
I could live with most of these issues, except for the name-thing. Ideas?
Hmm, why can't the AT_FDSHM + O_TMPFILE pair have different naming behavior? Actually, the O_TMPFILE flag is optional here. AT_FDSHM is enough, but O_TMPFILE allows moving the branching out of the common fast-paths and hiding it inside do_tmpfile.
BTW, you can set some extended attribute via fsetxattr() and distinguish files in /proc by its value.
Or you could add an fcntl() for changing the 'name' of tmpfiles. In combination with AT_FDSHM this would give a complete solution without changing the O_TMPFILE naming scheme. But one syscall turns into three. :)
On 04/02/2014 06:38 AM, Konstantin Khlebnikov wrote:
Instead of adding a new syscall, we can extend the existing openat() a little bit more:
openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)
Please don't. O_TMPFILE is a messy enough API, and the last thing we need to do is to extend it. If we want a fancy API for creating new inodes with no corresponding dentry, let's create one.
Otherwise, let's just stick with a special-purpose API for these shm files.
--Andy
Some basic tests to verify sealing on memfds works as expected and guarantees the advertised semantics.
Signed-off-by: David Herrmann dh.herrmann@gmail.com
---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 972 +++++++++++++++++++++++++++++
 4 files changed, 1004 insertions(+)
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed..c57325a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@ TARGETS = breakpoints
 TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += kcmp
+TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mqueue
 TARGETS += net
diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
new file mode 100644
index 0000000..bcc8ee2
--- /dev/null
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -0,0 +1,2 @@
+memfd_test
+memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
new file mode 100644
index 0000000..36653b9
--- /dev/null
+++ b/tools/testing/selftests/memfd/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+	ARCH := X86
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/uapi/
+CFLAGS += -I../../../../arch/x86/include/uapi/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_tests: all
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+endif
+	@./memfd_test || echo "memfd_test: [FAIL]"
+
+clean:
+	$(RM) memfd_test
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
new file mode 100644
index 0000000..41bac6f
--- /dev/null
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -0,0 +1,972 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65535
+
+static int sys_memfd_create(const char *name,
+			    __u64 size,
+			    __u64 flags)
+{
+	return syscall(__NR_memfd_create, name, size, flags);
+}
+
+static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, sz, flags);
+	if (r < 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
+		       name, (unsigned long long)sz,
+		       (unsigned long long)flags);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, size, flags);
+	if (r >= 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
+		       name, (unsigned long long)size,
+		       (unsigned long long)flags);
+		close(r);
+		abort();
+	}
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, SHMEM_GET_SEALS);
+	if (r < 0) {
+		printf("GET_SEALS(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	if (s != seals) {
+		printf("%llu != %llu = GET_SEALS(%d)\n",
+		       (unsigned long long)seals, (unsigned long long)s, fd);
+		abort();
+	}
+}
+
+static void mfd_assert_set_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, SHMEM_SET_SEALS, seals);
+	if (r < 0) {
+		printf("SET_SEALS(%d, %llu -> %llu) failed: %m\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_fail_set_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, SHMEM_SET_SEALS, seals);
+	if (r >= 0) {
+		printf("SET_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_assert_size(int fd, size_t size)
+{
+	struct stat st;
+	int r;
+
+	r = fstat(fd, &st);
+	if (r < 0) {
+		printf("fstat(%d) failed: %m\n", fd);
+		abort();
+	} else if (st.st_size != size) {
+		printf("wrong file size %lld, but expected %lld\n",
+		       (long long)st.st_size, (long long)size);
+		abort();
+	}
+}
+
+static int mfd_assert_dup(int fd)
+{
+	int r;
+
+	r = dup(fd);
+	if (r < 0) {
+		printf("dup(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static int mfd_assert_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r < 0) {
+		printf("open(%s) failed: %m\n", buf);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r >= 0) {
+		printf("open(%s) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_read(int fd)
+{
+	char buf[16];
+	void *p;
+	ssize_t l;
+
+	l = read(fd, buf, sizeof(buf));
+	if (l != sizeof(buf)) {
+		printf("read() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ *is* allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify MAP_PRIVATE is *always* allowed (even writable) */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+}
+
+static void mfd_assert_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() succeeds */
+	l = write(fd, "\0\0\0\0", 4);
+	if (l != 4) {
+		printf("write() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_READ with MAP_SHARED is allowed and a following
+	 * mprotect(PROT_WRITE) allows writing */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+	if (r < 0) {
+		printf("mprotect() failed: %m\n");
+		abort();
+	}
+
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PUNCH_HOLE works */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r < 0) {
+		printf("fallocate(PUNCH_HOLE) failed: %m\n");
+		abort();
+	}
+}
+
+static void mfd_fail_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() fails */
+	l = write(fd, "data", 4);
+	if (l != -EPERM) {
+		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_READ with MAP_SHARED is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PUNCH_HOLE fails */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r >= 0) {
+		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_shrink(int fd)
+{
+	int r, fd2;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r < 0) {
+		printf("ftruncate(SHRINK) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
+
+	fd2 = mfd_assert_open(fd,
+			      O_RDWR | O_CREAT | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+	close(fd2);
+
+	mfd_assert_size(fd, 0);
+}
+
+static void mfd_fail_shrink(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r >= 0) {
+		printf("ftruncate(SHRINK) didn't fail as expected\n");
+		abort();
+	}
+
+	mfd_fail_open(fd,
+		      O_RDWR | O_CREAT | O_TRUNC,
+		      S_IRUSR | S_IWUSR);
+}
+
+static void mfd_assert_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r < 0) {
+		printf("ftruncate(GROW) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r < 0) {
+		printf("fallocate(ALLOC) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
+}
+
+static void mfd_fail_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r >= 0) {
+		printf("ftruncate(GROW) didn't fail as expected\n");
+		abort();
+	}
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r >= 0) {
+		printf("fallocate(ALLOC) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l != sizeof(buf)) {
+		printf("pwrite() failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
+}
+
+static void mfd_fail_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l == sizeof(buf)) {
+		printf("pwrite() didn't fail as expected\n");
+		abort();
+	}
+}
+
+static int idle_thread_fn(void *arg)
+{
+	sigset_t set;
+	int sig;
+
+	/* dummy waiter; SIGTERM terminates us anyway */
+	sigemptyset(&set);
+	sigaddset(&set, SIGTERM);
+	sigwait(&set, &sig);
+
+	return 0;
+}
+
+static pid_t spawn_idle_thread(void)
+{
+	uint8_t *stack;
+	pid_t pid;
+
+	stack = malloc(STACK_SIZE);
+	if (!stack) {
+		printf("malloc(STACK_SIZE) failed: %m\n");
+		abort();
+	}
+
+	pid = clone(idle_thread_fn,
+		    stack + STACK_SIZE,
+		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		printf("clone() failed: %m\n");
+		abort();
+	}
+
+	return pid;
+}
+
+static void join_idle_thread(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+static pid_t spawn_idle_proc(void)
+{
+	pid_t pid;
+	sigset_t set;
+	int sig;
+
+	pid = fork();
+	if (pid < 0) {
+		printf("fork() failed: %m\n");
+		abort();
+	} else if (!pid) {
+		/* dummy waiter; SIGTERM terminates us anyway */
+		sigemptyset(&set);
+		sigaddset(&set, SIGTERM);
+		sigwait(&set, &sig);
+		exit(0);
+	}
+
+	return pid;
+}
+
+static void join_idle_proc(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test memfd_create() syscall
+ * Verify syscall-argument validation, including name checks, flag validation
+ * and more.
+ */
+static void test_create(void)
+{
+	char buf[2048];
+	int fd;
+
+	/* test NULL name */
+	mfd_fail_new(NULL, 0, 0);
+
+	/* test over-long name (not zero-terminated) */
+	memset(buf, 0xff, sizeof(buf));
+	mfd_fail_new(buf, 0, 0);
+
+	/* test over-long zero-terminated name */
+	memset(buf, 0xff, sizeof(buf));
+	buf[sizeof(buf) - 1] = 0;
+	mfd_fail_new(buf, 0, 0);
+
+	/* verify "" is a valid name */
+	fd = mfd_assert_new("", 0, 0);
+	close(fd);
+
+	/* verify invalid O_* open flags */
+	mfd_fail_new("", 0, 0x0100);
+	mfd_fail_new("", 0, ~MFD_CLOEXEC);
+	mfd_fail_new("", 0, ~0);
+	mfd_fail_new("", 0, 0x8000000000000000ULL);
+
+	/* verify MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
+	close(fd);
+}
+
+/*
+ * Test basic sealing
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_basic(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK |
+				 SHMEM_SEAL_GROW |
+				 SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK |
+				 SHMEM_SEAL_GROW |
+				 SHMEM_SEAL_WRITE);
+	close(fd);
+}
+
+/*
+ * Test SEAL_WRITE
+ * Test whether SEAL_WRITE actually prevents modifications.
+ */
+static void test_seal_write(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_write",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_assert_read(fd);
+	mfd_fail_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK
+ * Test whether SEAL_SHRINK actually prevents shrinking
+ */
+static void test_seal_shrink(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_shrink",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_assert_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_GROW
+ * Test whether SEAL_GROW actually prevents growing
+ */
+static void test_seal_grow(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_grow",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK | SEAL_GROW
+ * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
+ */
+static void test_seal_resize(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_resize",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test sharing via dup()
+ * Test whether seal-modifications are correctly discarded if multiple FDs for
+ * the same file exist.
+ */
+static void test_share_dup(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_dup",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	close(fd2);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	/* try again but switch FDs to test that they're equal */
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	close(fd);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+}
+
+/*
+ * Test sealing with active mmap()s
+ * Modifying seals is only allowed if no other mmap() refs exist, except for
+ * initial sealing, which allows read-only mappings. Test for the different
+ * combinations here.
+ */
+static void test_share_mmap(void)
+{
+	int fd;
+	void *p;
+
+	fd = mfd_assert_new("kern_memfd_share_mmap",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	/* shared/writable ref prevents sealing */
+	p = mfd_assert_mmap_shared(fd);
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, 0);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* readable ref allows initial sealing, but prevents modifications */
+	p = mfd_assert_mmap_private(fd);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* dropping all additional refs allows modifications again */
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	close(fd);
+}
+
+/*
+ * Test sealing with open(/proc/self/fd/%d)
+ * Via /proc we can get access to a separate file-context for the same memfd.
+ * This is *not* like dup(), but like a real separate open(). Make sure the
+ * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
+ */
+static void test_share_open(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDONLY, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	close(fd2);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	/* test that RDONLY doesn't allow setting seals, even if exclusive */
+
+	fd2 = mfd_assert_open(fd, O_RDONLY, 0);
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+
+	/* same again but with writable open */
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	close(fd);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+}
+
+/*
+ * Test sharing via fork()
+ * Test whether seal-modifications are correctly discarded if multiple FDs for
+ * the same file exist.
+ */
+static void test_share_fork(void)
+{
+	int fd;
+	pid_t pid;
+
+	fd = mfd_assert_new("kern_memfd_share_fork",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	pid = spawn_idle_proc();
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	join_idle_proc(pid);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	pid_t pid;
+
+	printf("memfd: CREATE\n");
+	test_create();
+	printf("memfd: BASIC\n");
+	test_basic();
+
+	printf("memfd: SEAL-WRITE\n");
+	test_seal_write();
+	printf("memfd: SEAL-SHRINK\n");
+	test_seal_shrink();
+	printf("memfd: SEAL-GROW\n");
+	test_seal_grow();
+	printf("memfd: SEAL-RESIZE\n");
+	test_seal_resize();
+
+	printf("memfd: SHARE-DUP\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK\n");
+	test_share_fork();
+
+	/* Run test-suite in a multi-threaded environment with a shared
+	 * file-table. This triggers the slow-path in fdget() in the kernel. */
+	pid = spawn_idle_thread();
+	printf("memfd: SHARE-DUP (shared file-table)\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP (shared file-table)\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN (shared file-table)\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK (shared file-table)\n");
+	test_share_fork();
+	join_idle_thread(pid);
+
+	printf("memfd: DONE\n");
+
+	return 0;
+}
The SHMEM_GET_SEALS and SHMEM_SET_SEALS commands allow retrieving and modifying the active set of seals on a file. They're only supported on selected file-systems (currently shmfs) and are Linux-only.
Signed-off-by: David Herrmann dh.herrmann@gmail.com
---
 man2/fcntl.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index c010a49..53d55a5 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -57,6 +57,8 @@
 .\" Document F_SETOWN_EX and F_GETOWN_EX
 .\" 2010-06-17, Michael Kerrisk
 .\" Document F_SETPIPE_SZ and F_GETPIPE_SZ.
+.\" 2014-03-19, David Herrmann dh.herrmann@gmail.com
+.\" Document SHMEM_SET_SEALS and SHMEM_GET_SEALS
 .\"
 .TH FCNTL 2 2014-02-20 "Linux" "Linux Programmer's Manual"
 .SH NAME
@@ -1064,6 +1066,94 @@ of buffer space currently used to store data produces the error
 .BR F_GETPIPE_SZ " (\fIvoid\fP; since Linux 2.6.35)"
 Return (as the function result) the capacity of the pipe referred to by
 .IR fd .
+.SS File Sealing
+Sealing files limits the set of allowed operations on a given file. For each
+seal that is set on a file, a specific set of operations will fail with
+.B EPERM
+on this file from now on. The file is said to be sealed. A file does not have
+any seals set by default. Moreover, most filesystems do not support sealing
+(only shmfs implements it right now). The following seals are available:
+.RS
+.TP
+.BR SHMEM_SEAL_SHRINK
+If this seal is set, the file in question cannot be reduced in size. This
+affects
+.BR open (2)
+with the
+.B O_TRUNC
+flag and
+.BR ftruncate (2).
+They will fail with
+.B EPERM
+if you try to shrink the file in question. Increasing the file size is still
+possible.
+.TP
+.BR SHMEM_SEAL_GROW
+If this seal is set, the size of the file in question cannot be increased. This
+affects
+.BR write (2)
+if you write across size boundaries,
+.BR ftruncate (2)
+and
+.BR fallocate (2).
+These calls will fail with
+.B EPERM
+if you use them to increase the file size or write beyond size boundaries. If
+you keep the size or shrink it, those calls still work as expected.
+.TP
+.BR SHMEM_SEAL_WRITE
+If this seal is set, you cannot modify data contents of the file. Note that
+shrinking or growing the size of the file is still possible and allowed. Thus,
+this seal is normally used in combination with one of the other seals. This seal
+affects
+.BR write (2)
+and
+.BR fallocate (2)
+(only in combination with the
+.B FALLOC_FL_PUNCH_HOLE
+flag). Those calls will fail with
+.B EPERM
+if this seal is set. Furthermore, trying to create new memory-mappings via
+.BR mmap (2)
+in combination with
+.B MAP_SHARED
+will also fail with
+.BR EPERM .
+.RE
+.TP
+.BR SHMEM_SET_SEALS " (\fIint\fP; since Linux TBD)"
+Change the set of seals of the file referred to by
+.I fd
+to
+.IR arg .
+You are required to own an exclusive reference to the file in question in order
+to modify the seals. Otherwise, this call will fail with
+.BR EPERM .
+There is one exception: If no seals are set, this restriction does not apply and
+you can set seals even if you don't own an exclusive reference. However, in any
+case there may not exist any shared writable mapping or this call will always
+fail with
+.BR EPERM .
+These semantics guarantee that once you verified a specific set of seals is set
+on a given file, nobody besides you (in case you own an exclusive reference) can
+modify the seals, anymore.
+
+You own an exclusive reference to a file if, and only if, the file-descriptor
+passed to
+.BR fcntl (2)
+is the only reference to the underlying inode. There must not be any duplicates
+of this file-descriptor, no other open files to the same underlying inode, no
+hard-links or any active memory mappings.
+.TP
+.BR SHMEM_GET_SEALS " (\fIvoid\fP; since Linux TBD)"
+Return (as the function result) the current set of seals of the file referred to
+by
+.IR fd .
+If no seals are set, 0 is returned. If the file does not support sealing, -1 is
+returned and
+.I errno
+is set to
+.BR EINVAL .
 .SH RETURN VALUE
 For a successful call, the return value depends on the operation:
 .TP 0.9i
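To make the per-seal behaviour described in the man page text above concrete, here is a minimal sketch. It does not use the SHMEM_SET_SEALS / SHMEM_SEAL_* names proposed in this patch, since those are not available in any released headers. Instead it assumes the interface that was later merged into mainline Linux: memfd_create(2) with MFD_ALLOW_SEALING, and the fcntl(2) commands F_ADD_SEALS / F_GET_SEALS with F_SEAL_* constants. The function name demo_seal_effects is made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a 4096-byte memfd, seal it against shrinking, growing and writing
 * (mainline F_SEAL_* names, assumed here in place of the proposed SHMEM_*),
 * then confirm that each forbidden operation fails with EPERM.
 * Returns 0 if every check behaves as expected, -1 otherwise.
 */
int demo_seal_effects(void)
{
	int ok = -1;
	int fd = memfd_create("seal_demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0)
		goto out;
	if (fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0)
		goto out;

	/* Shrinking is rejected by the shrink seal. */
	if (ftruncate(fd, 0) != -1 || errno != EPERM)
		goto out;
	/* Growing is rejected by the grow seal. */
	if (ftruncate(fd, 8192) != -1 || errno != EPERM)
		goto out;
	/* Writing inside the existing size is rejected by the write seal. */
	if (pwrite(fd, "x", 1, 0) != -1 || errno != EPERM)
		goto out;
	/* A new shared writable mapping is rejected as well. */
	if (mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
	    != MAP_FAILED || errno != EPERM)
		goto out;
	ok = 0;
out:
	close(fd);
	return ok;
}
```

Calling demo_seal_effects() from a small main() and checking for a 0 return is enough to try this on a kernel and libc that provide memfd_create().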
The memfd_create() syscall creates anonymous files similar to O_TMPFILE but does not require an active mount-point.
Signed-off-by: David Herrmann dh.herrmann@gmail.com
---
 man2/memfd_create.2 | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 man2/memfd_create.2

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
new file mode 100644
index 0000000..3e362e0
--- /dev/null
+++ b/man2/memfd_create.2
@@ -0,0 +1,110 @@
+.\" Copyright (C) 2014 David Herrmann dh.herrmann@gmail.com
+.\" starting from a version by Michael Kerrisk mtk.manpages@gmail.com
+.\"
+.\" %%%LICENSE_START(GPLv2+_SW_3_PARA)
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" http://www.gnu.org/licenses/.
+.\" %%%LICENSE_END
+.\"
+.TH MEMFD_CREATE 2 2014-03-18 Linux "Linux Programmer's Manual"
+.SH NAME
+memfd_create - create an anonymous file
+.SH SYNOPSIS
+.B #include <sys/memfd.h>
+.sp
+.BI "int memfd_create(const char *" name ", u64 " size ", u64 " flags ");"
+.SH DESCRIPTION
+.BR memfd_create ()
+creates an anonymous file and returns a file-descriptor to it. The file behaves
+like regular files, thus can be modified, truncated, memory-mapped and more.
+However, unlike regular files it lives in main memory and has no non-volatile
+backing storage. Once all references to the file are dropped, it is
+automatically released. Like all shmem-based files, memfd files support
+.BR SHMEM
+sealing parameters. See
+.BR SHMEM_SET_SEALS " with " fcntl (2)
+for more information.
+
+The initial size of the file is set to
+.IR size ". " name
+is used as internal file-name and will occur as such in
+.IR /proc/self/fd/ .
+The name is always prefixed with
+.BR memfd:
+and serves only debugging purposes.
+
+The following values may be bitwise ORed in
+.IR flags
+to change the behaviour of
+.BR memfd_create ():
+.TP
+.BR MFD_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.PP
+Unused bits must be cleared to 0.
+
+As its return value,
+.BR memfd_create ()
+returns a new file descriptor that can be used to refer to the file.
+A copy of the file descriptor created by
+.BR memfd_create ()
+is inherited by the child produced by
+.BR fork (2).
+The duplicate file descriptor is associated with the same file.
+File descriptors created by
+.BR memfd_create ()
+are preserved across
+.BR execve (2),
+unless the close-on-exec flag has been set.
+.SH RETURN VALUE
+On success,
+.BR memfd_create ()
+returns a new file descriptor.
+On error, -1 is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EINVAL
+An unsupported value was specified in one of the arguments.
+.TP
+.B EMFILE
+The per-process limit on open file descriptors has been reached.
+.TP
+.B ENFILE
+The system-wide limit on the total number of open files has been
+reached.
+.TP
+.B EFAULT
+The name given in
+.IR name
+points to invalid memory.
+.TP
+.B ENOMEM
+There was insufficient memory to create a new anonymous file.
+.SH VERSIONS
+to-be-defined
+.SH CONFORMING TO
+.BR memfd_create ()
+is Linux-specific.
+.SH SEE ALSO
+.BR shmget (2),
+.BR fcntl (2),
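As a rough illustration of the "memfd:" name prefix described in the man page text above, the sketch below creates an anonymous file and reads back the link target under /proc/self/fd/. It assumes the memfd_create() wrapper that later shipped in glibc, whose signature is (name, flags) with no size argument, so the size is set with ftruncate() instead of the three-argument form proposed here. The helper name memfd_link_name is hypothetical.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an anonymous file, size it, and read back the name the kernel
 * reports for it under /proc/self/fd/. On success the link target is
 * copied into buf (NUL-terminated) and 0 is returned; -1 on any error.
 * buflen must be at least 1.
 */
int memfd_link_name(const char *name, off_t size, char *buf, size_t buflen)
{
	char path[64];
	ssize_t n;
	int fd = memfd_create(name, MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	/* The merged syscall has no size argument; the file starts empty. */
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, buf, buflen - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}
```

Printing buf after a successful call shows how the name appears in procfs, which is the debugging aid the man page text refers to.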
On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
Hi
This series introduces the concept of "file sealing". Sealing a file restricts the set of allowed operations on the file in question. Multiple seals are defined and each seal will cause a different set of operations to return EPERM if it is set. The following seals are introduced:
- SEAL_SHRINK: If set, the inode size cannot be reduced
- SEAL_GROW: If set, the inode size cannot be increased
- SEAL_WRITE: If set, the file content cannot be modified
Unlike existing techniques that provide similar protection, sealing allows file-sharing without any trust-relationship. This is enforced by rejecting seal modifications if you don't own an exclusive reference to the given file. So if you own a file-descriptor, you can be sure that no-one besides you can modify the seals on the given file. This allows mapping shared files from untrusted parties without the fear of the file getting truncated or modified by an attacker.
Several use-cases exist that could make great use of sealing:
Graphics Compositors
If a graphics client creates a memory-backed render-buffer and passes a file-descriptor to it to the graphics server for display, the server _has_ to set up SIGBUS handlers whenever mapping the given file. Otherwise, the client might run ftruncate() or O_TRUNC on the file in parallel, thus crashing the server. With sealing, a compositor can reject any incoming file-descriptor that does _not_ have SEAL_SHRINK set. This way, any memory-mappings are guaranteed to stay accessible. Furthermore, we still allow clients to increase the buffer-size in case they want to resize the render-buffer for the next frame. We also allow parallel writes so the client can render new frames into the same buffer (the client is responsible for never rendering into a front-buffer if it wants to avoid artifacts).
Real use-case: Wayland wl_shm buffers can be transparently converted
Very nice, the Enlightenment developers have been asking for something like this for a while, it should help them out a lot as well.
And thanks for the man pages and test code, if only all new apis came with that already...
greg k-h
On Wed, Mar 19, 2014 at 12:06 PM, David Herrmann dh.herrmann@gmail.com wrote:
Unlike existing techniques that provide similar protection, sealing allows file-sharing without any trust-relationship. This is enforced by rejecting seal modifications if you don't own an exclusive reference to the given file.
I like the concept, but I really hate that "exclusive reference" approach. I see why you did it, but I also worry that it means that people can open random shm files that are *not* expected to be sealed, and screw up applications that don't expect it.
Is there really any use-case where the sealer isn't also the same thing that *created* the file in the first place? Because I would be a ton happier with the notion that you can only seal things that you yourself created. At that point, the exclusive reference isn't such a big deal any more, but more importantly, you can't play random denial-of-service games on files that aren't really yours.
The fact that you bring up the races involved with the exclusive reference approach also just makes me go "Is that really the correct security model"?
Linus
Hi
On Thu, Mar 20, 2014 at 4:49 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
Is there really any use-case where the sealer isn't also the same thing that *created* the file in the first place? Because I would be a ton happier with the notion that you can only seal things that you yourself created. At that point, the exclusive reference isn't such a big deal any more, but more importantly, you can't play random denial-of-service games on files that aren't really yours.
My first idea was to add MFD_ALLOW_SEALING as memfd_create() flag, which enables the sealing-API for that file. Then I looked at POSIX mandatory locking and noticed that it provides similar restrictions on _all_ files. Mandatory locks can be more easily removed, but an attacker could just re-apply them in a loop, so that's not really an argument. Furthermore, sealing requires _write_ access so I wonder what kind of DoS attacks are possible with sealing that are not already possible with write access? And sealing is only possible if no writable, shared mapping exists. So even if an attacker seals a file, all that happens is EPERM, not SIGBUS (still a possible denial-of-service scenario).
But I understand that it is quite hard to review all the possible scenarios. So I'm fine with checking inode-ownership permissions for SET_SEALS. We could also make sealing a one-shot operation. Given that in a no-trust situation there is never a guarantee that the other side drops its references, re-using a sealed file is usually not possible. However, in sane environments, this could be a nice optimization in case the other side plays along. The one-shot semantics would allow dropping reference-checks entirely. The inode-ownership semantics would still require it.
Thanks David
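The two ideas floated in the message above, an opt-in flag at creation time and one-shot seals that cannot be undone, can be sketched against the interface that eventually landed in mainline. This is an assumption relative to the proposal under discussion: the flag is spelled MFD_ALLOW_SEALING, seals are added with F_ADD_SEALS (there is no way to clear them), and F_SEAL_SEAL freezes the seal set. The function name is illustrative only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if sealing behaves as opt-in and add-only, -1 otherwise. */
int demo_opt_in_and_one_shot(void)
{
	int plain, sealable, seals, ok = -1;

	plain = memfd_create("plain", MFD_CLOEXEC);
	if (plain < 0)
		return -1;
	sealable = memfd_create("sealable", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (sealable < 0) {
		close(plain);
		return -1;
	}

	/* Without the opt-in flag the file refuses new seals. */
	if (fcntl(plain, F_ADD_SEALS, F_SEAL_SHRINK) != -1 || errno != EPERM)
		goto out;

	/* With the flag, seals can be added ... */
	if (fcntl(sealable, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		goto out;
	/* ... and F_SEAL_SEAL freezes the seal set for good. */
	if (fcntl(sealable, F_ADD_SEALS, F_SEAL_SEAL) < 0)
		goto out;
	if (fcntl(sealable, F_ADD_SEALS, F_SEAL_GROW) != -1 || errno != EPERM)
		goto out;

	/* The shrink seal is still set, the rejected grow seal is not. */
	seals = fcntl(sealable, F_GET_SEALS);
	if (seals < 0 || !(seals & F_SEAL_SHRINK) || (seals & F_SEAL_GROW))
		goto out;
	ok = 0;
out:
	close(sealable);
	close(plain);
	return ok;
}
```

Because seals are add-only in this model, no reference counting is needed to decide whether a seal may be removed, which matches the remark above that one-shot semantics would allow dropping the reference checks.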
My first idea was to add MFD_ALLOW_SEALING as memfd_create() flag, which enables the sealing-API for that file. Then I looked at POSIX
This actually seems the most sensible to me. The reason being that if I have some existing used object there is no way on earth I can be sure who has existing references to it, and we don't have revoke() to fix that.
So it pretty much has to be a new object in a sane programming model.
mandatory locking and noticed that it provides similar restrictions on _all_ files. Mandatory locks can be more easily removed, but an
The fact someone got it past a standards body doesn't make it a good idea.
attacker could just re-apply them in a loop, so that's not really an argument. Furthermore, sealing requires _write_ access so I wonder what kind of DoS attacks are possible with sealing that are not already possible with write access? And sealing is only possible if no writable, shared mapping exists. So even if an attacker seals a file, all that happens is EPERM, not SIGBUS (still a possible denial-of-service scenario).
I think you want two things at minimum
- owner to seal
- root can always override
I would query the name too. Right now your assumption is 'shmem only' but that might change with other future use cases or types (eg some driver file handles) so SHMEM_ in the fcntl might become misleading.
Whether you want some way to undo a seal without an exclusive reference as the file owner is another question.
Alan
Hi
On Thu, Mar 20, 2014 at 3:41 PM, One Thousand Gnomes gnomes@lxorguk.ukuu.org.uk wrote:
I think you want two things at minimum
- owner to seal
- root can always override
Why should root be allowed to override?
I would query the name too. Right now your assumption is 'shmem only' but that might change with other future use cases or types (eg some driver file handles) so SHMEM_ in the fcntl might become misleading.
I'm fine with F_SET/GET_SEALS. But given you suggested requiring MFD_ALLOW_SEALS for sealing, I don't see why we couldn't limit this interface entirely to memfd_create().
Whether you want some way to undo a seal without an exclusive reference as the file owner is another question.
No. You are never allowed to undo a seal but with an exclusive reference. This interface was created for situations _without_ any trust relationship. So if the owner is allowed to undo seals, the interface doesn't make any sense. The only options I see are to not allow un-sealing at all (which I'm fine with) or tracking users (which is way too much overhead).
Thanks David
On Thu, 20 Mar 2014 16:12:54 +0100 David Herrmann dh.herrmann@gmail.com wrote:
Hi
On Thu, Mar 20, 2014 at 3:41 PM, One Thousand Gnomes gnomes@lxorguk.ukuu.org.uk wrote:
I think you want two things at minimum
- owner to seal
- root can always override
Why should root be allowed to override?
Because root can already override it by say mmapping the kernel memory and patching. It also tends to be valuable for debugging horrible problems with complex systems.
Imposing fake restrictions on root just causes annoyance when doing stuff like debugging but doesn't actually change the security situation.
I'm fine with F_SET/GET_SEALS. But given you suggested requiring MFD_ALLOW_SEALS for sealing, I don't see why we couldn't limit this interface entirely to memfd_create().
But if someone does find a non-memfd use for it then it's useful not to have to go "this fcntl for memfd, that fcntl for the other"
Just planning ahead.
Whether you want some way to undo a seal without an exclusive reference as the file owner is another question.
No. You are never allowed to undo a seal but with an exclusive reference. This interface was created for situations _without_ any trust relationship. So if the owner is allowed to undo seals, the interface doesn't make any sense. The only options I see are to not allow un-sealing at all (which I'm fine with) or tracking users (which is way too much overhead).
Ok - that makes sense
On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
This series introduces the concept of "file sealing". Sealing a file restricts the set of allowed operations on the file in question. Multiple seals are defined and each seal will cause a different set of operations to return EPERM if it is set. The following seals are introduced:
- SEAL_SHRINK: If set, the inode size cannot be reduced
- SEAL_GROW: If set, the inode size cannot be increased
- SEAL_WRITE: If set, the file content cannot be modified
Looking at your patches, and what files you are modifying, you are enforcing this in the low-level file system.
Why not make sealing an attribute of the "struct file", and enforce it at the VFS layer? That way all file system objects would have access to sealing interface, and for memfd_shmem, you can't get another struct file pointing at the object, the security properties would be identical.
Cheers,
- Ted
On Thu, 20 Mar 2014 11:32:51 -0400 tytso@mit.edu wrote:
On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
This series introduces the concept of "file sealing". Sealing a file restricts the set of allowed operations on the file in question. Multiple seals are defined and each seal will cause a different set of operations to return EPERM if it is set. The following seals are introduced:
- SEAL_SHRINK: If set, the inode size cannot be reduced
- SEAL_GROW: If set, the inode size cannot be increased
- SEAL_WRITE: If set, the file content cannot be modified
Looking at your patches, and what files you are modifying, you are enforcing this in the low-level file system.
Why not make sealing an attribute of the "struct file", and enforce it at the VFS layer? That way all file system objects would have access to sealing interface, and for memfd_shmem, you can't get another struct file pointing at the object, the security properties would be identical.
Would it be more sensible to have a "sealer" which is a "device" which you give a file handle to and it gives you back a sealable one?
So for the memfd case you'd create a private handle, pass it to the sealer, and then pass the sealer handles to everyone else.
You have to implicitly trust the creator of the object has
- handed you the object you expect
- sealed it
so that appears no weaker but means you can meaningfully create sealed versions of arbitrary objects and, if you want, have non-sealed ones around, with it being up to the creator if they want, for example, to simply close the unsealed one immediately afterwards.
Alan
Hi
On Thu, Mar 20, 2014 at 4:32 PM, tytso@mit.edu wrote:
Why not make sealing an attribute of the "struct file", and enforce it at the VFS layer? That way all file system objects would have access to sealing interface, and for memfd_shmem, you can't get another struct file pointing at the object, the security properties would be identical.
Sealing as introduced here is an inode-attribute, not "struct file". This is intentional. For instance, a gfx-client can get a read-only FD via /proc/self/fd/ and pass it to the compositor so it can never overwrite the contents (unless the compositor has write-access to the inode itself, in which case it can just re-open it read-write).
Furthermore, I don't see any use-case besides memfd for sealing, so I purposely avoided changing core VFS interfaces. Protecting page-allocation/access for SEAL_WRITE like I do in shmem.c is not that easy to do generically. So if we moved this interface to "struct inode", all that would change is moving "u32 seals;" from one struct to the other. Ok, some protections might get easily implemented generically, but without proper insight into the underlying implementation, I couldn't verify all paths and possible races. Isn't keeping the API generic enough so far? Changing the underlying implementation can be done once we know what we want.
Thanks David
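The point that seals belong to the inode rather than to a single "struct file" can be checked from user space. The sketch below assumes the mainline F_ADD_SEALS / F_GET_SEALS commands rather than the SHMEM_* names from this series, and the function name is made up. It opens a second, read-only descriptor through /proc/self/fd/, then shows that a seal added through the writable descriptor is visible through the read-only one, while the read-only one cannot add seals itself.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if seals are shared across descriptors of one inode, else -1. */
int demo_seals_follow_inode(void)
{
	char path[64];
	int rw, ro, seals, ok = -1;

	rw = memfd_create("inode_demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (rw < 0)
		return -1;

	/* Second open file description for the same inode, read-only. */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", rw);
	ro = open(path, O_RDONLY | O_CLOEXEC);
	if (ro < 0) {
		close(rw);
		return -1;
	}

	/* A read-only descriptor is not allowed to add seals. */
	if (fcntl(ro, F_ADD_SEALS, F_SEAL_SHRINK) != -1 || errno != EPERM)
		goto out;
	/* The writable descriptor can. */
	if (fcntl(rw, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		goto out;
	/* The seal is visible through the read-only descriptor too. */
	seals = fcntl(ro, F_GET_SEALS);
	if (seals < 0 || !(seals & F_SEAL_SHRINK))
		goto out;
	ok = 0;
out:
	close(ro);
	close(rw);
	return ok;
}
```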
On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
On Thu, Mar 20, 2014 at 4:32 PM, tytso@mit.edu wrote:
Why not make sealing an attribute of the "struct file", and enforce it at the VFS layer? That way all file system objects would have access to sealing interface, and for memfd_shmem, you can't get another struct file pointing at the object, the security properties would be identical.
Sealing as introduced here is an inode-attribute, not "struct file". This is intentional. For instance, a gfx-client can get a read-only FD via /proc/self/fd/ and pass it to the compositor so it can never overwrite the contents (unless the compositor has write-access to the inode itself, in which case it can just re-open it read-write).
Hmm, good point. I had forgotten about the /proc/self/fd hole. Hmm... what if we have a SEAL_PROC which forces the permissions of /proc/self/fd to be 000?
So if it is a property of the attribute, SEAL_WRITE and SEAL_GROW is basically equivalent to using chattr to set the immutable and append-only attribute, except for the "you can't undo the seal unless you have exclusive access to the inode" magic.
That does make it pretty memfd_create specific and not a very general API, which is a little unfortunate; hence why I'm trying to explore ways of making it a bit more generic and hopefully useful for more use cases.
Cheers,
- Ted
On 03/20/2014 09:38 AM, tytso@mit.edu wrote:
On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
On Thu, Mar 20, 2014 at 4:32 PM, tytso@mit.edu wrote:
Why not make sealing an attribute of the "struct file", and enforce it at the VFS layer? That way all file system objects would have access to sealing interface, and for memfd_shmem, you can't get another struct file pointing at the object, the security properties would be identical.
Sealing as introduced here is an inode-attribute, not "struct file". This is intentional. For instance, a gfx-client can get a read-only FD via /proc/self/fd/ and pass it to the compositor so it can never overwrite the contents (unless the compositor has write-access to the inode itself, in which case it can just re-open it read-write).
Hmm, good point. I had forgotten about the /proc/self/fd hole. Hmm... what if we have a SEAL_PROC which forces the permissions of /proc/self/fd to be 000?
This is the second time in a week that someone has asked for a way to have a struct file (or struct inode or whatever) that can't be reopened through /proc/pid/fd. This should be quite easy to implement as a separate feature.
Actually, that feature would solve a major pet peeve of mine, I think: I want something like memfd that allows me to keep the thing read-write but that whomever I pass the fd to can't change. With this feature, I could do:
fd_rw = memfd_create (or O_TMPFILE or whatever)
fd_ro = open(/proc/self/fd/fd_rw, O_RDONLY);
fcntl(fd_ro, F_RESTRICT, F_RESTRICT_REOPEN);
send fd_ro via SCM_RIGHTS.
To really make this work well, I also want to SEAL_SHRINK the inode so that the receiver can verify that I'm not going to truncate the file out from under it.
Bingo, fast and secure one-way IPC.
--Andy
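The receiver-side check mentioned above ("the receiver can verify that I'm not going to truncate the file out from under it") might look roughly like the following. It is a sketch under the assumption that the mainline F_GET_SEALS command and F_SEAL_SHRINK constant are available; the F_RESTRICT idea from the message above is a proposal and is not used here. Both helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Returns 1 if fd carries the shrink seal, 0 otherwise. */
int fd_is_shrink_sealed(int fd)
{
	int seals = fcntl(fd, F_GET_SEALS);

	/* A failure here means the file does not support sealing. */
	if (seals < 0)
		return 0;
	return (seals & F_SEAL_SHRINK) != 0;
}

/*
 * Map a received fd read-only, but only if it cannot be truncated behind
 * our back. Returns the mapping and stores its length in *len_out, or
 * returns NULL if the fd is rejected or cannot be mapped.
 */
void *map_if_shrink_sealed(int fd, size_t *len_out)
{
	struct stat st;
	void *p;

	if (!fd_is_shrink_sealed(fd))
		return NULL;
	if (fstat(fd, &st) < 0 || st.st_size == 0)
		return NULL;
	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return NULL;
	*len_out = (size_t)st.st_size;
	return p;
}
```

A receiver would call map_if_shrink_sealed() on every descriptor that arrives via SCM_RIGHTS and drop those for which it returns NULL, instead of installing a SIGBUS handler.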
On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
This is the second time in a week that someone has asked for a way to have a struct file (or struct inode or whatever) that can't be reopened through /proc/pid/fd. This should be quite easy to implement as a separate feature.
What I suggested on a different thread was to add the following new file descriptor flags, to join FD_CLOEXEC, which would be manipulated using the F_GETFD and F_SETFD fcntl commands:
FD_NOPROCFS disallow being able to open the inode via /proc/<pid>/fd
FD_NOPASSFD disallow being able to pass the fd via a unix domain socket
FD_LOCKFLAGS if this bit is set, disallow any further changes of FD_CLOEXEC, FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.
Regardless of what else we might need to meet the use case for the proposed File Sealing API, I think this is a useful feature that could be used in many other contexts besides just the proposed memfd_create() use case.
Cheers,
- Ted
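For readers unfamiliar with how descriptor flags are handled today, the sketch below shows the existing F_GETFD / F_SETFD read-modify-write pattern with FD_CLOEXEC, the one flag in that family that exists at present. The proposed FD_NOPROCFS, FD_NOPASSFD and FD_LOCKFLAGS flags are not defined anywhere, so they are deliberately left out; presumably they would be toggled the same way. The helper name set_cloexec is illustrative.

```c
#include <fcntl.h>

/*
 * Set (on != 0) or clear (on == 0) the close-on-exec flag of fd while
 * leaving any other descriptor flags untouched.
 * Returns 0 on success, -1 on error with errno set by fcntl().
 */
int set_cloexec(int fd, int on)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	if (on)
		flags |= FD_CLOEXEC;
	else
		flags &= ~FD_CLOEXEC;
	return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```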
On Thu, Apr 10, 2014 at 1:32 PM, Theodore Ts'o tytso@mit.edu wrote:
On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
This is the second time in a week that someone has asked for a way to have a struct file (or struct inode or whatever) that can't be reopened through /proc/pid/fd. This should be quite easy to implement as a separate feature.
What I suggested on a different thread was to add the following new file descriptor flags, to join FD_CLOEXEC, which would be manipulated using the F_GETFD and F_SETFD fcntl commands:
FD_NOPROCFS disallow being able to open the inode via /proc/<pid>/fd
FD_NOPASSFD disallow being able to pass the fd via a unix domain socket
FD_LOCKFLAGS if this bit is set, disallow any further changes of FD_CLOEXEC, FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.
Regardless of what else we might need to meet the use case for the proposed File Sealing API, I think this is a useful feature that could be used in many other contexts besides just the proposed memfd_create() use case.
It occurs to me that, before going nuts with these kinds of flags, it may pay to just try to fix the /proc/self/fd issue for real -- we could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is read-only. That may be enough for the file sealing thing.
--Andy
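The /proc/self/fd behaviour being discussed can be reproduced with a short experiment. The sketch below is an illustration, not a statement about every kernel: it assumes a mounted /proc, the memfd_create() wrapper from glibc, and that the reopen is checked only against the inode's mode and ownership. It takes a read-only descriptor and tries to reopen it read-write through procfs. Function names are made up.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Open /proc/self/fd/<fd> with the given flags; returns a new fd or -1. */
static int reopen_via_proc(int fd, int flags)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, flags | O_CLOEXEC);
}

/*
 * Obtain a read-only fd for a memfd, then try to turn it back into a
 * read-write fd via procfs. Returns the read-write fd (caller closes it)
 * or -1 if any step fails.
 */
int upgrade_readonly_fd(void)
{
	int rw, ro, upgraded;

	rw = memfd_create("reopen_demo", MFD_CLOEXEC);
	if (rw < 0)
		return -1;
	ro = reopen_via_proc(rw, O_RDONLY);
	close(rw);
	if (ro < 0)
		return -1;

	/* The descriptor itself is read-only: write() is refused. */
	if (write(ro, "x", 1) != -1 || errno != EBADF) {
		close(ro);
		return -1;
	}

	/* Only a read-only descriptor is left, yet O_RDWR is requested. */
	upgraded = reopen_via_proc(ro, O_RDWR);
	close(ro);
	return upgraded;
}
```

If upgrade_readonly_fd() returns a valid descriptor, the read-only restriction was a property of the descriptor only, which is the gap the message above suggests closing.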
Hi
On Thu, Apr 10, 2014 at 10:37 PM, Andy Lutomirski luto@amacapital.net wrote:
It occurs to me that, before going nuts with these kinds of flags, it may pay to just try to fix the /proc/self/fd issue for real -- we could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is read-only. That may be enough for the file sealing thing.
For the sealing API, none of this is needed. As long as the inode is owned by the uid who creates the memfd, you can pass it around and no-one besides root and you can open /proc/self/fd/$fd (assuming chmod 700). If you share the fd with someone with the same uid as you, you're screwed anyway. We don't protect users against themselves (I mean, they can ptrace you, or kill()..). Therefore, I'm not really convinced that we want this for memfd. At least no-one has provided a _proper_ use-case for this so far.
Thanks David
On Thu, Apr 10, 2014 at 1:49 PM, David Herrmann dh.herrmann@gmail.com wrote:
Hi
On Thu, Apr 10, 2014 at 10:37 PM, Andy Lutomirski luto@amacapital.net wrote:
It occurs to me that, before going nuts with these kinds of flags, it may pay to just try to fix the /proc/self/fd issue for real -- we could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is read-only. That may be enough for the file sealing thing.
For the sealing API, none of this is needed. As long as the inode is owned by the uid who creates the memfd, you can pass it around and no-one besides root and you can open /proc/self/fd/$fd (assuming chmod 700). If you share the fd with someone with the same uid as you, you're screwed anyway. We don't protect users against themselves (I mean, they can ptrace you, or kill()..). Therefore, I'm not really convinced that we want this for memfd. At least no-one has provided a _proper_ use-case for this so far.
Hmm. Fair enough.
Would it make sense for the initial mode on a memfd inode to be 000? Anyone who finds this to be problematic could use fchmod to fix it.
I might even go so far as to suggest that the default uid on the inode should be 0 (i.e. global root), since there is the odd corner case of root setting euid != 0, creating a memfd, and setting euid back to 0. The latter might cause resource accounting issues, though.
--Andy
Hi
On Thu, Apr 10, 2014 at 11:16 PM, Andy Lutomirski luto@amacapital.net wrote:
Would it make sense for the initial mode on a memfd inode to be 000? Anyone who finds this to be problematic could use fchmod to fix it.
memfd_create() should be subject to umask() just like anything else. That should solve any possible race here, right?
Thanks David
On Thu, Apr 10, 2014 at 3:57 PM, David Herrmann dh.herrmann@gmail.com wrote:
Hi
On Thu, Apr 10, 2014 at 11:16 PM, Andy Lutomirski luto@amacapital.net wrote:
Would it make sense for the initial mode on a memfd inode to be 000? Anyone who finds this to be problematic could use fchmod to fix it.
memfd_create() should be subject to umask() just like anything else. That should solve any possible race here, right?
Yes, but how many people will actually think about umask when doing things that don't really look like creating files?
/proc/pid/fd is a really weird corner case in which the mode of an inode that doesn't have a name matters. I suspect that almost no one will ever want to open one of these things out of /proc/self/fd, and those who do should be made to think about it.
It also avoids odd screwups where things are secure until someone runs them with umask 000.
--Andy
Hi
On Fri, Apr 11, 2014 at 1:05 AM, Andy Lutomirski luto@amacapital.net wrote:
/proc/pid/fd is a really weird corner case in which the mode of an inode that doesn't have a name matters. I suspect that almost no one will ever want to open one of these things out of /proc/self/fd, and those who do should be made to think about it.
I'm arguing in the context of memfd, and there's no security leak if people get access to the underlying inode (at least I'm not aware of any). As I said, context information is attached to the inode, not file context, so I'm fine if people want to open multiple file contexts via /proc. If someone wants to forbid open(), I want to hear _why_. I assume the memfd object has uid==uid-of-creator and mode==(777 & ~umask) (which usually results in X00, so no access for non-owners). I cannot see how /proc is a security issue here.
Thanks David
On Thu, Apr 10, 2014 at 4:16 PM, David Herrmann dh.herrmann@gmail.com wrote:
Hi
On Fri, Apr 11, 2014 at 1:05 AM, Andy Lutomirski luto@amacapital.net wrote:
/proc/pid/fd is a really weird corner case in which the mode of an inode that doesn't have a name matters. I suspect that almost no one will ever want to open one of these things out of /proc/self/fd, and those who do should be made to think about it.
I'm arguing in the context of memfd, and there's no security leak if people get access to the underlying inode (at least I'm not aware of any).
I'm not sure what you mean.
As I said, context information is attached to the inode, not file context, so I'm fine if people want to open multiple file contexts via /proc. If someone wants to forbid open(), I want to hear _why_. I assume the memfd object has uid==uid-of-creator and mode==(777 & ~umask) (which usually results in X00, so no access for non-owners). I cannot see how /proc is a security issue here.
On further reflection, my argument for 000 is crap. As far as I can see, the only time that the mode matters at all is when playing with /proc/pid/fd, and the only way to get a non-O_RDWR memfd is using /proc/pid/fd, so I'll argue for 0600 instead.
Argument why 0600 is better than 0600 & ~umask: either callers don't care because the inode mode simply doesn't matter or they're using /proc/pid/fd to *reduce* permissions, in which case they'd probably like to avoid having to play with umask or call fchmod.
Argument why 0600 is better than 0777 & ~umask: people using /proc/pid/fd are the only ones who care, in which case they probably prefer that the permissions not be increased by other users if they give them a reduced-permission fd.
Anyway, this is all mostly unimportant. Some text in the man page is probably sufficient, but I still think that 0600 is trivial to implement and a little bit more friendly.
--Andy
Thanks David
On Thu 2014-04-10 13:37:26, Andy Lutomirski wrote:
On Thu, Apr 10, 2014 at 1:32 PM, Theodore Ts'o tytso@mit.edu wrote:
On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
This is the second time in a week that someone has asked for a way to have a struct file (or struct inode or whatever) that can't be reopened through /proc/pid/fd. This should be quite easy to implement as a separate feature.
What I suggested on a different thread was to add the following new file descriptor flags, to join FD_CLOEXEC, which would be manipulated using the F_GETFD and F_SETFD fcntl commands:
FD_NOPROCFS disallow being able to open the inode via /proc/<pid>/fd
FD_NOPASSFD disallow being able to pass the fd via a unix domain socket
FD_LOCKFLAGS if this bit is set, disallow any further changes of FD_CLOEXEC, FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.
Regardless of what else we might need to meet the use case for the proposed File Sealing API, I think this is a useful feature that could be used in many other contexts besides just the proposed memfd_create() use case.
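For reference, a small sketch of how the existing FD_CLOEXEC flag is read and changed with F_GETFD and F_SETFD, assuming Python 3.4+ on a POSIX system. The proposed FD_NOPROCFS, FD_NOPASSFD and FD_LOCKFLAGS flags are only a suggestion in this thread and are not used here; they would presumably be toggled the same way.

```python
import fcntl
import os

# Python creates pipe descriptors with close-on-exec already set.
r, w = os.pipe()

flags = fcntl.fcntl(r, fcntl.F_GETFD)
had_cloexec = bool(flags & fcntl.FD_CLOEXEC)

# Clear the flag: the descriptor would now survive an exec().
fcntl.fcntl(r, fcntl.F_SETFD, flags & ~fcntl.FD_CLOEXEC)
cleared = not (fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

# Set it again. Descriptor flags belong to this descriptor only,
# not to the underlying open file or inode.
fcntl.fcntl(r, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
restored = bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

os.close(r)
os.close(w)
```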
It occurs to me that, before going nuts with these kinds of flags, it may pay to just try to fix the /proc/self/fd issue for real -- we could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is read-only. That may be enough for the file sealing thing.
Yes please.
Current behaviour is very unexpected, and unexpected behaviour in security area is normally called "security hole".
Pavel
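The behaviour under discussion can be reproduced with a short sketch, assuming Linux with /proc mounted and a temporary file that the caller owns with a writable mode (tempfile.mkstemp creates it as 0600). A descriptor opened O_RDONLY is reopened O_RDWR through /proc/self/fd, and the write succeeds because only the inode permissions are checked, not the access mode of the original descriptor:

```python
import os
import tempfile

# A regular file owned by the caller, mode 0600.
tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"original")
os.close(tmp_fd)

# This descriptor is read-only.
ro_fd = os.open(path, os.O_RDONLY)

# Reopen it via /proc with more access than ro_fd itself has.
rw_fd = os.open(f"/proc/self/fd/{ro_fd}", os.O_RDWR)
os.write(rw_fd, b"CHANGED!")

# The change is visible through the original read-only descriptor.
content = os.pread(ro_fd, 8, 0)

os.close(rw_fd)
os.close(ro_fd)
os.unlink(path)
```

Under the fix suggested in the thread, the second os.open() call would fail instead.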
On 04/10/2014 10:37 PM, Andy Lutomirski wrote:
It occurs to me that, before going nuts with these kinds of flags, it may pay to just try to fix the /proc/self/fd issue for real -- we could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is read-only. That may be enough for the file sealing thing.
Increasing privilege on O_PATH descriptors via access through /proc/self/fd is part of the userspace API. The same thing might be true for O_RDONLY descriptors, but it's a bit less likely that there are any users out there. In any case, I'm not sure it makes sense to plug the O_RDONLY hole while leaving the O_PATH hole open.
On Jun 17, 2014 2:48 AM, "Florian Weimer" fweimer@redhat.com wrote:
On 04/10/2014 10:37 PM, Andy Lutomirski wrote:
It occurs to me that, before going nuts with these kinds of flags, it may pay to just try to fix the /proc/self/fd issue for real -- we could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is read-only. That may be enough for the file sealing thing.
Increasing privilege on O_PATH descriptors via access through /proc/self/fd is part of the userspace API. The same thing might be true for O_RDONLY descriptors, but it's a bit less likely that there are any users out there. In any case, I'm not sure it makes sense to plug the O_RDONLY hole while leaving the O_PATH hole open.
Do you mean O_PATH fds for the directory or O_PATH fds for the file itself? In any event, I'm much less concerned about passing O_PATH memfds around than O_RDONLY memfds.
I have incomplete patches for this stuff. I need to fix them so they work and get past Al Viro.
--Andy
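The O_PATH case for the file itself can be sketched as follows, assuming Linux with /proc mounted and a temporary file owned by the caller. An O_PATH descriptor cannot be read directly, but reopening it through /proc/self/fd produces a usable descriptor, subject to the usual inode permission check:

```python
import errno
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"payload")
os.close(tmp_fd)

# An O_PATH descriptor refers to the file but allows no I/O on it.
path_fd = os.open(path, os.O_PATH)
try:
    os.read(path_fd, 7)
    read_error = None
except OSError as exc:
    read_error = exc.errno

# Reopening through /proc upgrades it to a readable descriptor.
real_fd = os.open(f"/proc/self/fd/{path_fd}", os.O_RDONLY)
payload = os.read(real_fd, 7)

os.close(real_fd)
os.close(path_fd)
os.unlink(path)
```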
On Thu, Mar 20, 2014 at 11:32 AM, tytso@mit.edu wrote:
Looking at your patches, and what files you are modifying, you are enforcing this in the low-level file system.
I would love for this to be implemented in the filesystem level as well. Something like the ext4 immutable bit, but with the ability to still make hardlinks would be *very* useful for OSTree. And anyone else that uses hardlinks as a data source. The vserver people do something similar: http://linux-vserver.org/util-vserver:Vhashify
At the moment I have a read-only bind mount over /usr, but what I really want is to make the individual objects in the object store in /ostree/repo/objects be immutable, so even if a user or app navigates out to /sysroot they still can't mutate them (or the link targets in the visible /usr).
On 04/10/2014 07:45 AM, Colin Walters wrote:
On Thu, Mar 20, 2014 at 11:32 AM, tytso@mit.edu wrote:
Looking at your patches, and what files you are modifying, you are enforcing this in the low-level file system.
I would love for this to be implemented in the filesystem level as well. Something like the ext4 immutable bit, but with the ability to still make hardlinks would be *very* useful for OSTree. And anyone else that uses hardlinks as a data source. The vserver people do something similar: http://linux-vserver.org/util-vserver:Vhashify
At the moment I have a read-only bind mount over /usr, but what I really want is to make the individual objects in the object store in /ostree/repo/objects be immutable, so even if a user or app navigates out to /sysroot they still can't mutate them (or the link targets in the visible /usr).
COW links can do this already, I think. Of course, you'll have to use a filesystem that supports them.
--Andy
On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski luto@amacapital.net wrote:
COW links can do this already, I think. Of course, you'll have to use a filesystem that supports them.
COW links are nice if the filesystem supports them, but my userspace code needs to be filesystem agnostic. Because of that, the design for userspace simply doesn't allow arbitrary writes.
Instead, I have to painfully audit every rpm %post/dpkg postinst type script to ensure they break hardlinks, and furthermore only allow executing scripts that are known to do so.
But I think even in a btrfs world it'd still be useful to mark files as content-immutable.
Colin Walters wrote:
On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski luto@amacapital.net wrote:
COW links can do this already, I think. Of course, you'll have to use a filesystem that supports them.
COW links are nice if the filesystem supports them, but my userspace code needs to be filesystem agnostic. Because of that, the design for userspace simply doesn't allow arbitrary writes.
Instead, I have to painfully audit every rpm %post/dpkg postinst type script to ensure they break hardlinks, and furthermore only allow executing scripts that are known to do so.
But I think even in a btrfs world it'd still be useful to mark files as content-immutable.
If you create each tree as a subvolume and when it's complete put it in place with btrfs subvolume snapshot -r FOO_inprogress /ostree/repo/FOO, you get exactly that.
You can even use the new(ish) btrfs out-of-band dedup functionality to deduplicate read-only snapshots safely.
On 03/19/2014 08:06 PM, David Herrmann wrote:
Unlike existing techniques that provide similar protection, sealing allows file-sharing without any trust-relationship. This is enforced by rejecting seal modifications if you don't own an exclusive reference to the given file. So if you own a file-descriptor, you can be sure that no-one besides you can modify the seals on the given file. This allows mapping shared files from untrusted parties without the fear of the file getting truncated or modified by an attacker.
How do you keep these promises on network and FUSE file systems? Surely there is still some trust involved for such descriptors?
What happens if you create a loop device on a sealed descriptor?
Why does memfd_create not create a file backed by a memory region in the current process? Wouldn't this be a far more generic primitive? Creating aliases of memory regions would be interesting for many things (not just libffi bypassing SELinux-enforced NX restrictions :-).
Hi
On Tue, Apr 8, 2014 at 3:00 PM, Florian Weimer fweimer@redhat.com wrote:
How do you keep these promises on network and FUSE file systems?
I don't. This is shmem only.
Thanks David
On 04/09/2014 11:31 PM, David Herrmann wrote:
On Tue, Apr 8, 2014 at 3:00 PM, Florian Weimer fweimer@redhat.com wrote:
How do you keep these promises on network and FUSE file systems?
I don't. This is shmem only.
Ah. What do you recommend for a recipient to recognize such descriptors? Would they just try to seal them and reject them if this fails?
Hi
On Tue, Apr 22, 2014 at 11:10 AM, Florian Weimer fweimer@redhat.com wrote:
Ah. What do you recommend for a recipient to recognize such descriptors? Would they just try to seal them and reject them if this fails?
This highly depends on your use-case. Please see the initial email in this thread. It describes 2 example use-cases. In both cases, the recipients read the current set of seals and verify that a given set of seals is set.
Thanks David
On 04/22/2014 01:55 PM, David Herrmann wrote:
Hi
On Tue, Apr 22, 2014 at 11:10 AM, Florian Weimer fweimer@redhat.com wrote:
Ah. What do you recommend for a recipient to recognize such descriptors? Would they just try to seal them and reject them if this fails?
This highly depends on your use-case. Please see the initial email in this thread. It describes 2 example use-cases. In both cases, the recipients read the current set of seals and verify that a given set of seals is set.
I didn't find that very convincing. But in v2, seals are monotonic, so checking them should be reliable enough.
What happens when you create a loop device on a write-sealed descriptor?
Hi
On Tue, Apr 22, 2014 at 2:44 PM, Florian Weimer fweimer@redhat.com wrote:
I didn't find that very convincing. But in v2, seals are monotonic, so checking them should be reliable enough.
Ok.
What happens when you create a loop device on a write-sealed descriptor?
Any write-back to the loop-device will fail with EPERM as soon as the fd gets write-sealed. See __do_lo_send_write() in drivers/block/loop.c. It's up to the loop-device to forward the error via bio_endio() to the caller for proper error-handling.
Thanks David
On 04/08/2014 06:00 AM, Florian Weimer wrote:
On 03/19/2014 08:06 PM, David Herrmann wrote:
Unlike existing techniques that provide similar protection, sealing allows file-sharing without any trust-relationship. This is enforced by rejecting seal modifications if you don't own an exclusive reference to the given file. So if you own a file-descriptor, you can be sure that no-one besides you can modify the seals on the given file. This allows mapping shared files from untrusted parties without the fear of the file getting truncated or modified by an attacker.
How do you keep these promises on network and FUSE file systems? Surely there is still some trust involved for such descriptors?
What happens if you create a loop device on a sealed descriptor?
Why does memfd_create not create a file backed by a memory region in the current process? Wouldn't this be a far more generic primitive? Creating aliases of memory regions would be interesting for many things (not just libffi bypassing SELinux-enforced NX restrictions :-).
If you write a patch to prevent selinux from enforcing NX, I will ack that patch with all my might. I don't know how far it would get me, but I think that selinux has no business going anywhere near execmem.
Adding a clone mode to mremap might be a better bet. But memfd solves that problem, too, albeit messily.
--Andy