Hi all,
This is v2 of the fdinfo patches. The main change is that the path field is now added only for files with anon inodes. Rebased on 5.19-rc3.
The previous cover letter is copied below for convenience.
Thanks, Kalesh
-----------
Processes can pin shared memory by keeping a handle to it through a file descriptor; for instance dmabufs, memfd, and ashmem (in Android).
In the case of a memory leak, to identify the process pinning the memory, userspace needs to:
  - Iterate the /proc/<pid>/fd/* for each process
  - Do a readlink on each entry to identify the type of memory from the file path.
  - stat() each entry to get the size of the memory.
The file permissions on /proc/<pid>/fd/* only allow the owner or root to perform the operations above, so this approach is not suitable for capturing the system-wide state in a production environment.
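For reference, the three steps above can be sketched roughly as follows. This is only an illustration of the existing /proc/<pid>/fd approach, not code from this series; the helper names fd_describe() and fd_walk() are made up for the example.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: readlink() and stat() one /proc/<pid>/fd/<fd> entry. */
static int fd_describe(pid_t pid, int fd, char *target, size_t len,
		       long long *size)
{
	char link[64];
	struct stat st;
	ssize_t n;

	snprintf(link, sizeof(link), "/proc/%d/fd/%d", (int)pid, fd);

	/* The link target identifies the type of memory. */
	n = readlink(link, target, len - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';

	/* stat() follows the link and reports the size of the file. */
	if (stat(link, &st) < 0)
		return -1;
	*size = (long long)st.st_size;
	return 0;
}

/* Hypothetical helper: walk all fds of one process, return how many were read. */
static int fd_walk(pid_t pid)
{
	char dirpath[64], target[PATH_MAX];
	struct dirent *de;
	long long size;
	int count = 0;
	DIR *dir;

	snprintf(dirpath, sizeof(dirpath), "/proc/%d/fd", (int)pid);
	dir = opendir(dirpath);
	if (!dir)
		return -1;	/* fails unless the caller is the owner or root */

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		if (fd_describe(pid, atoi(de->d_name), target,
				sizeof(target), &size) == 0) {
			printf("%s\t%lld\t%s\n", de->d_name, size, target);
			count++;
		}
	}
	closedir(dir);
	return count;
}
```

A system-wide scan would call fd_walk() for every pid found in /proc, which is where the permission problem shows up.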
This issue was addressed for dmabufs by making /proc/*/fdinfo/* accessible to a process with PTRACE_MODE_READ_FSCREDS credentials [1]. To allow the same kind of tracking for other types of shared memory, add the following fields to /proc/<pid>/fdinfo/<fd>:
path - This allows identifying the type of memory based on common prefixes: e.g. "/memfd...", "/dmabuf...", "/dev/ashmem..."
This was not an issue when dmabuf tracking was introduced because the exp_name field of dmabuf fdinfo could be used to distinguish dmabuf fds from other types.
size - To track the amount of memory that is being pinned.
dmabufs expose size as an additional field in fdinfo. Remove this and make it a common field for all fds.
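As a rough sketch of how a userspace tool might consume the two fields, assuming a kernel with this series applied so that fdinfo carries "size:" and "path:" lines in the format proposed here. The struct fdinfo_fields and fdinfo_parse() names are hypothetical; the parser works on a buffer that the caller has already read from /proc/<pid>/fdinfo/<fd>.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the two proposed fields. */
struct fdinfo_fields {
	long long size;		/* -1 if no "size:" line was found */
	char path[256];		/* empty if no "path:" line was found */
};

/* Parse the text of one fdinfo file, line by line. */
static void fdinfo_parse(const char *text, struct fdinfo_fields *out)
{
	const char *line = text;

	out->size = -1;
	out->path[0] = '\0';

	while (line && *line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len > 5 && strncmp(line, "size:", 5) == 0) {
			/* %lld skips the tab that follows the colon */
			sscanf(line + 5, "%lld", &out->size);
		} else if (len > 5 && strncmp(line, "path:", 5) == 0) {
			const char *val = line + 5;
			size_t vlen;

			while (val < line + len && (*val == '\t' || *val == ' '))
				val++;
			vlen = (size_t)(line + len - val);
			if (vlen >= sizeof(out->path))
				vlen = sizeof(out->path) - 1;
			memcpy(out->path, val, vlen);
			out->path[vlen] = '\0';
		}
		line = eol ? eol + 1 : NULL;
	}
}
```

Parsing from a buffer rather than from an open file keeps the sketch independent of how the caller obtains access to fdinfo.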
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS -- the same as for /proc/<pid>/maps which also exposes the path and size for mapped memory regions.
This allows a system process with PTRACE_MODE_READ_FSCREDS to account for the per-process pinned memory via fdinfo.
Kalesh Singh (2):
  procfs: Add 'size' to /proc/<pid>/fdinfo/
  procfs: Add 'path' to /proc/<pid>/fdinfo/

 Documentation/filesystems/proc.rst | 22 ++++++++++++++++++++--
 drivers/dma-buf/dma-buf.c          |  1 -
 fs/libfs.c                         |  9 +++++++++
 fs/proc/fd.c                       | 18 ++++++++++++++----
 include/linux/fs.h                 |  1 +
 5 files changed, 44 insertions(+), 7 deletions(-)
base-commit: a111daf0c53ae91e71fd2bfe7497862d14132e3e
To be able to account for the amount of memory a process is keeping pinned through open file descriptors, add a 'size' field to the fdinfo output.
dmabuf fds already expose a 'size' field for this reason; remove it and make it a common field for all fds. This allows tracking of other types of memory (e.g. memfd and ashmem in Android).
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
Changes in v2:
  - Add Christian's Reviewed-by
Changes from rfc:
  - Split adding 'size' and 'path' into separate patches, per Christian
  - Split fdinfo seq_printf into separate lines, per Christian
  - Fix indentation (use tabs) in documentation, per Randy

 Documentation/filesystems/proc.rst | 12 ++++++++++--
 drivers/dma-buf/dma-buf.c          |  1 -
 fs/proc/fd.c                       |  9 +++++----
 3 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 1bc91fb8c321..779c05528e87 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1886,13 +1886,14 @@ if precise results are needed.
 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
 ---------------------------------------------------------------
 This file provides information associated with an opened file. The regular
-files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
+files have at least five fields -- 'pos', 'flags', 'mnt_id', 'ino', and 'size'.
+
 The 'pos' represents the current offset of the opened file in decimal
 form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the
 file has been created with [see open(2) for details] and 'mnt_id' represents
 mount ID of the file system containing the opened file [see 3.5
 /proc/<pid>/mountinfo for details]. 'ino' represents the inode number of
-the file.
+the file, and 'size' represents the size of the file in bytes.
 
 A typical output is::
 
@@ -1900,6 +1901,7 @@ A typical output is::
 	flags:	0100002
 	mnt_id:	19
 	ino:	63107
+	size:	0
 
 All locks associated with a file descriptor are shown in its fdinfo too::
 
@@ -1917,6 +1919,7 @@ Eventfd files
 	flags:	04002
 	mnt_id:	9
 	ino:	63107
+	size:	0
 	eventfd-count:	5a
 
 where 'eventfd-count' is hex value of a counter.
@@ -1930,6 +1933,7 @@ Signalfd files
 	flags:	04002
 	mnt_id:	9
 	ino:	63107
+	size:	0
 	sigmask:	0000000000000200
 
 where 'sigmask' is hex value of the signal mask associated
@@ -1944,6 +1948,7 @@ Epoll files
 	flags:	02
 	mnt_id:	9
 	ino:	63107
+	size:	0
 	tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7
 
 where 'tfd' is a target file descriptor number in decimal form,
@@ -1962,6 +1967,7 @@ For inotify files the format is the following::
 	flags:	02000000
 	mnt_id:	9
 	ino:	63107
+	size:	0
 	inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
 
 where 'wd' is a watch descriptor in decimal form, i.e. a target file
@@ -1985,6 +1991,7 @@ For fanotify files the format is::
 	flags:	02
 	mnt_id:	9
 	ino:	63107
+	size:	0
 	fanotify flags:10 event-flags:0
 	fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
 	fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
@@ -2010,6 +2017,7 @@ Timerfd files
 	flags:	02
 	mnt_id:	9
 	ino:	63107
+	size:	0
 	clockid: 0
 	ticks: 0
 	settime flags: 01
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 32f55640890c..5f2ae38c960f 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -378,7 +378,6 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 {
 	struct dma_buf *dmabuf = file->private_data;
 
-	seq_printf(m, "size:\t%zu\n", dmabuf->size);
 	/* Don't count the temporary reference taken inside procfs seq_show */
 	seq_printf(m, "count:\t%ld\n", file_count(dmabuf->file) - 1);
 	seq_printf(m, "exp_name:\t%s\n", dmabuf->exp_name);
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 913bef0d2a36..464bc3f55759 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -54,10 +54,11 @@ static int seq_show(struct seq_file *m, void *v)
 	if (ret)
 		return ret;
 
-	seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\n",
-		   (long long)file->f_pos, f_flags,
-		   real_mount(file->f_path.mnt)->mnt_id,
-		   file_inode(file)->i_ino);
+	seq_printf(m, "pos:\t%lli\n", (long long)file->f_pos);
+	seq_printf(m, "flags:\t0%o\n", f_flags);
+	seq_printf(m, "mnt_id:\t%i\n", real_mount(file->f_path.mnt)->mnt_id);
+	seq_printf(m, "ino:\t%lu\n", file_inode(file)->i_ino);
+	seq_printf(m, "size:\t%lli\n", (long long)file_inode(file)->i_size);
 
 	/* show_fd_locks() never deferences files so a stale value is safe */
 	show_fd_locks(m, file, files);
In order to identify the type of memory a process has pinned through its open fds, add the file path to fdinfo output. This allows identifying memory types based on common prefixes: e.g. "/memfd...", "/dmabuf...", "/dev/ashmem...".
To be cautious, only expose the paths for anonymous inodes; this also avoids printing path names with strange characters.
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS, the same as /proc/<pid>/maps, which also exposes the file path of mappings, so the security permissions for accessing the path are consistent with those of /proc/<pid>/maps.
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
  - Only add path field for files with anon inodes.
Changes from rfc:
  - Split adding 'size' and 'path' into separate patches, per Christian
  - Fix indentation (use tabs) in documentation, per Randy

 Documentation/filesystems/proc.rst | 10 ++++++++++
 fs/libfs.c                         |  9 +++++++++
 fs/proc/fd.c                       | 13 +++++++++++--
 include/linux/fs.h                 |  1 +
 4 files changed, 31 insertions(+), 2 deletions(-)
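To illustrate the prefix matching mentioned above, a userspace consumer could classify the 'path' value along these lines. This is a minimal sketch, not part of the patch; the enum mem_type values and classify_path() are hypothetical names, and the prefix table only contains the examples given in this message.

```c
#include <string.h>

/* Hypothetical classification of pinned memory by fdinfo path prefix. */
enum mem_type { MEM_UNKNOWN, MEM_MEMFD, MEM_DMABUF, MEM_ASHMEM };

static enum mem_type classify_path(const char *path)
{
	/* Prefixes taken from the examples in the commit message. */
	static const struct {
		const char *prefix;
		enum mem_type type;
	} table[] = {
		{ "/memfd",      MEM_MEMFD },
		{ "/dmabuf",     MEM_DMABUF },
		{ "/dev/ashmem", MEM_ASHMEM },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strncmp(path, table[i].prefix,
			    strlen(table[i].prefix)) == 0)
			return table[i].type;
	}
	return MEM_UNKNOWN;
}
```

Anything that does not match a known prefix, such as "anon_inode:[eventfd]", falls through to MEM_UNKNOWN.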
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 779c05528e87..ca23a9af4845 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1907,6 +1907,9 @@ All locks associated with a file descriptor are shown in its fdinfo too::
 
 	lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF
 
+Files with anonymous inodes have an additional 'path' field which represents
+the anonymous file path.
+
 The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags
 pair provide additional information particular to the objects they represent.
 
@@ -1920,6 +1923,7 @@ Eventfd files
 	mnt_id:	9
 	ino:	63107
 	size:	0
+	path:	anon_inode:[eventfd]
 	eventfd-count:	5a
 
 where 'eventfd-count' is hex value of a counter.
@@ -1934,6 +1938,7 @@ Signalfd files
 	mnt_id:	9
 	ino:	63107
 	size:	0
+	path:	anon_inode:[signalfd]
 	sigmask:	0000000000000200
 
 where 'sigmask' is hex value of the signal mask associated
@@ -1949,6 +1954,7 @@ Epoll files
 	mnt_id:	9
 	ino:	63107
 	size:	0
+	path:	anon_inode:[eventpoll]
 	tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7
 
 where 'tfd' is a target file descriptor number in decimal form,
@@ -1968,6 +1974,7 @@ For inotify files the format is the following::
 	mnt_id:	9
 	ino:	63107
 	size:	0
+	path:	anon_inode:inotify
 	inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
 
 where 'wd' is a watch descriptor in decimal form, i.e. a target file
@@ -1992,6 +1999,7 @@ For fanotify files the format is::
 	mnt_id:	9
 	ino:	63107
 	size:	0
+	path:	anon_inode:[fanotify]
 	fanotify flags:10 event-flags:0
 	fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
 	fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
@@ -2018,6 +2026,7 @@ Timerfd files
 	mnt_id:	9
 	ino:	63107
 	size:	0
+	path:	anon_inode:[timerfd]
 	clockid: 0
 	ticks: 0
 	settime flags: 01
@@ -2042,6 +2051,7 @@ DMA Buffer files
 	mnt_id:	9
 	ino:	63107
 	size:	32768
+	path:	/dmabuf:
 	count:	2
 	exp_name:	system-heap
 
diff --git a/fs/libfs.c b/fs/libfs.c
index 31b0ddf01c31..6911749b4da7 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1217,6 +1217,15 @@ void kfree_link(void *p)
 }
 EXPORT_SYMBOL(kfree_link);
 
+static const struct address_space_operations anon_aops = {
+	.dirty_folio	= noop_dirty_folio,
+};
+
+bool is_anon_inode(struct inode *inode)
+{
+	return inode->i_mapping->a_ops == &anon_aops;
+}
+
 struct inode *alloc_anon_inode(struct super_block *s)
 {
 	static const struct address_space_operations anon_aops = {
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 464bc3f55759..5bac79a2fa51 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -23,6 +23,7 @@ static int seq_show(struct seq_file *m, void *v)
 	struct files_struct *files = NULL;
 	int f_flags = 0, ret = -ENOENT;
 	struct file *file = NULL;
+	struct inode *inode = NULL;
 	struct task_struct *task;
 
 	task = get_proc_task(m->private);
@@ -54,11 +55,19 @@ static int seq_show(struct seq_file *m, void *v)
 	if (ret)
 		return ret;
 
+	inode = file_inode(file);
+
 	seq_printf(m, "pos:\t%lli\n", (long long)file->f_pos);
 	seq_printf(m, "flags:\t0%o\n", f_flags);
 	seq_printf(m, "mnt_id:\t%i\n", real_mount(file->f_path.mnt)->mnt_id);
-	seq_printf(m, "ino:\t%lu\n", file_inode(file)->i_ino);
-	seq_printf(m, "size:\t%lli\n", (long long)file_inode(file)->i_size);
+	seq_printf(m, "ino:\t%lu\n", inode->i_ino);
+	seq_printf(m, "size:\t%lli\n", (long long)inode->i_size);
+
+	if (is_anon_inode(inode)) {
+		seq_puts(m, "path:\t");
+		seq_file_path(m, file, "\n");
+		seq_putc(m, '\n');
+	}
 
 	/* show_fd_locks() never deferences files so a stale value is safe */
 	show_fd_locks(m, file, files);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..73449e620b66 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3111,6 +3111,7 @@ extern void page_put_link(void *);
 extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
+extern bool is_anon_inode(struct inode *inode);
 void generic_fillattr(struct user_namespace *, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
dri-devel@lists.freedesktop.org