(reposted from my comments at http://lwn.net/Articles/593918/)
I may have thought of a way to subvert SEAL_WRITE using O_DIRECT and asynchronous I/O. I am not sure if this is a real problem or not, but better to ask, right?
The exploit would go like this:
1) mmap() the shared memory 2) open some *other* file with O_DIRECT 3) prepare a read-type iocb for the *other* file pointing to the mmap()ed memory 4) io_submit(), but don't wait for it to complete 5) munmap() the shared memory 6) SEAL_WRITE the shared memory 7) the "sealed" memory is overwritten by DMA from the disk drive at some point in the future when the I/O completes
So this exploit effectively changes the contents of the supposedly write-protected memory after SEAL_WRITE is granted.
For O_DIRECT the kernel pins the submitted pages in memory for DMA by incrementing the page reference counts when the I/O is submitted, allowing the pages to be modified by DMA even if they are no longer mapped in the address space of the process. This is different from a regular read(), which uses the CPU to copy the data and will fail if the pages are not mapped.
I am sure there are also other direct I/O mechanisms in the kernel that can be used to setup a DMA transfer to change the contents of unmapped memory; the SCSI generic driver comes to mind.
I suppose SEAL_WRITE could iterate over all the pages in the file and check to make sure no page refcount is greater than the "expected" value, and return an error instead of granting the seal if a page is found with an unexpected extra reference that might have been added by e.g. get_user_pages() for direct I/O. But looking over shmem_set_seals() in patch 2/6, it doesn't seem to do that.
Tony Battersby Cybernetics
Hi
On Fri, Apr 11, 2014 at 11:36 PM, Andy Lutomirski luto@amacapital.net wrote:
A quick grep of the kernel tree finds exactly zero code paths incrementing i_mmap_writable outside of mmap and fork.
Or do you mean a different kind of write ref? What am I missing here?
Sorry, I meant i_writecount.
Thanks David
Hi
On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby tonyb@cybernetics.com wrote:
For O_DIRECT the kernel pins the submitted pages in memory for DMA by incrementing the page reference counts when the I/O is submitted, allowing the pages to be modified by DMA even if they are no longer mapped in the address space of the process. This is different from a regular read(), which uses the CPU to copy the data and will fail if the pages are not mapped.
Can you please provide an example code-path? For instance, file_read_actor() does not pin any pages but only keeps the user-space address and resolves it once it has data to write.
Thanks David
On Fri, Apr 11, 2014 at 2:42 PM, David Herrmann dh.herrmann@gmail.com wrote:
Hi
On Fri, Apr 11, 2014 at 11:36 PM, Andy Lutomirski luto@amacapital.net wrote:
A quick grep of the kernel tree finds exactly zero code paths incrementing i_mmap_writable outside of mmap and fork.
Or do you mean a different kind of write ref? What am I missing here?
Sorry, I meant i_writecount.
I bet this is missing from lots of places. For example, I can't find any write_access stuff in the rdma code.
I suspect that the VM_DENYWRITE code is just generally racy.
--Andy
On 04/10/2014 05:22 PM, David Herrmann wrote:
Hi
On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby tonyb@cybernetics.com wrote:
For O_DIRECT the kernel pins the submitted pages in memory for DMA by incrementing the page reference counts when the I/O is submitted, allowing the pages to be modified by DMA even if they are no longer mapped in the address space of the process. This is different from a regular read(), which uses the CPU to copy the data and will fail if the pages are not mapped.
Can you please provide an example code-path? For instance, file_read_actor() does not pin any pages but only keeps the user-space address and resolves it once it has data to write.
This may be an issue for anything in the kernel that calls get_user_pages and holds onto the result at any time that mmap_sem isn't held.
I don't know exactly what does that, but RDMA comes to mind. So does (ugh!) vmsplice, although I suspect that vmsplice doesn't write.
--Andy
Hi
On Sat, Apr 12, 2014 at 12:07 AM, Andy Lutomirski luto@amacapital.net wrote:
I bet this is missing from lots of places. For example, I can't find any write_access stuff in the rdma code.
I suspect that the VM_DENYWRITE code is just generally racy.
So what does S_IMMUTABLE do to prevent such races? I somehow suspect it's broken in that regard, too.
I really dislike pinning pages like this, but if people want to keep it I guess I have to scan all shmem-inode pages before changing seals.
Thanks David
Andy Lutomirski wrote:
On 04/10/2014 05:22 PM, David Herrmann wrote:
Hi
On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby tonyb@cybernetics.com wrote:
For O_DIRECT the kernel pins the submitted pages in memory for DMA by incrementing the page reference counts when the I/O is submitted, allowing the pages to be modified by DMA even if they are no longer mapped in the address space of the process. This is different from a regular read(), which uses the CPU to copy the data and will fail if the pages are not mapped.
Can you please provide an example code-path? For instance, file_read_actor() does not pin any pages but only keeps the user-space address and resolves it once it has data to write.
This may be an issue for anything in the kernel that calls get_user_pages and holds onto the result at any time that mmap_sem isn't held.
Exactly. For O_DIRECT, that would be the call to get_user_pages_fast() from dio_refill_pages() in fs/direct-io.c, which is ultimately called from blkdev_direct_IO().
From the comment for get_user_pages_fast(): "Attempt to pin user pages
in memory..."
Tony
Hi
On Fri, Apr 11, 2014 at 3:43 PM, Tony Battersby tonyb@cybernetics.com wrote:
Exactly. For O_DIRECT, that would be the call to get_user_pages_fast() from dio_refill_pages() in fs/direct-io.c, which is ultimately called from blkdev_direct_IO().
If you drop mmap_sem after pinning a page without taking a write-ref, you break i_mmap_writable / VM_DENYWRITE. In memfd I rely on i_mmap_writable to work, same thing is done by exec() (and the old, now disabled, MAP_DENYWRITE).
I don't know whether I should care. I mean, everyone pinning pages and writing to it without holding the mmap_sem has to take a write-ref for each page or it breaks i_mmap_writable. So this seems to be a bug in direct-IO, not in anyone relying on it, right?
Thanks David
On 04/11/2014 02:31 PM, David Herrmann wrote:
Hi
On Fri, Apr 11, 2014 at 3:43 PM, Tony Battersby tonyb@cybernetics.com wrote:
Exactly. For O_DIRECT, that would be the call to get_user_pages_fast() from dio_refill_pages() in fs/direct-io.c, which is ultimately called from blkdev_direct_IO().
If you drop mmap_sem after pinning a page without taking a write-ref, you break i_mmap_writable / VM_DENYWRITE. In memfd I rely on i_mmap_writable to work, same thing is done by exec() (and the old, now disabled, MAP_DENYWRITE).
I don't know whether I should care. I mean, everyone pinning pages and writing to it without holding the mmap_sem has to take a write-ref for each page or it breaks i_mmap_writable. So this seems to be a bug in direct-IO, not in anyone relying on it, right?
A quick grep of the kernel tree finds exactly zero code paths incrementing i_mmap_writable outside of mmap and fork.
Or do you mean a different kind of write ref? What am I missing here?
--Andy
dri-devel@lists.freedesktop.org