Hi all,
This developed from a discussion with Jason, starting with some patches touching get_vaddr_frames() that I typed up.
The problem is that way back, VM_IO | VM_PFNMAP mappings were pretty static, so just following the ptes to derive a pfn and then using it somewhere else was ok.
But we're no longer in such a world, there's tons of little races and some fundamental problems.
This series here is an attempt to at least scope the problem; it's all the issues I've found with quite some code reading all over the tree:
- first part tries to move mm/frame-vector.c away, it's fundamentally an unsafe thing
- two patches to close follow_pfn races by holding pt locks
- two pci patches where I spotted inconsistencies between the 3 different ways userspace can map pci bars
- and finally some patches to mark up the remaining issues
No testing beyond "it compiles"; this is very much an RFC to figure out whether this makes sense, whether it's a real thing, and how to fix this up properly.
Cheers, Daniel
Daniel Vetter (13):
  drm/exynos: Stop using frame_vector helpers
  drm/exynos: Use FOLL_LONGTERM for g2d cmdlists
  misc/habana: Stop using frame_vector helpers
  misc/habana: Use FOLL_LONGTERM for userptr
  mm/frame-vector: Use FOLL_LONGTERM
  media: videobuf2: Move frame_vector into media subsystem
  mm: close race in generic_access_phys
  s390/pci: Remove races against pte updates
  PCI: obey iomem restrictions for procfs mmap
  PCI: revoke mappings like devmem
  mm: add unsafe_follow_pfn
  media/videbuf1|2: Mark follow_pfn usage as unsafe
  vfio/type1: Mark follow_pfn as unsafe
 arch/s390/pci/pci_mmio.c                      | 98 +++++++++++--------
 drivers/char/mem.c                            | 16 ++-
 drivers/gpu/drm/exynos/Kconfig                |  1 -
 drivers/gpu/drm/exynos/exynos_drm_g2d.c       | 49 +++++-----
 drivers/media/common/videobuf2/Kconfig        |  1 -
 drivers/media/common/videobuf2/Makefile       |  1 +
 .../media/common/videobuf2}/frame_vector.c    | 40 +++-----
 drivers/media/platform/omap/Kconfig           |  1 -
 drivers/media/v4l2-core/videobuf-dma-contig.c |  2 +-
 drivers/misc/habanalabs/Kconfig               |  1 -
 drivers/misc/habanalabs/common/habanalabs.h   |  3 +-
 drivers/misc/habanalabs/common/memory.c       | 52 +++++-----
 drivers/pci/mmap.c                            |  3 +
 drivers/pci/proc.c                            |  5 +
 drivers/vfio/vfio_iommu_type1.c               |  4 +-
 include/linux/ioport.h                        |  2 +
 include/linux/mm.h                            | 47 +--------
 include/media/videobuf2-core.h                | 42 ++++++++
 mm/Kconfig                                    |  3 -
 mm/Makefile                                   |  1 -
 mm/memory.c                                   | 76 +++++++++++++-
 mm/nommu.c                                    | 17 ++++
 security/Kconfig                              | 13 +++
 23 files changed, 296 insertions(+), 182 deletions(-)
 rename {mm => drivers/media/common/videobuf2}/frame_vector.c (90%)
All we need is a pages array; pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Inki Dae inki.dae@samsung.com
Cc: Joonyoung Shim jy0922.shim@samsung.com
Cc: Seung-Woo Kim sw0312.kim@samsung.com
Cc: Kyungmin Park kyungmin.park@samsung.com
Cc: Kukjin Kim kgene@kernel.org
Cc: Krzysztof Kozlowski krzk@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: Dan Williams dan.j.williams@intel.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
---
 drivers/gpu/drm/exynos/Kconfig          |  1 -
 drivers/gpu/drm/exynos/exynos_drm_g2d.c | 48 ++++++++++++-------------
 2 files changed, 22 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/exynos/Kconfig b/drivers/gpu/drm/exynos/Kconfig
index 6417f374b923..43257ef3c09d 100644
--- a/drivers/gpu/drm/exynos/Kconfig
+++ b/drivers/gpu/drm/exynos/Kconfig
@@ -88,7 +88,6 @@ comment "Sub-drivers"
 config DRM_EXYNOS_G2D
 	bool "G2D"
 	depends on VIDEO_SAMSUNG_S5P_G2D=n || COMPILE_TEST
-	select FRAME_VECTOR
 	help
 	  Choose this option if you want to use Exynos G2D for DRM.

diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index 967a5cdc120e..c83f6faac9de 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -205,7 +205,8 @@ struct g2d_cmdlist_userptr {
 	dma_addr_t		dma_addr;
 	unsigned long		userptr;
 	unsigned long		size;
-	struct frame_vector	*vec;
+	struct page		**pages;
+	unsigned int		npages;
 	struct sg_table		*sgt;
 	atomic_t		refcount;
 	bool			in_pool;
@@ -378,7 +379,7 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
 		bool force)
 {
 	struct g2d_cmdlist_userptr *g2d_userptr = obj;
-	struct page **pages;
+	int i;

 	if (!obj)
 		return;
@@ -398,15 +399,11 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
 	dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
 			  DMA_BIDIRECTIONAL, 0);

-	pages = frame_vector_pages(g2d_userptr->vec);
-	if (!IS_ERR(pages)) {
-		int i;
+	for (i = 0; i < g2d_userptr->npages; i++)
+		set_page_dirty_lock(g2d_userptr->pages[i]);

-		for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++)
-			set_page_dirty_lock(pages[i]);
-	}
-	put_vaddr_frames(g2d_userptr->vec);
-	frame_vector_destroy(g2d_userptr->vec);
+	unpin_user_pages(g2d_userptr->pages, g2d_userptr->npages);
+	kvfree(g2d_userptr->pages);

 	if (!g2d_userptr->out_of_list)
 		list_del_init(&g2d_userptr->list);
@@ -474,35 +471,34 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 	offset = userptr & ~PAGE_MASK;
 	end = PAGE_ALIGN(userptr + size);
 	npages = (end - start) >> PAGE_SHIFT;
-	g2d_userptr->vec = frame_vector_create(npages);
-	if (!g2d_userptr->vec) {
+	g2d_userptr->pages = kvmalloc_array(npages, sizeof(*g2d_userptr->pages),
+					    GFP_KERNEL);
+	if (!g2d_userptr->pages) {
 		ret = -ENOMEM;
 		goto err_free;
 	}

-	ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
-			       g2d_userptr->vec);
+	ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+				  g2d_userptr->pages);
 	if (ret != npages) {
 		DRM_DEV_ERROR(g2d->dev,
 			      "failed to get user pages from userptr.\n");
 		if (ret < 0)
-			goto err_destroy_framevec;
-		ret = -EFAULT;
-		goto err_put_framevec;
-	}
-	if (frame_vector_to_pages(g2d_userptr->vec) < 0) {
+			goto err_destroy_pages;
+		npages = ret;
 		ret = -EFAULT;
-		goto err_put_framevec;
+		goto err_unpin_pages;
 	}
+	g2d_userptr->npages = npages;

 	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
 	if (!sgt) {
 		ret = -ENOMEM;
-		goto err_put_framevec;
+		goto err_unpin_pages;
 	}

 	ret = sg_alloc_table_from_pages(sgt,
-					frame_vector_pages(g2d_userptr->vec),
+					g2d_userptr->pages,
 					npages, offset, size, GFP_KERNEL);
 	if (ret < 0) {
 		DRM_DEV_ERROR(g2d->dev, "failed to get sgt from pages.\n");
@@ -538,11 +534,11 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 err_free_sgt:
 	kfree(sgt);

-err_put_framevec:
-	put_vaddr_frames(g2d_userptr->vec);
+err_unpin_pages:
+	unpin_user_pages(g2d_userptr->pages, npages);

-err_destroy_framevec:
-	frame_vector_destroy(g2d_userptr->vec);
+err_destroy_pages:
+	kvfree(g2d_userptr->pages);

 err_free:
 	kfree(g2d_userptr);
On 10/7/20 9:44 AM, Daniel Vetter wrote:
All we need is a pages array; pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
...
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index 967a5cdc120e..c83f6faac9de 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -205,7 +205,8 @@ struct g2d_cmdlist_userptr {
 	dma_addr_t		dma_addr;
 	unsigned long		userptr;
 	unsigned long		size;
-	struct frame_vector	*vec;
+	struct page		**pages;
+	unsigned int		npages;
 	struct sg_table		*sgt;
 	atomic_t		refcount;
 	bool			in_pool;
@@ -378,7 +379,7 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
 		bool force)
 {
 	struct g2d_cmdlist_userptr *g2d_userptr = obj;
-	struct page **pages;
+	int i;
The above line can also be deleted, see below.
 	if (!obj)
 		return;
@@ -398,15 +399,11 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d,
 	dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
 			  DMA_BIDIRECTIONAL, 0);

-	pages = frame_vector_pages(g2d_userptr->vec);
-	if (!IS_ERR(pages)) {
-		int i;
+	for (i = 0; i < g2d_userptr->npages; i++)
+		set_page_dirty_lock(g2d_userptr->pages[i]);

-		for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++)
-			set_page_dirty_lock(pages[i]);
-	}
-	put_vaddr_frames(g2d_userptr->vec);
-	frame_vector_destroy(g2d_userptr->vec);
+	unpin_user_pages(g2d_userptr->pages, g2d_userptr->npages);
+	kvfree(g2d_userptr->pages);
You can avoid writing your own loop, and just simplify the whole thing down to two lines:
	unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
				    true);
	kvfree(g2d_userptr->pages);
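With that, the whole dirty/unpin block in g2d_userptr_put_dma_addr() would shrink to something like this (untested sketch, same names as in your patch):

	dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt,
			  DMA_BIDIRECTIONAL, 0);

	/* marks each page dirty (taking the page lock) and unpins it */
	unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
				    true);
	kvfree(g2d_userptr->pages);

...which is also why the int i local above becomes unnecessary.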
...
The rest all looks good, you've avoided the usual API pitfalls. :)
thanks,
On Wed, Oct 7, 2020 at 10:33 PM John Hubbard jhubbard@nvidia.com wrote:
On 10/7/20 9:44 AM, Daniel Vetter wrote:
...
You can avoid writing your own loop, and just simplify the whole thing down to two lines:
	unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
				    true);
	kvfree(g2d_userptr->pages);
Oh nice, this is neat. I'll also roll it out in the habanalabs patch, that has the same thing. Well almost, it only uses set_page_dirty, not the _lock variant. But I have no idea whether that matters or not? -Daniel
On 10/7/20 2:32 PM, Daniel Vetter wrote:
On Wed, Oct 7, 2020 at 10:33 PM John Hubbard jhubbard@nvidia.com wrote:
On 10/7/20 9:44 AM, Daniel Vetter wrote:
...
You can avoid writing your own loop, and just simplify the whole thing down to two lines:
	unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
				    true);
	kvfree(g2d_userptr->pages);
Oh nice, this is neat. I'll also roll it out in the habanalabs patch, that has the same thing. Well almost, it only uses set_page_dirty, not the _lock variant. But I have no idea whether that matters or not?
It matters. And invariably, call sites that use set_page_dirty() instead of set_page_dirty_lock() were already wrong. Which is why I never had to provide anything like "unpin_user_pages_dirty (not locked)".
Although in habanalabs case, I just reviewed patch 3 and I think they *were* correctly using set_page_dirty_lock()...
thanks,
On Wed, Oct 7, 2020 at 11:37 PM John Hubbard jhubbard@nvidia.com wrote:
On 10/7/20 2:32 PM, Daniel Vetter wrote:
On Wed, Oct 7, 2020 at 10:33 PM John Hubbard jhubbard@nvidia.com wrote:
On 10/7/20 9:44 AM, Daniel Vetter wrote:
...
You can avoid writing your own loop, and just simplify the whole thing down to two lines:
	unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages,
				    true);
	kvfree(g2d_userptr->pages);
Oh nice, this is neat. I'll also roll it out in the habanalabs patch, that has the same thing. Well almost, it only uses set_page_dirty, not the _lock variant. But I have no idea whether that matters or not?
It matters. And invariably, call sites that use set_page_dirty() instead of set_page_dirty_lock() were already wrong. Which is why I never had to provide anything like "unpin_user_pages_dirty (not locked)".
Although in habanalabs case, I just reviewed patch 3 and I think they *were* correctly using set_page_dirty_lock()...
Yeah I mixed that up with some other code I read; habanalabs is using _lock. I have seen a pile of gup/pup code though that only uses set_page_dirty, and looking around I did not really parse the comment above set_page_dirty(). I guess just using the _lock variant shouldn't hurt too much. I've found a comment though in the infiniband umem notifier that it's sometimes called with the page locked, and sometimes not, so life is complicated there. But how it avoids races I didn't understand. -Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
The exynos g2d interface is very unusual, but it looks like the userptr objects are persistent. Hence they need FOLL_LONGTERM.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Inki Dae inki.dae@samsung.com
Cc: Joonyoung Shim jy0922.shim@samsung.com
Cc: Seung-Woo Kim sw0312.kim@samsung.com
Cc: Kyungmin Park kyungmin.park@samsung.com
Cc: Kukjin Kim kgene@kernel.org
Cc: Krzysztof Kozlowski krzk@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: Dan Williams dan.j.williams@intel.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
---
 drivers/gpu/drm/exynos/exynos_drm_g2d.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index c83f6faac9de..514fd000feb1 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -478,7 +478,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 		goto err_free;
 	}

-	ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+	ret = pin_user_pages_fast(start, npages,
+				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 				  g2d_userptr->pages);
 	if (ret != npages) {
 		DRM_DEV_ERROR(g2d->dev,
On 10/7/20 9:44 AM, Daniel Vetter wrote:
The exynos g2d interface is very unusual, but it looks like the userptr objects are persistent. Hence they need FOLL_LONGTERM.
...
Looks good from a pin_user_pages_fast() point of view. I'm of course not a exynos developer, so we still need a look from one of those, ideally, but:
Reviewed-by: John Hubbard jhubbard@nvidia.com
thanks,
All we need is a pages array; pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: Dan Williams dan.j.williams@intel.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Oded Gabbay oded.gabbay@gmail.com
Cc: Omer Shpigelman oshpigelman@habana.ai
Cc: Ofir Bitton obitton@habana.ai
Cc: Tomer Tayar ttayar@habana.ai
Cc: Moti Haimovski mhaimovski@habana.ai
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Pawel Piskorski ppiskorski@habana.ai
---
 drivers/misc/habanalabs/Kconfig             |  1 -
 drivers/misc/habanalabs/common/habanalabs.h |  3 +-
 drivers/misc/habanalabs/common/memory.c     | 51 +++++++++------------
 3 files changed, 23 insertions(+), 32 deletions(-)
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
index 8eb5d38c618e..2f04187f7167 100644
--- a/drivers/misc/habanalabs/Kconfig
+++ b/drivers/misc/habanalabs/Kconfig
@@ -6,7 +6,6 @@ config HABANA_AI
 	tristate "HabanaAI accelerators (habanalabs)"
 	depends on PCI && HAS_IOMEM
-	select FRAME_VECTOR
 	select DMA_SHARED_BUFFER
 	select GENERIC_ALLOCATOR
 	select HWMON
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index edbd627b29d2..c1b3ad613b15 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -881,7 +881,8 @@ struct hl_ctx_mgr {
 struct hl_userptr {
 	enum vm_type_t		vm_type; /* must be first */
 	struct list_head	job_node;
-	struct frame_vector	*vec;
+	struct page		**pages;
+	unsigned int		npages;
 	struct sg_table		*sgt;
 	enum dma_data_direction dir;
 	struct list_head	debugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index 5ff4688683fd..ef89cfa2f95a 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size,
 		return -EFAULT;
 	}

-	userptr->vec = frame_vector_create(npages);
-	if (!userptr->vec) {
+	userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
+					GFP_KERNEL);
+	if (!userptr->pages) {
 		dev_err(hdev->dev, "Failed to create frame vector\n");
 		return -ENOMEM;
 	}

-	rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
-			      userptr->vec);
+	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+				 userptr->pages);

 	if (rc != npages) {
 		dev_err(hdev->dev,
 			"Failed to map host memory, user ptr probably wrong\n");
 		if (rc < 0)
-			goto destroy_framevec;
+			goto destroy_pages;
+		npages = rc;
 		rc = -EFAULT;
-		goto put_framevec;
-	}
-
-	if (frame_vector_to_pages(userptr->vec) < 0) {
-		dev_err(hdev->dev,
-			"Failed to translate frame vector to pages\n");
-		rc = -EFAULT;
-		goto put_framevec;
+		goto put_pages;
 	}
+	userptr->npages = npages;

 	rc = sg_alloc_table_from_pages(userptr->sgt,
-				       frame_vector_pages(userptr->vec),
-				       npages, offset, size, GFP_ATOMIC);
+				       userptr->pages,
+				       npages, offset, size, GFP_ATOMIC);
 	if (rc < 0) {
 		dev_err(hdev->dev, "failed to create SG table from pages\n");
-		goto put_framevec;
+		goto put_pages;
 	}

 	return 0;

-put_framevec:
-	put_vaddr_frames(userptr->vec);
-destroy_framevec:
-	frame_vector_destroy(userptr->vec);
+put_pages:
+	unpin_user_pages(userptr->pages, npages);
+destroy_pages:
+	kvfree(userptr->pages);
 	return rc;
 }

@@ -1405,7 +1401,7 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size,
  */
 void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 {
-	struct page **pages;
+	int i;

 	hl_debugfs_remove_userptr(hdev, userptr);

@@ -1414,15 +1410,10 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 			     userptr->sgt->nents, userptr->dir);

-	pages = frame_vector_pages(userptr->vec);
-	if (!IS_ERR(pages)) {
-		int i;
-
-		for (i = 0; i < frame_vector_count(userptr->vec); i++)
-			set_page_dirty_lock(pages[i]);
-	}
-	put_vaddr_frames(userptr->vec);
-	frame_vector_destroy(userptr->vec);
+	for (i = 0; i < userptr->npages; i++)
+		set_page_dirty_lock(userptr->pages[i]);
+	unpin_user_pages(userptr->pages, userptr->npages);
+	kvfree(userptr->pages);

 	list_del(&userptr->job_node);
On 10/7/20 9:44 AM, Daniel Vetter wrote: ...
@@ -1414,15 +1410,10 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr)
 			     userptr->sgt->nents, userptr->dir);

-	pages = frame_vector_pages(userptr->vec);
-	if (!IS_ERR(pages)) {
-		int i;
-
-		for (i = 0; i < frame_vector_count(userptr->vec); i++)
-			set_page_dirty_lock(pages[i]);
-	}
-	put_vaddr_frames(userptr->vec);
-	frame_vector_destroy(userptr->vec);
+	for (i = 0; i < userptr->npages; i++)
+		set_page_dirty_lock(userptr->pages[i]);
+	unpin_user_pages(userptr->pages, userptr->npages);
+	kvfree(userptr->pages);
Same thing here as in patch 1: you can further simplify by using unpin_user_pages_dirty_lock().
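That is, the same untested two-liner as in patch 1, with this patch's names:

	/* dirty + unpin in one call, then free the pages array */
	unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true);
	kvfree(userptr->pages);

and the int i local here becomes unused as well.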
list_del(&userptr->job_node);
thanks,
These are persistent, not just for the duration of a dma operation.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: Dan Williams dan.j.williams@intel.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Oded Gabbay oded.gabbay@gmail.com
Cc: Omer Shpigelman oshpigelman@habana.ai
Cc: Ofir Bitton obitton@habana.ai
Cc: Tomer Tayar ttayar@habana.ai
Cc: Moti Haimovski mhaimovski@habana.ai
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Pawel Piskorski ppiskorski@habana.ai
---
 drivers/misc/habanalabs/common/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index ef89cfa2f95a..94bef8faa82a 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1288,7 +1288,8 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size,
 		return -ENOMEM;
 	}

-	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+	rc = pin_user_pages_fast(start, npages,
+				 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 				 userptr->pages);

 	if (rc != npages) {
On 10/7/20 9:44 AM, Daniel Vetter wrote:
These are persistent, not just for the duration of a dma operation.
...
Again, from a pin_user_pages_fast() point of view, and not being at all familiar with the habana driver (but their use of this really does seem clearly _LONGTERM!):
Reviewed-by: John Hubbard jhubbard@nvidia.com
thanks,
This is used by media/videobuf2 for persistent dma mappings, not just for a single dma operation and then freed again, so needs FOLL_LONGTERM.
Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to locking issues. Rework the code to pull the pup path out from the mmap_sem critical section as suggested by Jason.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Pawel Osciak pawel@osciak.com
Cc: Marek Szyprowski m.szyprowski@samsung.com
Cc: Kyungmin Park kyungmin.park@samsung.com
Cc: Tomasz Figa tfiga@chromium.org
Cc: Mauro Carvalho Chehab mchehab@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: Dan Williams dan.j.williams@intel.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
---
 mm/frame_vector.c | 36 +++++++++++-------------------------
 1 file changed, 11 insertions(+), 25 deletions(-)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 10f82d5643b6..39db520a51dc 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -38,7 +38,6 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 	struct vm_area_struct *vma;
 	int ret = 0;
 	int err;
-	int locked;

 	if (nr_frames == 0)
 		return 0;
@@ -48,35 +47,22 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,

 	start = untagged_addr(start);

+	ret = pin_user_pages_fast(start, nr_frames,
+				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
+				  (struct page **)(vec->ptrs));
+	if (ret > 0) {
+		vec->got_ref = true;
+		vec->is_pfns = false;
+		goto out_unlocked;
+	}
+
 	mmap_read_lock(mm);
-	locked = 1;
 	vma = find_vma_intersection(mm, start, start + 1);
 	if (!vma) {
 		ret = -EFAULT;
 		goto out;
 	}

-	/*
-	 * While get_vaddr_frames() could be used for transient (kernel
-	 * controlled lifetime) pinning of memory pages all current
-	 * users establish long term (userspace controlled lifetime)
-	 * page pinning. Treat get_vaddr_frames() like
-	 * get_user_pages_longterm() and disallow it for filesystem-dax
-	 * mappings.
-	 */
-	if (vma_is_fsdax(vma)) {
-		ret = -EOPNOTSUPP;
-		goto out;
-	}
-
-	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
-		vec->got_ref = true;
-		vec->is_pfns = false;
-		ret = pin_user_pages_locked(start, nr_frames,
-			gup_flags, (struct page **)(vec->ptrs), &locked);
-		goto out;
-	}
-
 	vec->got_ref = false;
 	vec->is_pfns = true;
 	do {
@@ -101,8 +87,8 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 		vma = find_vma_intersection(mm, start, start + 1);
 	} while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
 out:
-	if (locked)
-		mmap_read_unlock(mm);
+	mmap_read_unlock(mm);
+out_unlocked:
 	if (!ret)
 		ret = -EFAULT;
 	if (ret > 0)
On Wed, Oct 07, 2020 at 06:44:18PM +0200, Daniel Vetter wrote:
-	/*
-	 * While get_vaddr_frames() could be used for transient (kernel
-	 * controlled lifetime) pinning of memory pages all current
-	 * users establish long term (userspace controlled lifetime)
-	 * page pinning. Treat get_vaddr_frames() like
-	 * get_user_pages_longterm() and disallow it for filesystem-dax
-	 * mappings.
-	 */
-	if (vma_is_fsdax(vma)) {
-		ret = -EOPNOTSUPP;
-		goto out;
-	}
-
-	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
-		vec->got_ref = true;
-		vec->is_pfns = false;
-		ret = pin_user_pages_locked(start, nr_frames,
-			gup_flags, (struct page **)(vec->ptrs), &locked);
-		goto out;
-	}
The vm_flags still need to be checked before going into the while loop. If the break is taken then nothing would check vm_flags
Jason
On Wed, Oct 7, 2020 at 6:53 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Oct 07, 2020 at 06:44:18PM +0200, Daniel Vetter wrote:
...
The vm_flags still need to be checked before going into the while loop. If the break is taken then nothing would check vm_flags
Hm right, that's a bit inconsistent. follow_pfn also checks for this, so I think we can just ditch this entirely, both here and in the do {} while () check, simplifying the latter to just while (vma). Well, just make it a real loop with less confusing control flow probably.
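Something like this maybe (untested sketch, inner loop body only roughly from memory; the point is just that follow_pfn() itself rejects anything that isn't VM_IO | VM_PFNMAP, so the duplicated vm_flags checks can go):

	vec->got_ref = false;
	vec->is_pfns = true;
	while (vma) {
		unsigned long *nums = frame_vector_pfns(vec);

		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
			/* follow_pfn() fails on !(VM_IO | VM_PFNMAP) vmas,
			 * so no separate flags check needed */
			err = follow_pfn(vma, start, &nums[ret]);
			if (err)
				goto out;
			start += PAGE_SIZE;
			ret++;
		}
		if (ret >= nr_frames)
			break;
		vma = find_vma_intersection(mm, start, start + 1);
	}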
Or prefer I keep this and touch the code less? -Daniel
On Wed, Oct 07, 2020 at 07:12:24PM +0200, Daniel Vetter wrote:
On Wed, Oct 7, 2020 at 6:53 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Oct 07, 2020 at 06:44:18PM +0200, Daniel Vetter wrote:
...
The vm_flags still need to be checked before going into the while loop. If the break is taken then nothing would check vm_flags
Hm right, that's a bit inconsistent. follow_pfn also checks for this, so I think we can just ditch this entirely, both here and in the do {} while () check, simplifying the latter to just while (vma). Well, just make it a real loop with less confusing control flow probably.
It does read very poorly with the redundant check, especially since I keep forgetting follow_pfn does it too :\
Jason
On 10/7/20 9:44 AM, Daniel Vetter wrote:
This is used by media/videobuf2 for persistent dma mappings, not just for a single dma operation and then freed again, so needs FOLL_LONGTERM.
Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to locking issues. Rework the code to pull the pup path out from the mmap_sem critical section as suggested by Jason.
...
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 10f82d5643b6..39db520a51dc 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -38,7 +38,6 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 	struct vm_area_struct *vma;
 	int ret = 0;
 	int err;
-	int locked;

 	if (nr_frames == 0)
 		return 0;
@@ -48,35 +47,22 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,

 	start = untagged_addr(start);

+	ret = pin_user_pages_fast(start, nr_frames,
+				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
+				  (struct page **)(vec->ptrs));
+	if (ret > 0) {
+		vec->got_ref = true;
+		vec->is_pfns = false;
+		goto out_unlocked;
+	}
This part looks good, and changing to _fast is a potential performance improvement, too.
 	mmap_read_lock(mm);
-	locked = 1;
 	vma = find_vma_intersection(mm, start, start + 1);
 	if (!vma) {
 		ret = -EFAULT;
 		goto out;
 	}

-	/*
-	 * While get_vaddr_frames() could be used for transient (kernel
-	 * controlled lifetime) pinning of memory pages all current
-	 * users establish long term (userspace controlled lifetime)
-	 * page pinning. Treat get_vaddr_frames() like
-	 * get_user_pages_longterm() and disallow it for filesystem-dax
-	 * mappings.
-	 */
-	if (vma_is_fsdax(vma)) {
-		ret = -EOPNOTSUPP;
-		goto out;
-	}
Are you sure we don't need to check vma_is_fsdax() anymore?
-	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
-		vec->got_ref = true;
-		vec->is_pfns = false;
-		ret = pin_user_pages_locked(start, nr_frames,
-			gup_flags, (struct page **)(vec->ptrs), &locked);
-		goto out;
-	}
-
 	vec->got_ref = false;
 	vec->is_pfns = true;
 	do {
@@ -101,8 +87,8 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 		vma = find_vma_intersection(mm, start, start + 1);
 	} while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
 out:
-	if (locked)
-		mmap_read_unlock(mm);
+	mmap_read_unlock(mm);
+out_unlocked:
 	if (!ret)
 		ret = -EFAULT;
 	if (ret > 0)
All of the error handling still looks accurate there.
thanks,
On Wed, Oct 7, 2020 at 11:13 PM John Hubbard jhubbard@nvidia.com wrote:
On 10/7/20 9:44 AM, Daniel Vetter wrote:
...
Are you sure we don't need to check vma_is_fsdax() anymore?
Since FOLL_LONGTERM checks for this and can only return struct page backed memory, and explicitly excludes VM_IO | VM_PFNMAP, I was assuming this is not needed for the follow_pfn path. The get_user_pages_locked this used back then didn't have the same check, hence why it was added (and FOLL_LONGTERM still doesn't work for the _locked versions, as you pointed out on the last round of this discussion).
But now that you're asking, I have no idea whether fsdax vma can also be of VM_IO | VM_PFNMAP type. I'm not seeing that set anywhere in fs/dax.c, but that says nothing :-)
Dan, you added this check originally, do we need it for VM_SPECIAL vmas too?
Thanks, Daniel
It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR symbol from all over the tree (well, just one place: somehow the omap media driver still had this in its Kconfig, despite not using it).
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Pawel Osciak pawel@osciak.com
Cc: Marek Szyprowski m.szyprowski@samsung.com
Cc: Kyungmin Park kyungmin.park@samsung.com
Cc: Tomasz Figa tfiga@chromium.org
Cc: Mauro Carvalho Chehab mchehab@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: Dan Williams dan.j.williams@intel.com
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Daniel Vetter daniel.vetter@ffwll.ch
---
 drivers/media/common/videobuf2/Kconfig     |  1 -
 drivers/media/common/videobuf2/Makefile    |  1 +
 .../media/common/videobuf2}/frame_vector.c |  2 +
 drivers/media/platform/omap/Kconfig        |  1 -
 include/linux/mm.h                         | 42 -------------------
 include/media/videobuf2-core.h             | 42 +++++++++++++++++++
 mm/Kconfig                                 |  3 --
 mm/Makefile                                |  1 -
 8 files changed, 45 insertions(+), 48 deletions(-)
 rename {mm => drivers/media/common/videobuf2}/frame_vector.c (99%)
diff --git a/drivers/media/common/videobuf2/Kconfig b/drivers/media/common/videobuf2/Kconfig
index edbc99ebba87..d2223a12c95f 100644
--- a/drivers/media/common/videobuf2/Kconfig
+++ b/drivers/media/common/videobuf2/Kconfig
@@ -9,7 +9,6 @@ config VIDEOBUF2_V4L2

 config VIDEOBUF2_MEMOPS
 	tristate
-	select FRAME_VECTOR

 config VIDEOBUF2_DMA_CONTIG
 	tristate
diff --git a/drivers/media/common/videobuf2/Makefile b/drivers/media/common/videobuf2/Makefile
index 77bebe8b202f..54306f8d096c 100644
--- a/drivers/media/common/videobuf2/Makefile
+++ b/drivers/media/common/videobuf2/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 videobuf2-common-objs := videobuf2-core.o
+videobuf2-common-objs += frame_vector.o

 ifeq ($(CONFIG_TRACEPOINTS),y)
   videobuf2-common-objs += vb2-trace.o
diff --git a/mm/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
similarity index 99%
rename from mm/frame_vector.c
rename to drivers/media/common/videobuf2/frame_vector.c
index 39db520a51dc..b95f4f371681 100644
--- a/mm/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -8,6 +8,8 @@
 #include <linux/pagemap.h>
 #include <linux/sched.h>

+#include <media/videobuf2-core.h>
+
 /**
  * get_vaddr_frames() - map virtual addresses to pfns
  * @start:	starting user address
diff --git a/drivers/media/platform/omap/Kconfig b/drivers/media/platform/omap/Kconfig
index f73b5893220d..de16de46c0f4 100644
--- a/drivers/media/platform/omap/Kconfig
+++ b/drivers/media/platform/omap/Kconfig
@@ -12,6 +12,5 @@ config VIDEO_OMAP2_VOUT
 	depends on VIDEO_V4L2
 	select VIDEOBUF2_DMA_CONTIG
 	select OMAP2_VRFB if ARCH_OMAP2 || ARCH_OMAP3
-	select FRAME_VECTOR
 	help
 	  V4L2 Display driver support for OMAP2/3 based boards.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16b799a0522c..acd60fbf1a5a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1743,48 +1743,6 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);

-/* Container for pinned pfns / pages */
-struct frame_vector {
-	unsigned int nr_allocated;	/* Number of frames we have space for */
-	unsigned int nr_frames;	/* Number of frames stored in ptrs array */
-	bool got_ref;		/* Did we pin pages by getting page ref? */
-	bool is_pfns;		/* Does array contain pages or pfns? */
-	void *ptrs[];		/* Array of pinned pfns / pages. Use
-				 * pfns_vector_pages() or pfns_vector_pfns()
-				 * for access */
-};
-
-struct frame_vector *frame_vector_create(unsigned int nr_frames);
-void frame_vector_destroy(struct frame_vector *vec);
-int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
-		     unsigned int gup_flags, struct frame_vector *vec);
-void put_vaddr_frames(struct frame_vector *vec);
-int frame_vector_to_pages(struct frame_vector *vec);
-void frame_vector_to_pfns(struct frame_vector *vec);
-
-static inline unsigned int frame_vector_count(struct frame_vector *vec)
-{
-	return vec->nr_frames;
-}
-
-static inline struct page **frame_vector_pages(struct frame_vector *vec)
-{
-	if (vec->is_pfns) {
-		int err = frame_vector_to_pages(vec);
-
-		if (err)
-			return ERR_PTR(err);
-	}
-	return (struct page **)(vec->ptrs);
-}
-
-static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
-{
-	if (!vec->is_pfns)
-		frame_vector_to_pfns(vec);
-	return (unsigned long *)(vec->ptrs);
-}
-
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 		     struct page **pages);
diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h
index bbb3f26fbde9..a2e75ca0334f 100644
--- a/include/media/videobuf2-core.h
+++ b/include/media/videobuf2-core.h
@@ -1254,4 +1254,46 @@ bool vb2_request_object_is_buffer(struct media_request_object *obj);
  */
 unsigned int vb2_request_buffer_cnt(struct media_request *req);

+/* Container for pinned pfns / pages in frame_vector.c */
+struct frame_vector {
+	unsigned int nr_allocated;	/* Number of frames we have space for */
+	unsigned int nr_frames;	/* Number of frames stored in ptrs array */
+	bool got_ref;		/* Did we pin pages by getting page ref? */
+	bool is_pfns;		/* Does array contain pages or pfns? */
+	void *ptrs[];		/* Array of pinned pfns / pages. Use
+				 * pfns_vector_pages() or pfns_vector_pfns()
+				 * for access */
+};
+
+struct frame_vector *frame_vector_create(unsigned int nr_frames);
+void frame_vector_destroy(struct frame_vector *vec);
+int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
+		     unsigned int gup_flags, struct frame_vector *vec);
+void put_vaddr_frames(struct frame_vector *vec);
+int frame_vector_to_pages(struct frame_vector *vec);
+void frame_vector_to_pfns(struct frame_vector *vec);
+
+static inline unsigned int frame_vector_count(struct frame_vector *vec)
+{
+	return vec->nr_frames;
+}
+
+static inline struct page **frame_vector_pages(struct frame_vector *vec)
+{
+	if (vec->is_pfns) {
+		int err = frame_vector_to_pages(vec);
+
+		if (err)
+			return ERR_PTR(err);
+	}
+	return (struct page **)(vec->ptrs);
+}
+
+static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
+{
+	if (!vec->is_pfns)
+		frame_vector_to_pfns(vec);
+	return (unsigned long *)(vec->ptrs);
+}
+
 #endif /* _MEDIA_VIDEOBUF2_CORE_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..da6c943fe9f1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -815,9 +815,6 @@ config DEVICE_PRIVATE
 	  memory; i.e., memory that is only accessible from the device (or
 	  group of devices). You likely also want to select HMM_MIRROR.

-config FRAME_VECTOR
-	bool
-
 config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..a025fd6c6afd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -111,7 +111,6 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
-obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
On 10/7/20 9:44 AM, Daniel Vetter wrote:
It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR symbol from all over the tree (well just one place, somehow omap media driver still had this in its Kconfig, despite not using it).
...
Failed to spot any problems here. :)
Reviewed-by: John Hubbard jhubbard@nvidia.com
thanks,
Way back it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since ioremap might need to manipulate pagetables too we need to drop the pt lock and have a retry loop if we raced.
While at it, also add kerneldoc and improve the comment for the vm_ops->access function. It's for accessing, not for moving the memory from iomem to system memory, as the old comment seemed to suggest.
References: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Dan Williams dan.j.williams@intel.com
Cc: Kees Cook keescook@chromium.org
Cc: Rik van Riel riel@redhat.com
Cc: Benjamin Herrenschmidt benh@kernel.crashing.org
Cc: Dave Airlie airlied@linux.ie
Cc: Hugh Dickins hugh@veritas.com
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
---
 include/linux/mm.h |  3 ++-
 mm/memory.c        | 44 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index acd60fbf1a5a..2a16631c1fda 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -566,7 +566,8 @@ struct vm_operations_struct {
 	vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf);

 	/* called by access_process_vm when get_user_pages() fails, typically
-	 * for use by special VMAs that can switch between memory and hardware
+	 * for use by special VMAs. See also generic_access_phys() for a generic
+	 * implementation useful for any iomem mapping.
 	 */
 	int (*access)(struct vm_area_struct *vma, unsigned long addr,
 		      void *buf, int len, int write);
diff --git a/mm/memory.c b/mm/memory.c
index fcfc4ca36eba..8d467e23b44e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma,
 	return ret;
 }

+/**
+ * generic_access_phys - generic implementation for iomem mmap access
+ * @vma: the vma to access
+ * @addr: userspace address, not relative offset within @vma
+ * @buf: buffer to read/write
+ * @len: length of transfer
+ * @write: set to FOLL_WRITE when writing, otherwise reading
+ *
+ * This is a generic implementation for &vm_operations_struct.access for an
+ * iomem mapping. This callback is used by access_process_vm() when the @vma is
+ * not page based.
+ */
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 			void *buf, int len, int write)
 {
 	resource_size_t phys_addr;
 	unsigned long prot = 0;
 	void __iomem *maddr;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
 	int offset = addr & (PAGE_SIZE-1);
+	int ret = -EINVAL;
+
+	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+		return -EINVAL;
+
+retry:
+	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
+		return -EINVAL;
+	pte = *ptep;
+	pte_unmap_unlock(ptep, ptl);

-	if (follow_phys(vma, addr, write, &prot, &phys_addr))
+	prot = pgprot_val(pte_pgprot(pte));
+	phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
+
+	if ((write & FOLL_WRITE) && !pte_write(pte))
 		return -EINVAL;

 	maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
 	if (!maddr)
 		return -ENOMEM;

+	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
+		goto out_unmap;
+
+	if (!pte_same(pte, *ptep)) {
+		pte_unmap_unlock(ptep, ptl);
+		iounmap(maddr);
+
+		goto retry;
+	}
+
 	if (write)
 		memcpy_toio(maddr + offset, buf, len);
 	else
 		memcpy_fromio(buf, maddr + offset, len);
+	ret = len;
+	pte_unmap_unlock(ptep, ptl);
+out_unmap:
 	iounmap(maddr);

-	return len;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
On Wed, Oct 07, 2020 at 06:44:20PM +0200, Daniel Vetter wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
contiguous dma allocations have moved from dedicated carvetouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since ioremap might need to manipulate pagetables too we need to drop the pt lock and have a retry loop if we raced.
While at it, also add kerneldoc and improve the comment for the vma_ops->access function. It's for accessing, not for moving the memory from iomem to system memory, as the old comment seemed to suggest.
References: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Rik van Riel riel@redhat.com Cc: Benjamin Herrensmidt benh@kernel.crashing.org Cc: Dave Airlie airlied@linux.ie Cc: Hugh Dickins hugh@veritas.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com
include/linux/mm.h | 3 ++- mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 44 insertions(+), 3 deletions(-)
This does seem to solve the race with revoke_devmem(), but it is really ugly.
It would be much nicer to wrap a rwsem around this access and the unmap.
Any place using it has a nice linear translation from vm_off to pfn, so I don't think there is such a good reason to use follow_pte in the first place.
ie why not have the helper be this:
int generic_access_phys(unsigned long pfn, unsigned long pgprot, void *buf, size_t len, bool write)
Then something like dev/mem would compute pfn and obtain the lock:
dev_access(struct vm_area_struct *vma, unsigned long addr, void *buf,
	   int len, int write)
{
	cpu_addr = vma->vm_pgoff*PAGE_SIZE + (addr - vma->vm_start);

	/* FIXME: Has to be over each page of len */
	if (!devmem_is_allowed_access(PHYS_PFN(cpu_addr/4096)))
		return -EPERM;

	down_read(&mem_sem);
	generic_access_phys(cpu_addr/4096, pgprot_val(vma->vm_page_prot),
			    buf, len, write);
	up_read(&mem_sem);
}
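The FIXME could be resolved with a per-page loop along these lines (just a sketch, using the existing devmem_is_allowed() helper; cpu_addr and len as in the pseudocode above):

	unsigned long off;

	/* check every page the transfer touches, not only the first one */
	for (off = 0; off < len; off += PAGE_SIZE) {
		if (!devmem_is_allowed(PHYS_PFN(cpu_addr + off)))
			return -EPERM;
	}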
The other cases looked simpler because they don't revoke; there the mmap_sem alone should be enough protection, they would just need to provide the linear translation to pfn.
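For the revoking case the pairing would be roughly this (sketch only; mem_sem is the rwsem from the pseudocode above, dev_revoke is a made-up name):

	static DECLARE_RWSEM(mem_sem);

	void dev_revoke(struct address_space *mapping, struct resource *res)
	{
		/* the writer side excludes all accessors before the ptes are
		 * zapped, so no stale pfn can be used after the unmap */
		down_write(&mem_sem);
		unmap_mapping_range(mapping, res->start, resource_size(res), 1);
		up_write(&mem_sem);
	}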
What do you think?
Jason
On Wed, Oct 7, 2020 at 7:27 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Oct 07, 2020 at 06:44:20PM +0200, Daniel Vetter wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
contiguous dma allocations have moved from dedicated carvetouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since ioremap might need to manipulate pagetables too we need to drop the pt lock and have a retry loop if we raced.
While at it, also add kerneldoc and improve the comment for the vma_ops->access function. It's for accessing, not for moving the memory from iomem to system memory, as the old comment seemed to suggest.
References: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Rik van Riel riel@redhat.com Cc: Benjamin Herrensmidt benh@kernel.crashing.org Cc: Dave Airlie airlied@linux.ie Cc: Hugh Dickins hugh@veritas.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com
include/linux/mm.h | 3 ++- mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 44 insertions(+), 3 deletions(-)
This does seem to solve the race with revoke_devmem(), but it is really ugly.
It would be much nicer to wrap a rwsem around this access and the unmap.
Any place using it has a nice linear translation from vm_off to pfn, so I don't think there is a such a good reason to use follow_pte in the first place.
ie why not the helper be this:
int generic_access_phys(unsigned long pfn, unsigned long pgprot, void *buf, size_t len, bool write)
Then something like dev/mem would compute pfn and obtain the lock:
dev_access(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write) { cpu_addr = vma->vm_pgoff*PAGE_SIZE + (addr - vma->vm_start));
/* FIXME: Has to be over each page of len */ if (!devmem_is_allowed_access(PHYS_PFN(cpu_addr/4096))) return -EPERM; down_read(&mem_sem); generic_access_phys(cpu_addr/4096, pgprot_val(vma->vm_page_prot), buf, len, write); up_read(&mem_sem);
}
The other cases looked simpler because they don't revoke, here the mmap_sem alone should be enough protection, they would just need to provide the linear translation to pfn.
What do you think?
I think it'd fix the bug, until someone wires ->access up for drivers/gpu, or the next subsystem. This is also just for ptrace, so we really don't care if we stall the vm badly or do other silly things. So I figured the somewhat ugly, but fully generic solution is the better one, so that people who want to be able to ptrace read/write their iomem mmaps can just sprinkle this wherever they feel like.
But yeah, if we go with the most minimal fix, i.e. only trying to fix the current users, then your thing should work and is simpler. But it leaves the door open for future problems. -Daniel
On Wed, Oct 07, 2020 at 08:01:42PM +0200, Daniel Vetter wrote:
I think it'd fix the bug, until someone wires ->access up for drivers/gpu, or the next subsystem. This is also just for ptrace, so we really don't care when we stall the vm badly and other silly things. So I figured the somewhat ugly, but full generic solution is the better one, so that people who want to be able to ptrace read/write their iomem mmaps can just sprinkle this wherever they feel like.
But yeah if we go with most minimal fix, i.e. only trying to fix the current users, then your thing should work and is simpler. But it leaves the door open for future problems.
The only other idea I had was to turn the 'vma of __iomem memory' into a fully generic utility that completely takes over the vm_ops.
We did something like this in RDMA; what I found was that even just implementing mmap() using the kernel helpers turned out to be pretty tricky, and many drivers did it wrong in small ways.
Jason
On 10/7/20 9:44 AM, Daniel Vetter wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
contiguous dma allocations have moved from dedicated carvetouts to
s/carvetouts/carveouts/
cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Thanks for putting these references into the log, it's very helpful. ...
diff --git a/mm/memory.c b/mm/memory.c index fcfc4ca36eba..8d467e23b44e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma, return ret; }
+/**
- generic_access_phys - generic implementation for iomem mmap access
- @vma: the vma to access
- @addr: userspace addres, not relative offset within @vma
- @buf: buffer to read/write
- @len: length of transfer
- @write: set to FOLL_WRITE when writing, otherwise reading
- This is a generic implementation for &vm_operations_struct.access for an
- iomem mapping. This callback is used by access_process_vm() when the @vma is
- not page based.
- */ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write) { resource_size_t phys_addr; unsigned long prot = 0; void __iomem *maddr;
- pte_t *ptep, pte;
- spinlock_t *ptl; int offset = addr & (PAGE_SIZE-1);
- int ret = -EINVAL;
- if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
return -EINVAL;
+retry:
- if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
return -EINVAL;
- pte = *ptep;
- pte_unmap_unlock(ptep, ptl);
- if (follow_phys(vma, addr, write, &prot, &phys_addr))
prot = pgprot_val(pte_pgprot(pte));
phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
if ((write & FOLL_WRITE) && !pte_write(pte)) return -EINVAL;
maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot); if (!maddr) return -ENOMEM;
if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
goto out_unmap;
if (pte_same(pte, *ptep)) {
The ioremap area is something I'm sorta new to, so a newbie question: is it possible for the same pte to already be there, ever? If so, we'd be stuck in an infinite loop here. I'm sure that's not the case, but it's not yet obvious to me why it's impossible. Resource reservations maybe?
thanks,
On Thu, Oct 8, 2020 at 2:44 AM John Hubbard jhubbard@nvidia.com wrote:
On 10/7/20 9:44 AM, Daniel Vetter wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
contiguous dma allocations have moved from dedicated carvetouts to
s/carvetouts/carveouts/
cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Thanks for putting these references into the log, it's very helpful. ...
diff --git a/mm/memory.c b/mm/memory.c index fcfc4ca36eba..8d467e23b44e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma, return ret; }
+/**
- generic_access_phys - generic implementation for iomem mmap access
- @vma: the vma to access
- @addr: userspace addres, not relative offset within @vma
- @buf: buffer to read/write
- @len: length of transfer
- @write: set to FOLL_WRITE when writing, otherwise reading
- This is a generic implementation for &vm_operations_struct.access for an
- iomem mapping. This callback is used by access_process_vm() when the @vma is
- not page based.
- */ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write) { resource_size_t phys_addr; unsigned long prot = 0; void __iomem *maddr;
pte_t *ptep, pte;
spinlock_t *ptl; int offset = addr & (PAGE_SIZE-1);
int ret = -EINVAL;
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
return -EINVAL;
+retry:
if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
return -EINVAL;
pte = *ptep;
pte_unmap_unlock(ptep, ptl);
if (follow_phys(vma, addr, write, &prot, &phys_addr))
prot = pgprot_val(pte_pgprot(pte));
phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
if ((write & FOLL_WRITE) && !pte_write(pte)) return -EINVAL; maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot); if (!maddr) return -ENOMEM;
if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
goto out_unmap;
if (pte_same(pte, *ptep)) {
The ioremap area is something I'm sorta new to, so a newbie question: is it possible for the same pte to already be there, ever? If so, we be stuck in an infinite loop here. I'm sure that's not the case, but it's not yet obvious to me why it's impossible. Resource reservations maybe?
It's just buggy, it should be !pte_same. And I need to figure out how to test this I guess. -Daniel
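Concretely, the corrected recheck (an untested sketch of the !pte_same fix Daniel describes, against the generic_access_phys() hunk being discussed) would read:

	if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
		goto out_unmap;

	/* retry only when the pte changed under us, i.e. the mapping got
	 * revoked or moved between the first lookup and the ioremap */
	if (!pte_same(pte, *ptep)) {
		pte_unmap_unlock(ptep, ptl);
		iounmap(maddr);
		goto retry;
	}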
Way back it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since zpci_memcpy_from|toio seems to not do anything nefarious with locks, we just need to open code get_pfn and follow_pfn and make sure we drop the locks only after we're done. The write function also needs the copy_from_user moved up, since we can't take userspace faults while holding the mmap sem.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Dan Williams dan.j.williams@intel.com
Cc: Kees Cook keescook@chromium.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Niklas Schnelle schnelle@linux.ibm.com
Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com
Cc: linux-s390@vger.kernel.org
---
 arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++-----------------
 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index 401cf670a243..4d194cb09372 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst,
 	return rc;
 }
 
-static long get_pfn(unsigned long user_addr, unsigned long access,
-		    unsigned long *pfn)
-{
-	struct vm_area_struct *vma;
-	long ret;
-
-	mmap_read_lock(current->mm);
-	ret = -EINVAL;
-	vma = find_vma(current->mm, user_addr);
-	if (!vma)
-		goto out;
-	ret = -EACCES;
-	if (!(vma->vm_flags & access))
-		goto out;
-	ret = follow_pfn(vma, user_addr, pfn);
-out:
-	mmap_read_unlock(current->mm);
-	return ret;
-}
-
 SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 		const void __user *, user_buffer, size_t, length)
 {
 	u8 local_buf[64];
 	void __iomem *io_addr;
 	void *buf;
-	unsigned long pfn;
+	struct vm_area_struct *vma;
+	pte_t *ptep;
+	spinlock_t *ptl;
 	long ret;
 
 	if (!zpci_is_enabled())
@@ -158,7 +140,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 	 * We only support write access to MIO capable devices if we are on
 	 * a MIO enabled system. Otherwise we would have to check for every
 	 * address if it is a special ZPCI_ADDR and would have to do
-	 * a get_pfn() which we don't need for MIO capable devices. Currently
+	 * a pfn lookup which we don't need for MIO capable devices. Currently
 	 * ISM devices are the only devices without MIO support and there is no
 	 * known need for accessing these from userspace.
 	 */
@@ -176,21 +158,37 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
 	} else
 		buf = local_buf;
 
-	ret = get_pfn(mmio_addr, VM_WRITE, &pfn);
+	ret = -EFAULT;
+	if (copy_from_user(buf, user_buffer, length))
+		goto out_free;
+
+	mmap_read_lock(current->mm);
+	ret = -EINVAL;
+	vma = find_vma(current->mm, mmio_addr);
+	if (!vma)
+		goto out_unlock_mmap;
+	ret = -EACCES;
+	if (!(vma->vm_flags & VM_WRITE))
+		goto out_unlock_mmap;
+	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+		goto out_unlock_mmap;
+
+	ret = follow_pte_pmd(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
 	if (ret)
-		goto out;
-	io_addr = (void __iomem *)((pfn << PAGE_SHIFT) |
+		goto out_unlock_mmap;
+
+	io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
 			(mmio_addr & ~PAGE_MASK));
 
-	ret = -EFAULT;
 	if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE)
-		goto out;
-
-	if (copy_from_user(buf, user_buffer, length))
-		goto out;
+		goto out_unlock_pt;
 
 	ret = zpci_memcpy_toio(io_addr, buf, length);
-out:
+out_unlock_pt:
+	pte_unmap_unlock(ptep, ptl);
+out_unlock_mmap:
+	mmap_read_unlock(current->mm);
+out_free:
 	if (buf != local_buf)
 		kfree(buf);
 	return ret;
@@ -274,7 +272,9 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	u8 local_buf[64];
 	void __iomem *io_addr;
 	void *buf;
-	unsigned long pfn;
+	struct vm_area_struct *vma;
+	pte_t *ptep;
+	spinlock_t *ptl;
 	long ret;
 
 	if (!zpci_is_enabled())
@@ -287,7 +287,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 	 * We only support read access to MIO capable devices if we are on
 	 * a MIO enabled system. Otherwise we would have to check for every
 	 * address if it is a special ZPCI_ADDR and would have to do
-	 * a get_pfn() which we don't need for MIO capable devices. Currently
+	 * a pfn lookup which we don't need for MIO capable devices. Currently
 	 * ISM devices are the only devices without MIO support and there is no
 	 * known need for accessing these from userspace.
 	 */
@@ -306,22 +306,38 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
 		buf = local_buf;
 	}
 
-	ret = get_pfn(mmio_addr, VM_READ, &pfn);
+	mmap_read_lock(current->mm);
+	ret = -EINVAL;
+	vma = find_vma(current->mm, mmio_addr);
+	if (!vma)
+		goto out_unlock_mmap;
+	ret = -EACCES;
+	if (!(vma->vm_flags & VM_WRITE))
+		goto out_unlock_mmap;
+	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+		goto out_unlock_mmap;
+
+	ret = follow_pte_pmd(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl);
 	if (ret)
-		goto out;
-	io_addr = (void __iomem *)((pfn << PAGE_SHIFT) | (mmio_addr & ~PAGE_MASK));
+		goto out_unlock_mmap;
+
+	io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) |
+			(mmio_addr & ~PAGE_MASK));
 
 	if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) {
 		ret = -EFAULT;
-		goto out;
+		goto out_unlock_pt;
 	}
 	ret = zpci_memcpy_fromio(buf, io_addr, length);
-	if (ret)
-		goto out;
-	if (copy_to_user(user_buffer, buf, length))
+
+out_unlock_pt:
+	pte_unmap_unlock(ptep, ptl);
+out_unlock_mmap:
+	mmap_read_unlock(current->mm);
+
+	if (!ret && copy_to_user(user_buffer, buf, length))
 		ret = -EFAULT;
 
-out:
 	if (buf != local_buf)
 		kfree(buf);
 	return ret;
On Wed, 7 Oct 2020 18:44:21 +0200 Daniel Vetter daniel.vetter@ffwll.ch wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carvetouts to
cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since zpci_memcpy_from|toio seems to not do anything nefarious with locks we just need to open code get_pfn and follow_pfn and make sure we drop the locks only after we've done. The write function also needs the copy_from_user move, since we can't take userspace faults while holding the mmap sem.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
Looks good, thanks. Also survived some basic function test. Only some minor nitpick, see below.
Reviewed-by: Gerald Schaefer gerald.schaefer@linux.ibm.com
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..4d194cb09372 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
- struct vm_area_struct *vma;
- long ret;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, user_addr);
- if (!vma)
goto out;
- ret = -EACCES;
- if (!(vma->vm_flags & access))
goto out;
- ret = follow_pfn(vma, user_addr, pfn);
-out:
- mmap_read_unlock(current->mm);
- return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
- unsigned long pfn;
struct vm_area_struct *vma;
pte_t *ptep;
spinlock_t *ptl; long ret;
if (!zpci_is_enabled())
@@ -158,7 +140,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, * We only support write access to MIO capable devices if we are on * a MIO enabled system. Otherwise we would have to check for every * address if it is a special ZPCI_ADDR and would have to do
* a get_pfn() which we don't need for MIO capable devices. Currently
* a pfn lookup which we don't need for MIO capable devices. Currently
*/
- ISM devices are the only devices without MIO support and there is no
- known need for accessing these from userspace.
@@ -176,21 +158,37 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, } else buf = local_buf;
- ret = get_pfn(mmio_addr, VM_WRITE, &pfn);
- ret = -EFAULT;
- if (copy_from_user(buf, user_buffer, length))
goto out_free;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, mmio_addr);
- if (!vma)
goto out_unlock_mmap;
- ret = -EACCES;
- if (!(vma->vm_flags & VM_WRITE))
goto out_unlock_mmap;
- if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
goto out_unlock_mmap;
That check for VM_IO | VM_PFNMAP was previously hidden inside follow_pfn(), and that would have returned -EINVAL in this case. With your change, we now return -EACCES. Not sure how important that is, but it feels wrong. Maybe move the VM_IO | VM_PFNMAP check up, before the ret = -EACCES?
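I.e. something like this (a sketch of the reordering Gerald suggests, for the write path):

	mmap_read_lock(current->mm);
	ret = -EINVAL;
	vma = find_vma(current->mm, mmio_addr);
	if (!vma)
		goto out_unlock_mmap;
	/* checking VM_IO | VM_PFNMAP first keeps the old -EINVAL that
	 * follow_pfn() used to return for such mappings */
	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
		goto out_unlock_mmap;
	ret = -EACCES;
	if (!(vma->vm_flags & VM_WRITE))
		goto out_unlock_mmap;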
[...]
@@ -306,22 +306,38 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, buf = local_buf; }
- ret = get_pfn(mmio_addr, VM_READ, &pfn);
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, mmio_addr);
- if (!vma)
goto out_unlock_mmap;
- ret = -EACCES;
- if (!(vma->vm_flags & VM_WRITE))
goto out_unlock_mmap;
- if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
goto out_unlock_mmap;
Same here with VM_IO | VM_PFNMAP and -EINVAL.
On Thu, Oct 8, 2020 at 6:44 PM Gerald Schaefer gerald.schaefer@linux.ibm.com wrote:
On Wed, 7 Oct 2020 18:44:21 +0200 Daniel Vetter daniel.vetter@ffwll.ch wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carvetouts to
cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since zpci_memcpy_from|toio seems to not do anything nefarious with locks we just need to open code get_pfn and follow_pfn and make sure we drop the locks only after we've done. The write function also needs the copy_from_user move, since we can't take userspace faults while holding the mmap sem.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
Looks good, thanks. Also survived some basic function test. Only some minor nitpick, see below.
Reviewed-by: Gerald Schaefer gerald.schaefer@linux.ibm.com
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..4d194cb09372 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
struct vm_area_struct *vma;
long ret;
mmap_read_lock(current->mm);
ret = -EINVAL;
vma = find_vma(current->mm, user_addr);
if (!vma)
goto out;
ret = -EACCES;
if (!(vma->vm_flags & access))
goto out;
ret = follow_pfn(vma, user_addr, pfn);
-out:
mmap_read_unlock(current->mm);
return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
unsigned long pfn;
struct vm_area_struct *vma;
pte_t *ptep;
spinlock_t *ptl; long ret; if (!zpci_is_enabled())
@@ -158,7 +140,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, * We only support write access to MIO capable devices if we are on * a MIO enabled system. Otherwise we would have to check for every * address if it is a special ZPCI_ADDR and would have to do
* a get_pfn() which we don't need for MIO capable devices. Currently
* a pfn lookup which we don't need for MIO capable devices. Currently * ISM devices are the only devices without MIO support and there is no * known need for accessing these from userspace. */
@@ -176,21 +158,37 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, } else buf = local_buf;
ret = get_pfn(mmio_addr, VM_WRITE, &pfn);
ret = -EFAULT;
if (copy_from_user(buf, user_buffer, length))
goto out_free;
mmap_read_lock(current->mm);
ret = -EINVAL;
vma = find_vma(current->mm, mmio_addr);
if (!vma)
goto out_unlock_mmap;
ret = -EACCES;
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock_mmap;
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
goto out_unlock_mmap;
That check for VM_IO | VM_PFNMAP was previously hidden inside follow_pfn(), and that would have returned -EINVAL in this case. With your change, we now return -EACCES. Not sure how important that is, but it feels wrong. Maybe move the VM_IO | VM_PFNMAP check up, before the ret = -EACCES?
I tried to keep the errno unchanged, but fumbled this. Will fix in the next round, thanks a lot for reviewing and testing.
For merging I think this one here would be best through the s390 tree, since it can be merged without any of the others in here.
Thanks, Daniel
[...]
@@ -306,22 +306,38 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, buf = local_buf; }
ret = get_pfn(mmio_addr, VM_READ, &pfn);
mmap_read_lock(current->mm);
ret = -EINVAL;
vma = find_vma(current->mm, mmio_addr);
if (!vma)
goto out_unlock_mmap;
ret = -EACCES;
if (!(vma->vm_flags & VM_WRITE))
goto out_unlock_mmap;
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
goto out_unlock_mmap;
Same here with VM_IO | VM_PFNMAP and -EINVAL.
There are three ways to access pci bars from userspace: /dev/mem, sysfs files, and the old proc interface. Two of them check against iomem_is_exclusive, but proc never did. And with CONFIG_IO_STRICT_DEVMEM, this starts to matter, since we don't want random userspace having access to pci bars while a driver is loaded and using them.
Fix this.
References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Kees Cook keescook@chromium.org
Cc: Dan Williams dan.j.williams@intel.com
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Bjorn Helgaas bhelgaas@google.com
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/proc.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index d35186b01d98..3a2f90beb4cb 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -274,6 +274,11 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
 		else
 			return -EINVAL;
 	}
+
+	if (dev->resource[i].flags & IORESOURCE_MEM &&
+	    iomem_is_exclusive(dev->resource[i].start))
+		return -EINVAL;
+
 	ret = pci_mmap_page_range(dev, i, vma,
 				  fpriv->mmap_state, write_combine);
 	if (ret < 0)
On Wed, Oct 07, 2020 at 06:44:22PM +0200, Daniel Vetter wrote:
There's three ways to access pci bars from userspace: /dev/mem, sysfs files, and the old proc interface. Two check against iomem_is_exclusive, proc never did. And with CONFIG_IO_STRICT_DEVMEM, this starts to matter, since we don't want random userspace having access to pci bars while a driver is loaded and using it.
Fix this.
Please mention *how* you're fixing this. I know you can sort of deduce it from the first paragraph, but it's easy to save readers the trouble.
s/pci/PCI/ s/bars/BARs/ Capitalize subject to match other patches.
References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges") Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: linux-pci@vger.kernel.org
drivers/pci/proc.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c index d35186b01d98..3a2f90beb4cb 100644 --- a/drivers/pci/proc.c +++ b/drivers/pci/proc.c @@ -274,6 +274,11 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma) else return -EINVAL; }
- if (dev->resource[i].flags & IORESOURCE_MEM &&
iomem_is_exclusive(dev->resource[i].start))
return -EINVAL;
- ret = pci_mmap_page_range(dev, i, vma, fpriv->mmap_state, write_combine); if (ret < 0)
-- 2.28.0
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/mem zaps ptes when the kernel requests exclusive access to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there are two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
Note this only works for ARCH_GENERIC_PCI_MMAP_RESOURCE. But that seems to be a subset of the architectures that support STRICT_DEVMEM, so we should be good.
The only difference in access checks left is that sysfs pci mmap does not check for CAP_RAWIO. But I think that makes some sense compared to /dev/mem and proc, where one file gives you access to everything and no ownership applies.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Kees Cook keescook@chromium.org
Cc: Dan Williams dan.j.williams@intel.com
Cc: Andrew Morton akpm@linux-foundation.org
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jérôme Glisse jglisse@redhat.com
Cc: Jan Kara jack@suse.cz
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Bjorn Helgaas bhelgaas@google.com
Cc: linux-pci@vger.kernel.org
---
 drivers/char/mem.c     | 16 +++++++++++++++-
 drivers/pci/mmap.c     |  3 +++
 include/linux/ioport.h |  2 ++
 3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index abd4ffdc8cde..5e58a326d4ee 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -810,6 +810,7 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 }
 
 static struct inode *devmem_inode;
+static struct vfsmount *devmem_vfs_mount;
 
 #ifdef CONFIG_IO_STRICT_DEVMEM
 void revoke_devmem(struct resource *res)
@@ -843,6 +844,20 @@ void revoke_devmem(struct resource *res)
 
 	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
 }
+
+struct file *devmem_getfile(void)
+{
+	struct file *file;
+
+	file = alloc_file_pseudo(devmem_inode, devmem_vfs_mount, "devmem",
+				 O_RDWR, &kmem_fops);
+	if (IS_ERR(file))
+		return NULL;
+
+	file->f_mapping = devmem_indoe->i_mapping;
+
+	return file;
+}
 #endif
 
 static int open_port(struct inode *inode, struct file *filp)
@@ -1010,7 +1025,6 @@ static struct file_system_type devmem_fs_type = {
 
 static int devmem_init_inode(void)
 {
-	static struct vfsmount *devmem_vfs_mount;
 	static int devmem_fs_cnt;
 	struct inode *inode;
 	int rc;
diff --git a/drivers/pci/mmap.c b/drivers/pci/mmap.c
index b8c9011987f4..63786cc9c746 100644
--- a/drivers/pci/mmap.c
+++ b/drivers/pci/mmap.c
@@ -7,6 +7,7 @@
  * Author: David Woodhouse dwmw2@infradead.org
  */
 
+#include <linux/file.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/pci.h>
@@ -64,6 +65,8 @@ int pci_mmap_resource_range(struct pci_dev *pdev, int bar,
 		vma->vm_pgoff += (pci_resource_start(pdev, bar) >> PAGE_SHIFT);
 
 	vma->vm_ops = &pci_phys_vm_ops;
+	fput(vma->vm_file);
+	vma->vm_file = devmem_getfile();
 
 	return io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
 				  vma->vm_end - vma->vm_start,
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6c2b06fe8beb..83238cba19fe 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -304,8 +304,10 @@ struct resource *request_free_mem_region(struct resource *base,
 
 #ifdef CONFIG_IO_STRICT_DEVMEM
 void revoke_devmem(struct resource *res);
+struct file *devm_getfile(void);
 #else
 static inline void revoke_devmem(struct resource *res) { };
+static inline struct file *devmem_getfile(void) { return NULL; };
 #endif
 
 #endif /* __ASSEMBLY__ */
Capitalize subject, like other patches in this series and previous drivers/pci history.
On Wed, Oct 07, 2020 at 06:44:23PM +0200, Daniel Vetter wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
s/pci/PCI/ in commit logs and comments.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
Note this only works for ARCH_GENERIC_PCI_MMAP_RESOURCE. But that seems to be a subset of architectures support STRICT_DEVMEM, so we should be good.
The only difference in access checks left is that sysfs pci mmap does not check for CAP_RAWIO. But I think that makes some sense compared to /dev/mem and proc, where one file gives you access to everything and no ownership applies.
--- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -810,6 +810,7 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig) }
static struct inode *devmem_inode; +static struct vfsmount *devmem_vfs_mount;
#ifdef CONFIG_IO_STRICT_DEVMEM void revoke_devmem(struct resource *res) @@ -843,6 +844,20 @@ void revoke_devmem(struct resource *res)
unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1); }
+struct file *devmem_getfile(void) +{
- struct file *file;
- file = alloc_file_pseudo(devmem_inode, devmem_vfs_mount, "devmem",
O_RDWR, &kmem_fops);
- if (IS_ERR(file))
return NULL;
- file->f_mapping = devmem_indoe->i_mapping;
"devmem_indoe"? Obviously not compiled, I guess?
--- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -304,8 +304,10 @@ struct resource *request_free_mem_region(struct resource *base,
#ifdef CONFIG_IO_STRICT_DEVMEM void revoke_devmem(struct resource *res); +struct file *devm_getfile(void); #else static inline void revoke_devmem(struct resource *res) { }; +static inline struct file *devmem_getfile(void) { return NULL; };
I guess these names are supposed to match?
#endif
#endif /* __ASSEMBLY__ */
2.28.0
On Wed, Oct 7, 2020 at 8:41 PM Bjorn Helgaas helgaas@kernel.org wrote:
Capitalize subject, like other patches in this series and previous drivers/pci history.
On Wed, Oct 07, 2020 at 06:44:23PM +0200, Daniel Vetter wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
s/pci/PCI/ in commit logs and comments.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
Note this only works for ARCH_GENERIC_PCI_MMAP_RESOURCE. But that seems to be a subset of architectures support STRICT_DEVMEM, so we should be good.
The only difference in access checks left is that sysfs pci mmap does not check for CAP_RAWIO. But I think that makes some sense compared to /dev/mem and proc, where one file gives you access to everything and no ownership applies.
--- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -810,6 +810,7 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig) }
static struct inode *devmem_inode; +static struct vfsmount *devmem_vfs_mount;
#ifdef CONFIG_IO_STRICT_DEVMEM void revoke_devmem(struct resource *res) @@ -843,6 +844,20 @@ void revoke_devmem(struct resource *res)
unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
}
+struct file *devmem_getfile(void) +{
struct file *file;
file = alloc_file_pseudo(devmem_inode, devmem_vfs_mount, "devmem",
O_RDWR, &kmem_fops);
if (IS_ERR(file))
return NULL;
file->f_mapping = devmem_indoe->i_mapping;
"devmem_indoe"? Obviously not compiled, I guess?
Yeah apologies, I forgot to compile this with CONFIG_IO_STRICT_DEVMEM set. The entire series is more an rfc about the overall problem really, and I need to also figure out how to even test this somehow. I guess there's nothing really ready made here? -Daniel
--- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -304,8 +304,10 @@ struct resource *request_free_mem_region(struct resource *base,
#ifdef CONFIG_IO_STRICT_DEVMEM void revoke_devmem(struct resource *res); +struct file *devm_getfile(void); #else static inline void revoke_devmem(struct resource *res) { }; +static inline struct file *devmem_getfile(void) { return NULL; };
I guess these names are supposed to match?
#endif
#endif /* __ASSEMBLY__ */
2.28.0
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
Ooh, yes, let's.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode, which I believe already has only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file; the same problem does not exist for sysfs.
On Wed, Oct 7, 2020 at 9:33 PM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
Ooh, yes, lets.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode which I believe there is already only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file, the same problem does not exist for sysfs.
But then I need to find the right one, plus I also need to find the right one for the procfs side. That gets messy, and I already have no idea how to really test this. Shared address_space is the same trick we're using in drm (where we have multiple things all pointing at the same underlying resources, through different files), and it gets the job done. So that's why I figured the shared address_space is the cleaner solution, since then unmap_mapping_range takes care of iterating over all vmas for us. I guess I could reimplement that logic with our own locking and everything in revoke_devmem, but that feels a bit silly. But it would also solve the problem of having multiple different mknods of /dev/mem with different address_spaces behind them. Also because of how remap_pfn_range works, all these vmas do use the same pgoff already anyway. -Daniel
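Concretely, the trick is what revoke_devmem() in the patch above already relies on: one shared address_space where every vma's vm_pgoff is the physical address offset, so a single call zaps them all:

	/* every vma linked against the shared mapping, with vm_pgoff equal
	 * to the physical address offset, is unmapped in one go */
	unmap_mapping_range(devmem_inode->i_mapping, res->start,
			    resource_size(res), 1);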
On Wed, Oct 7, 2020 at 12:49 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Wed, Oct 7, 2020 at 9:33 PM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
Ooh, yes, lets.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode which I believe there is already only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file, the same problem does not exist for sysfs.
But then I need to find the right one, plus I also need to find the right one for the procfs side. That gets messy, and I already have no idea how to really test this. Shared address_space is the same trick we're using in drm (where we have multiple things all pointing to the same underlying resources, through different files), and it gets the job done. So that's why I figured the shared address_space is the cleaner solution since then unmap_mapping_range takes care of iterating over all vma for us. I guess I could reimplement that logic with our own locking and everything in revoke_devmem, but feels a bit silly. But it would also solve the problem of having mutliple different mknod of /dev/kmem with different address_space behind them. Also because of how remap_pfn_range works, all these vma do use the same pgoff already anyway.
True, remap_pfn_range() makes sure that ->pgoff is an absolute physical address offset for all use cases. So you might be able to just point proc_bus_pci_open() at the shared devmem address space. For sysfs it's messier. I think you would need to somehow get the inode from kernfs_fop_open() to adjust its address space, but only if the bin_file will ultimately be used for PCI memory.
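A rough sketch of the proc side (devmem_mapping() standing in for some accessor to the shared address_space — an assumed helper, not an existing symbol):

	static int proc_bus_pci_open(struct inode *inode, struct file *file)
	{
		/* ... existing fpriv setup stays unchanged ... */

		/* every mmap of this file is then linked into the shared
		 * devmem address_space, so revoke_devmem() can zap it */
		file->f_mapping = devmem_mapping();

		return 0;
	}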
On Wed, Oct 7, 2020 at 3:23 PM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 12:49 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Wed, Oct 7, 2020 at 9:33 PM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
Ooh, yes, lets.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode which I believe there is already only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file, the same problem does not exist for sysfs.
But then I need to find the right one, plus I also need to find the right one for the procfs side. That gets messy, and I already have no idea how to really test this. Shared address_space is the same trick we're using in drm (where we have multiple things all pointing to the same underlying resources, through different files), and it gets the job done. So that's why I figured the shared address_space is the cleaner solution since then unmap_mapping_range takes care of iterating over all vma for us. I guess I could reimplement that logic with our own locking and everything in revoke_devmem, but feels a bit silly. But it would also solve the problem of having mutliple different mknod of /dev/kmem with different address_space behind them. Also because of how remap_pfn_range works, all these vma do use the same pgoff already anyway.
True, remap_pfn_range() makes sure that ->pgoff is an absolute physical address offset for all use cases. So you might be able to just point proc_bus_pci_open() at the shared devmem address space. For sysfs it's messier. I think you would need to somehow get the inode from kernfs_fop_open() to adjust its address space, but only if the bin_file will ultimately be used for PCI memory.
To me this seems like a new sysfs_create_bin_file() flavor that registers the file with the common devmem address_space.
On Thu, Oct 8, 2020 at 12:29 AM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 3:23 PM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 12:49 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Wed, Oct 7, 2020 at 9:33 PM Dan Williams dan.j.williams@intel.com wrote:
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
Ooh, yes, lets.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode which I believe there is already only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file, the same problem does not exist for sysfs.
But then I need to find the right one, plus I also need to find the right one for the procfs side. That gets messy, and I already have no idea how to really test this. Shared address_space is the same trick we're using in drm (where we have multiple things all pointing to the same underlying resources, through different files), and it gets the job done. So that's why I figured the shared address_space is the cleaner solution since then unmap_mapping_range takes care of iterating over all vma for us. I guess I could reimplement that logic with our own locking and everything in revoke_devmem, but feels a bit silly. But it would also solve the problem of having mutliple different mknod of /dev/kmem with different address_space behind them. Also because of how remap_pfn_range works, all these vma do use the same pgoff already anyway.
True, remap_pfn_range() makes sure that ->pgoff is an absolute physical address offset for all use cases. So you might be able to just point proc_bus_pci_open() at the shared devmem address space. For sysfs it's messier. I think you would need to somehow get the inode from kernfs_fop_open() to adjust its address space, but only if the bin_file will ultimately be used for PCI memory.
Just read the code a bit more, and for proc it's impossible. There's only a single file, and before you mmap it you have to call a few ioctls to select the right pci resource on that device you want to mmap. Which includes legacy ioport stuff, and at least for now those don't get revoked (maybe they should, but I'm looking at iomem here now). Setting the mapping too early in ->open means that on architectures which can do ioport as mmaps (not many, but powerpc is among them) we'd shoot down these mmaps too.
Looking at the code there's the generic implementation, which consults pci_iobar_pfn. And the only other implementation for sparc looks similar, they separate iomem vs ioport through different pfn. So I think this should indeed work.
To me this seems like a new sysfs_create_bin_file() flavor that registers the file with the common devmem address_space.
Hm, I think we could just add an i_mapping member to bin_attribute and let the normal open code set that up for us. That should work, an mmapable binary sysfs file is already a similar special case. -Daniel
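Sketch of that idea (the mapping member is hypothetical here, not an existing field of bin_attribute):

	struct bin_attribute {
		struct attribute	attr;
		size_t			size;
		void			*private;
		struct address_space	*mapping;	/* assumed new member */
		/* ... read/write/mmap callbacks as today ... */
	};

	/* the sysfs open code would then just do */
	if (battr->mapping)
		file->f_mapping = battr->mapping;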
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
On Wed, Oct 7, 2020 at 11:11 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/kmem zaps ptes when the kernel requests exclusive acccess to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there's two more ways to access pci bars: sysfs and proc mmap support. Let's plug that hole.
Ooh, yes, lets.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. Usually that's done at ->open time, but that's a bit tricky here with all the entry points and arch code. So instead create a fake file and adjust vma->vm_file.
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode which I believe there is already only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file, the same problem does not exist for sysfs.
The inode does not come from the filesystem; char/mem.c creates a singular anon inode in devmem_init_inode().
Seems OK to use this more widely, but it feels a bit weird to live in char/memory.c.
This is what got me thinking maybe this needs to be a bit bigger generic infrastructure - eg enter this scheme from fops mmap and everything else is in mm/user_iomem.c
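A sketch of that shape (mm/user_iomem.c and iomem_user_mmap() are made-up names; devmem_getfile() is the helper from the patch earlier in this thread):

	/* common entry point drivers would call from their fops->mmap to
	 * get a revokable iomem mapping linked into the shared mapping */
	int iomem_user_mmap(struct vm_area_struct *vma, unsigned long pfn)
	{
		fput(vma->vm_file);
		vma->vm_file = devmem_getfile();
		if (!vma->vm_file)
			return -ENODEV;

		return io_remap_pfn_range(vma, vma->vm_start, pfn,
					  vma->vm_end - vma->vm_start,
					  vma->vm_page_prot);
	}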
Jason
On Thu, Oct 8, 2020 at 1:24 AM Jason Gunthorpe jgg@ziepe.ca wrote:
Seems OK to use this more widely, but it feels a bit weird to live in char/mem.c.
This is what got me thinking maybe this needs to be a bit bigger generic infrastructure - eg enter this scheme from fops mmap and everything else is in mm/user_iomem.c
Yeah moving it to iomem and renaming it to have an iomem_prefix instead of devmem sounds like a good idea. -Daniel
On Wed, Oct 7, 2020 at 4:25 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Oct 07, 2020 at 12:33:06PM -0700, Dan Williams wrote:
I don't think you want to share the devmem inode for this, this should be based off the sysfs inode which I believe there is already only one instance per resource. In contrast /dev/mem can have multiple inodes because anyone can just mknod a new character device file, the same problem does not exist for sysfs.
The inode does not come from the filesystem; char/mem.c creates a singular anon inode in devmem_init_inode().
That's not quite right. An inode does come from the filesystem; I just arranged for that inode's i_mapping to be set to a common instance.
Seems OK to use this more widely, but it feels a bit weird to live in char/mem.c.
Sure, now that more users have arrived it should move somewhere common.
This is what got me thinking maybe this needs to be a bit bigger generic infrastructure - eg enter this scheme from fops mmap and everything else is in mm/user_iomem.c
It still requires every file that can map physical memory to have its ->open fop do
	inode->i_mapping = devmem_inode->i_mapping;
	filp->f_mapping = inode->i_mapping;
I don't see how you can centralize that part.
On Thu, Oct 8, 2020 at 9:50 AM Dan Williams dan.j.williams@intel.com wrote:
It still requires every file that can map physical memory to have its ->open fop do
	inode->i_mapping = devmem_inode->i_mapping;
	filp->f_mapping = inode->i_mapping;
I don't see how you can centralize that part.
btw, why are you setting inode->i_mapping? The inode is already published, changing that looks risky. And I don't think it's needed, vma_link() only looks at filp->f_mapping, and in our drm_open() we only set that one. -Daniel
On Thu, Oct 8, 2020 at 1:13 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
btw, why are you setting inode->i_mapping? The inode is already published, changing that looks risky. And I don't think it's needed, vma_link() only looks at filp->f_mapping, and in our drm_open() we only set that one.
I think you're right, it's unnecessary for devmem, but I don't think it's dangerous to do it from the very first open, before anything is using the address space. It's copy-paste from what all the other "shared address space" implementers do, for example block devices in bd_acquire(). However, the rationale for block devices to do it is so that page cache pages can be associated with the address space in the absence of an f_mapping. Without filesystem page writeback to coordinate, I don't see any devmem code paths that would operate on the inode->i_mapping.
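A minimal sketch of that copy-paste pattern, assuming a devmem-style singleton inode is already set up:

	static int iomem_open(struct inode *inode, struct file *filp)
	{
		/* route everything through the common address_space so a
		 * later unmap_mapping_range() on revoke hits this file's
		 * mappings too */
		inode->i_mapping = devmem_inode->i_mapping;
		filp->f_mapping = inode->i_mapping;
		return 0;
	}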
On Thu, Oct 08, 2020 at 12:49:54AM -0700, Dan Williams wrote:
This is what got me thinking maybe this needs to be a bit bigger generic infrastructure - eg enter this scheme from fops mmap and everything else is in mm/user_iomem.c
It still requires every file that can map physical memory to have its ->open fop do
Common infrastructure would have to create a dummy struct file at mmap time with the global inode and attach that to the VMA.
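Perhaps along these lines (a very rough sketch, all names hypothetical; alloc_file_pseudo() is the existing helper for conjuring up such a file):

	static int iomem_generic_mmap(struct file *file, struct vm_area_struct *vma)
	{
		struct file *fake;

		/* dummy file tied to the global iomem inode, so that
		 * vma_link() puts this vma into the common address_space */
		fake = alloc_file_pseudo(iomem_inode, iomem_mnt, "[iomem]",
					 O_RDWR, &iomem_dummy_fops);
		if (IS_ERR(fake))
			return PTR_ERR(fake);

		fput(vma->vm_file);
		vma->vm_file = fake;	/* keeps the alloc_file_pseudo() reference */

		/* ... then hand off to the driver's actual mmap code ... */
		return 0;
	}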
Jason
Way back it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: kvm@vger.kernel.org --- include/linux/mm.h | 2 ++ mm/memory.c | 32 +++++++++++++++++++++++++++++++- mm/nommu.c | 17 +++++++++++++++++ security/Kconfig | 13 +++++++++++++ 4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a16631c1fda..ec8c90928fc9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1653,6 +1653,8 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags, unsigned long *prot, resource_size_t *phys);
 int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,

diff --git a/mm/memory.c b/mm/memory.c
index 8d467e23b44e..8db7ad1c261c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4821,7 +4821,12 @@ EXPORT_SYMBOL(follow_pte_pmd);
  * @address: user virtual address
  * @pfn: location to store found PFN
  *
- * Only IO mappings and raw PFN mappings are allowed.
+ * Only IO mappings and raw PFN mappings are allowed. Note that callers must
+ * ensure coherency with pte updates by using a &mmu_notifier to follow updates.
+ * If this is not feasible, or the access to the @pfn is only very short term,
+ * use follow_pte_pmd() instead and hold the pagetable lock for the duration of
+ * the access instead. Any caller not following these requirements must use
+ * unsafe_follow_pfn() instead.
  *
  * Return: zero and the pfn at @pfn on success, -ve otherwise.
  */
@@ -4844,6 +4849,31 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL(follow_pfn);
+/**
+ * unsafe_follow_pfn - look up PFN at a user virtual address
+ * @vma: memory mapping
+ * @address: user virtual address
+ * @pfn: location to store found PFN
+ *
+ * Only IO mappings and raw PFN mappings are allowed.
+ *
+ * Returns zero and the pfn at @pfn on success, -ve otherwise.
+ */
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+	unsigned long *pfn)
+{
+#ifdef CONFIG_STRICT_FOLLOW_PFN
+	pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
+	return -EINVAL;
+#else
+	WARN_ONCE(1, "unsafe follow_pfn usage\n");
+	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+
+	return follow_pfn(vma, address, pfn);
+#endif
+}
+EXPORT_SYMBOL(unsafe_follow_pfn);
+
 #ifdef CONFIG_HAVE_IOREMAP_PROT
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags,

diff --git a/mm/nommu.c b/mm/nommu.c
index 75a327149af1..3db2910f0d64 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -132,6 +132,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL(follow_pfn);
+/**
+ * unsafe_follow_pfn - look up PFN at a user virtual address
+ * @vma: memory mapping
+ * @address: user virtual address
+ * @pfn: location to store found PFN
+ *
+ * Only IO mappings and raw PFN mappings are allowed.
+ *
+ * Returns zero and the pfn at @pfn on success, -ve otherwise.
+ */
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+	unsigned long *pfn)
+{
+	return follow_pfn(vma, address, pfn);
+}
+EXPORT_SYMBOL(unsafe_follow_pfn);
+
 LIST_HEAD(vmap_area_list);
 void vfree(const void *addr)

diff --git a/security/Kconfig b/security/Kconfig
index 7561f6f99f1d..48945402e103 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -230,6 +230,19 @@ config STATIC_USERMODEHELPER_PATH
 	  If you wish for all usermode helper programs to be disabled,
 	  specify an empty string here (i.e. "").

+config STRICT_FOLLOW_PFN
+	bool "Disable unsafe use of follow_pfn"
+	depends on MMU
+	help
+	  Some functionality in the kernel follows userspace mappings to iomem
+	  ranges in an unsafe manner. Examples include v4l userptr for
+	  zero-copy buffer sharing.
+
+	  If this option is switched on, such access is rejected. Only leave
+	  it disabled if you need to run userspace which requires the unsafe
+	  access.
+
+	  If in doubt, say Y.
+
 source "security/selinux/Kconfig"
 source "security/smack/Kconfig"
 source "security/tomoyo/Kconfig"
On Wed, Oct 07, 2020 at 06:44:24PM +0200, Daniel Vetter wrote:
Way back it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: kvm@vger.kernel.org
include/linux/mm.h | 2 ++ mm/memory.c | 32 +++++++++++++++++++++++++++++++- mm/nommu.c | 17 +++++++++++++++++ security/Kconfig | 13 +++++++++++++ 4 files changed, 63 insertions(+), 1 deletion(-)
Makes sense to me.
I wonder if we could change the original follow_pfn to require the ptep and then lockdep_assert_held() it against the page table lock?
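I.e. something like this, presumably (hypothetical signature, just to illustrate the idea):

	int follow_pfn(struct vm_area_struct *vma, unsigned long address,
		       pte_t *ptep, spinlock_t *ptl, unsigned long *pfn)
	{
		/* caller did the pte walk itself and must still hold the
		 * page table lock that guards ptep */
		lockdep_assert_held(ptl);
		*pfn = pte_pfn(*ptep);
		return 0;
	}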
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+	unsigned long *pfn)
+{
+#ifdef CONFIG_STRICT_FOLLOW_PFN
+	pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
Wonder if we can print something useful here, like the current PID/process name?
+config STRICT_FOLLOW_PFN
+	bool "Disable unsafe use of follow_pfn"
+	depends on MMU
I would probably invert this: CONFIG_ALLOW_UNSAFE_FOLLOW_PFN, default n.
Jason
On Wed, Oct 7, 2020 at 7:36 PM Jason Gunthorpe jgg@ziepe.ca wrote:
Makes sense to me.
I wonder if we could change the original follow_pfn to require the ptep and then lockdep_assert_held() it against the page table lock?
The safe variant with the pagetable lock is follow_pte_pmd. The only way to make follow_pfn safe is if you have an mmu notifier and corresponding retry logic. That is not covered by lockdep (it would splat if we annotate the retry side), so I'm not sure how you'd check for that?
Checking for ptep lock doesn't work here, since the one leftover safe user of this (kvm) doesn't need that at all, because it has the mmu notifier.
Also follow_pte_pmd will splat with lockdep if you get it wrong, since the function leaves you holding the right ptlock when it returns. If you forget to unlock that, lockdep will complain.
So I think we're as good as it gets, since I really have no idea how to make sure follow_pfn callers do have an mmu notifier registered.
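For reference, the retry pattern in question looks roughly like this (a sketch loosely modeled on kvm and the mmu_interval_notifier API; names are illustrative):

	retry:
		seq = mmu_interval_read_begin(&notifier);
		mmap_read_lock(mm);
		ret = follow_pfn(vma, addr, &pfn);
		mmap_read_unlock(mm);
		/* ... prepare access to pfn ... */

		spin_lock(&driver_lock);
		if (mmu_interval_read_retry(&notifier, seq)) {
			/* concurrent invalidation, pfn may be stale */
			spin_unlock(&driver_lock);
			goto retry;
		}
		/* commit; pfn stays valid until the next invalidation */
		spin_unlock(&driver_lock);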
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
+	unsigned long *pfn)
+{
+#ifdef CONFIG_STRICT_FOLLOW_PFN
+	pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
Wonder if we can print something useful here, like the current PID/process name?
Yeah adding comm/pid here makes sense.
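Presumably something like this (sketch):

	pr_info("unsafe follow_pfn usage rejected by %s (pid %d), see CONFIG_STRICT_FOLLOW_PFN\n",
		current->comm, task_pid_nr(current));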
+config STRICT_FOLLOW_PFN
+	bool "Disable unsafe use of follow_pfn"
+	depends on MMU
I would probably invert this CONFIG_ALLOW_UNSAFE_FOLLOW_PFN default n
I've followed the few other CONFIG_STRICT_FOO I've seen, which are all explicit enables and default to "do not break uapi, damn the (security) bugs". Which is I think how this should be done. It is in the security section though, so hopefully competent distros will enable all of this. -Daniel
On Wed, Oct 07, 2020 at 08:10:34PM +0200, Daniel Vetter wrote:
On Wed, Oct 7, 2020 at 7:36 PM Jason Gunthorpe jgg@ziepe.ca wrote:
I wonder if we could change the original follow_pfn to require the ptep and then lockdep_assert_held() it against the page table lock?
The safe variant with the pagetable lock is follow_pte_pmd. The only way to make follow_pfn safe is if you have an mmu notifier and corresponding retry logic. That is not covered by lockdep (it would splat if we annotate the retry side), so I'm not sure how you'd check for that?
Right OK.
Checking for ptep lock doesn't work here, since the one leftover safe user of this (kvm) doesn't need that at all, because it has the mmu notifier.
Ah, so a better name and/or function kdoc for follow_pfn is probably a good idea in this patch as well.
So I think we're as good as it gets, since I really have no idea how to make sure follow_pfn callers do have an mmu notifier registered.
Yah, can't be done. Most mmu notifier users should be using hmm_range_fault anyhow, kvm is really very special here.
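For reference, the hmm_range_fault() flow looks roughly like this (a sketch of the documented pattern, abbreviated):

	struct hmm_range range = {
		.notifier = &interval_sub,
		.start = addr,
		.end = addr + size,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};

	again:
		range.notifier_seq = mmu_interval_read_begin(&interval_sub);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret == -EBUSY)
			goto again;
		if (ret)
			return ret;
		/* take the driver lock, then recheck with
		 * mmu_interval_read_retry() before using the pfns */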
I've followed the few other CONFIG_STRICT_FOO I've seen, which are all explicit enables and default to "do not break uapi, damn the (security) bugs". Which is I think how this should be done. It is in the security section though, so hopefully competent distros will enable all of this.
I thought the strict ones were about more general and less clear-cut security worries, not bugs like this.
This is "allow a user-triggerable use-after-free bug to exist in the kernel".
Jason
On Wed, Oct 7, 2020 at 9:00 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Oct 07, 2020 at 08:10:34PM +0200, Daniel Vetter wrote:
Checking for ptep lock doesn't work here, since the one leftover safe user of this (kvm) doesn't need that at all, because it has the mmu notifier.
Ah, so a better name and/or function kdoc for follow_pfn is probably a good idea in this patch as well.
I did change that already to mention that you need an mmu notifier, and that follow_pte_pmd and unsafe_follow_pfn, respectively, are the alternatives. Do you want more or something else here?
Note that I left the kerneldoc for the nommu.c case unchanged, since without an mmu all bets are off anyway.
So I think we're as good as it gets, since I really have no idea how to make sure follow_pfn callers do have an mmu notifier registered.
Yah, can't be done. Most mmu notifier users should be using hmm_range_fault anyhow, kvm is really very special here.
We could pass an mmu notifier to follow_pfn and check that it has a registration for vma->vm_mm, but that feels like overkill when kvm is the only legit user for this.
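E.g. something like this (hypothetical signature change, just to show the shape of it):

	int follow_pfn(struct vm_area_struct *vma, unsigned long address,
		       unsigned long *pfn, struct mmu_notifier *subscription)
	{
		/* caller must prove it has a notifier on the right mm */
		if (WARN_ON(!subscription || subscription->mm != vma->vm_mm))
			return -EINVAL;
		/* ... existing pte walk ... */
	}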
I've followed the few other CONFIG_STRICT_FOO I've seen, which are all explicit enables and default to "do not break uapi, damn the (security) bugs". Which is I think how this should be done. It is in the security section though, so hopefully competent distros will enable all of this.
I thought the strict ones were about more general and less clear-cut security worries, not bugs like this.
This is "allow a user-triggerable use-after-free bug to exist in the kernel"
Since at most you get at __GFP_MOVABLE stuff I'm not sure you can use this to take over the kernel. Maybe the best way is if you get a gpu pagetable somehow into your pfn and then use that to access arbitrary stuff, but there's still an iommu. I think leveraging this is going to be very tricky, and pretty much has to be device or driver specific somehow. -Daniel
The media model assumes that buffers are all preallocated, so that when a media pipeline is running we never miss a deadline because the buffers aren't allocated or available.
This means we cannot fix the v4l follow_pfn usage through mmu_notifier, without breaking how this all works. The only real fix is to deprecate userptr support for VM_IO | VM_PFNMAP mappings and tell everyone to cut over to dma-buf memory sharing for zerocopy.
userptr for normal memory will keep working as-is.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Pawel Osciak pawel@osciak.com Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Tomasz Figa tfiga@chromium.org Cc: Laurent Dufour ldufour@linux.ibm.com Cc: Vlastimil Babka vbabka@suse.cz Cc: Daniel Jordan daniel.m.jordan@oracle.com Cc: Michel Lespinasse walken@google.com --- drivers/media/common/videobuf2/frame_vector.c | 2 +- drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index b95f4f371681..d56eb6258f09 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -71,7 +71,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 		unsigned long *nums = frame_vector_pfns(vec);
 
 		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
-			err = follow_pfn(vma, start, &nums[ret]);
+			err = unsafe_follow_pfn(vma, start, &nums[ret]);
 			if (err) {
 				if (ret == 0)
 					ret = err;

diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index 52312ce2ba05..821c4a76ab96 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -183,7 +183,7 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	user_address = untagged_baddr;
 
 	while (pages_done < (mem->size >> PAGE_SHIFT)) {
-		ret = follow_pfn(vma, user_address, &this_pfn);
+		ret = unsafe_follow_pfn(vma, user_address, &this_pfn);
 		if (ret)
 			break;
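As an aside, the dma-buf route recommended in the commit message above looks like this from the V4L2 userspace side (standard uapi, abbreviated):

	struct v4l2_requestbuffers req = {
		.count  = 4,
		.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_DMABUF,
	};
	ioctl(fd, VIDIOC_REQBUFS, &req);

	struct v4l2_buffer buf = {
		.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
		.memory = V4L2_MEMORY_DMABUF,
		.index  = 0,
		.m.fd   = dmabuf_fd,	/* fd exported by the gpu/allocator */
	};
	ioctl(fd, VIDIOC_QBUF, &buf);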
The code seems to stuff these pfns into iommu pts (or something like that, I didn't follow), but there's no mmu_notifier to ensure that access is synchronized with pte updates.
Hence mark these as unsafe. This means that with CONFIG_STRICT_FOLLOW_PFN, these will be rejected.
Real fix is to wire up an mmu_notifier ... somehow. Probably means any invalidate is a fatal fault for this vfio device, but then this shouldn't ever happen if userspace is reasonable.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Alex Williamson alex.williamson@redhat.com Cc: Cornelia Huck cohuck@redhat.com Cc: kvm@vger.kernel.org --- drivers/vfio/vfio_iommu_type1.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5fbf0c1f7433..a4d53f3d0a35 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -421,7 +421,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 {
 	int ret;
 
-	ret = follow_pfn(vma, vaddr, pfn);
+	ret = unsafe_follow_pfn(vma, vaddr, pfn);
 	if (ret) {
 		bool unlocked = false;
@@ -435,7 +435,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 		if (ret)
 			return ret;
 
-		ret = follow_pfn(vma, vaddr, pfn);
+		ret = unsafe_follow_pfn(vma, vaddr, pfn);
 	}
 
 	return ret;
On Wed, Oct 07, 2020 at 06:44:26PM +0200, Daniel Vetter wrote:
The code seems to stuff these pfns into iommu pts (or something like that, I didn't follow), but there's no mmu_notifier to ensure that access is synchronized with pte updates.
Hence mark these as unsafe. This means that with CONFIG_STRICT_FOLLOW_PFN, these will be rejected.
Real fix is to wire up an mmu_notifier ... somehow. Probably means any invalidate is a fatal fault for this vfio device, but then this shouldn't ever happen if userspace is reasonable.
This is actually being commonly used, so it needs fixing.
When I last talked to Alex about this we had worked out a patch series that adds a vm_ops check, so the vma is verified to have come from vfio in the first place. The VMAs created by VFIO are 'safe' as the PTEs are never changed.
Jason
On Wed, Oct 7, 2020 at 7:39 PM Jason Gunthorpe jgg@ziepe.ca wrote:
This is actually being commonly used, so it needs fixing.
When I last talked to Alex about this we had worked out a patch series that adds a vm_ops check, so the vma is verified to have come from vfio in the first place. The VMAs created by VFIO are 'safe' as the PTEs are never changed.
Hm, but wouldn't that need the semi-nasty vma_open trick to make sure the vma doesn't untimely disappear? Or is the idea to look up the underlying vfio object, and refcount that directly? -Daniel
On Wed, Oct 07, 2020 at 08:14:06PM +0200, Daniel Vetter wrote:
Hm, but wouldn't that need the semi-nasty vma_open trick to make sure the vma doesn't untimely disappear? Or is the idea to look up the underlying vfio object, and refcount that directly?
Ah, the patches Alex was working on had the refcount I think; it does need coordination across multiple VFIO instances IIRC.
At least a simple check would guarantee we only have exposed PCI BAR pages, which is not as bad security-wise as the other stuff.
Jason
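A minimal version of that check might look like this (sketch; it assumes vfio-pci installs and exports a vm_ops for its BAR mappings, which is part of the series being discussed, not current code):

	static bool vma_is_vfio_pci(struct vm_area_struct *vma)
	{
		return vma->vm_ops == &vfio_pci_mmap_ops;
	}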