Hi all,
Round two of my patch series to clamp down a bunch of races and gaps around follow_pfn and other access to iomem mmaps. Previous version:
v1: https://lore.kernel.org/dri-devel/20201007164426.1812530-1-daniel.vetter@ffw...
And the discussion that sparked this journey:
https://lore.kernel.org/dri-devel/20201007164426.1812530-1-daniel.vetter@ffw...
Changes in v2:
- tons of small polish & fixes all over, thanks to all the reviewers who spotted issues
- I managed to test at least the generic_access_phys and pci mmap revoke stuff with a few gdb sessions using our i915 debug tools (hence now also the drm/i915 patch to properly request all the pci bar regions)
- reworked approach for the pci mmap revoke: infrastructure moved into kernel/resource.c, the address_space mapping is now set up at open time for everyone (which required some sysfs changes). Does indeed look a lot cleaner and a lot less invasive than I feared at first.
The big thing I can't test is all the frame_vector changes in habanalabs, exynos and media. Gerald has given the s390 patch a spin already.
Review, testing, feedback all very much welcome.
Cheers, Daniel
Daniel Vetter (17):
  drm/exynos: Stop using frame_vector helpers
  drm/exynos: Use FOLL_LONGTERM for g2d cmdlists
  misc/habana: Stop using frame_vector helpers
  misc/habana: Use FOLL_LONGTERM for userptr
  mm/frame-vector: Use FOLL_LONGTERM
  media: videobuf2: Move frame_vector into media subsystem
  mm: Close race in generic_access_phys
  s390/pci: Remove races against pte updates
  mm: Add unsafe_follow_pfn
  media/videbuf1|2: Mark follow_pfn usage as unsafe
  vfio/type1: Mark follow_pfn as unsafe
  PCI: Obey iomem restrictions for procfs mmap
  /dev/mem: Only set filp->f_mapping
  resource: Move devmem revoke code to resource framework
  sysfs: Support zapping of binary attr mmaps
  PCI: Revoke mappings like devmem
  drm/i915: Properly request PCI BARs
 arch/s390/pci/pci_mmio.c                      | 98 +++++++++++--------
 drivers/char/mem.c                            | 86 +---------------
 drivers/gpu/drm/exynos/Kconfig                |  1 -
 drivers/gpu/drm/exynos/exynos_drm_g2d.c       | 48 ++++-----
 drivers/gpu/drm/i915/intel_uncore.c           | 25 ++++-
 drivers/media/common/videobuf2/Kconfig        |  1 -
 drivers/media/common/videobuf2/Makefile       |  1 +
 .../media/common/videobuf2}/frame_vector.c    | 54 ++++------
 drivers/media/platform/omap/Kconfig           |  1 -
 drivers/media/v4l2-core/videobuf-dma-contig.c |  2 +-
 drivers/misc/habanalabs/Kconfig               |  1 -
 drivers/misc/habanalabs/common/habanalabs.h   |  3 +-
 drivers/misc/habanalabs/common/memory.c       | 50 ++++------
 drivers/pci/pci-sysfs.c                       |  4 +
 drivers/pci/proc.c                            |  6 ++
 drivers/vfio/vfio_iommu_type1.c               |  4 +-
 fs/sysfs/file.c                               | 11 +++
 include/linux/ioport.h                        |  6 +-
 include/linux/mm.h                            | 47 +-------
 include/linux/sysfs.h                         |  2 +
 include/media/videobuf2-core.h                | 42 ++++++++
 kernel/resource.c                             | 95 +++++++++++++++++-
 mm/Kconfig                                    |  3 -
 mm/Makefile                                   |  1 -
 mm/memory.c                                   | 76 +++++++++++++-
 mm/nommu.c                                    | 17 ++++
 security/Kconfig                              | 13 +++
 27 files changed, 412 insertions(+), 286 deletions(-)
 rename {mm => drivers/media/common/videobuf2}/frame_vector.c (85%)
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Kukjin Kim <kgene@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
--
v2: Use unpin_user_pages_dirty_lock (John)
---
 drivers/gpu/drm/exynos/Kconfig          |  1 -
 drivers/gpu/drm/exynos/exynos_drm_g2d.c | 47 +++++++++++--------------
 2 files changed, 20 insertions(+), 28 deletions(-)
diff --git a/drivers/gpu/drm/exynos/Kconfig b/drivers/gpu/drm/exynos/Kconfig index 6417f374b923..43257ef3c09d 100644 --- a/drivers/gpu/drm/exynos/Kconfig +++ b/drivers/gpu/drm/exynos/Kconfig @@ -88,7 +88,6 @@ comment "Sub-drivers" config DRM_EXYNOS_G2D bool "G2D" depends on VIDEO_SAMSUNG_S5P_G2D=n || COMPILE_TEST - select FRAME_VECTOR help Choose this option if you want to use Exynos G2D for DRM.
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c index 967a5cdc120e..ecede41af9b9 100644 --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c @@ -205,7 +205,8 @@ struct g2d_cmdlist_userptr { dma_addr_t dma_addr; unsigned long userptr; unsigned long size; - struct frame_vector *vec; + struct page **pages; + unsigned int npages; struct sg_table *sgt; atomic_t refcount; bool in_pool; @@ -378,7 +379,6 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d, bool force) { struct g2d_cmdlist_userptr *g2d_userptr = obj; - struct page **pages;
if (!obj) return; @@ -398,15 +398,9 @@ static void g2d_userptr_put_dma_addr(struct g2d_data *g2d, dma_unmap_sgtable(to_dma_dev(g2d->drm_dev), g2d_userptr->sgt, DMA_BIDIRECTIONAL, 0);
- pages = frame_vector_pages(g2d_userptr->vec); - if (!IS_ERR(pages)) { - int i; - - for (i = 0; i < frame_vector_count(g2d_userptr->vec); i++) - set_page_dirty_lock(pages[i]); - } - put_vaddr_frames(g2d_userptr->vec); - frame_vector_destroy(g2d_userptr->vec); + unpin_user_pages_dirty_lock(g2d_userptr->pages, g2d_userptr->npages, + true); + kvfree(g2d_userptr->pages);
if (!g2d_userptr->out_of_list) list_del_init(&g2d_userptr->list); @@ -474,35 +468,34 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d, offset = userptr & ~PAGE_MASK; end = PAGE_ALIGN(userptr + size); npages = (end - start) >> PAGE_SHIFT; - g2d_userptr->vec = frame_vector_create(npages); - if (!g2d_userptr->vec) { + g2d_userptr->pages = kvmalloc_array(npages, sizeof(*g2d_userptr->pages), + GFP_KERNEL); + if (!g2d_userptr->pages) { ret = -ENOMEM; goto err_free; }
- ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE, - g2d_userptr->vec); + ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE, + g2d_userptr->pages); if (ret != npages) { DRM_DEV_ERROR(g2d->dev, "failed to get user pages from userptr.\n"); if (ret < 0) - goto err_destroy_framevec; - ret = -EFAULT; - goto err_put_framevec; - } - if (frame_vector_to_pages(g2d_userptr->vec) < 0) { + goto err_destroy_pages; + npages = ret; ret = -EFAULT; - goto err_put_framevec; + goto err_unpin_pages; } + g2d_userptr->npages = npages;
sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); if (!sgt) { ret = -ENOMEM; - goto err_put_framevec; + goto err_unpin_pages; }
ret = sg_alloc_table_from_pages(sgt, - frame_vector_pages(g2d_userptr->vec), + g2d_userptr->pages, npages, offset, size, GFP_KERNEL); if (ret < 0) { DRM_DEV_ERROR(g2d->dev, "failed to get sgt from pages.\n"); @@ -538,11 +531,11 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d, err_free_sgt: kfree(sgt);
-err_put_framevec: - put_vaddr_frames(g2d_userptr->vec); +err_unpin_pages: + unpin_user_pages(g2d_userptr->pages, npages);
-err_destroy_framevec: - frame_vector_destroy(g2d_userptr->vec); +err_destroy_pages: + kvfree(g2d_userptr->pages);
err_free: kfree(g2d_userptr);
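[Editor's note: stripped of the driver specifics, the conversion in the patch above boils down to the following pin/unpin pattern. This is only a minimal sketch with hypothetical helper names (pin_user_range()/unpin_user_range()), not code from the patch; the sg_table setup is left out.]

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Sketch: pin a userspace range into a plain struct page array. */
static struct page **pin_user_range(unsigned long start, unsigned int npages)
{
	struct page **pages;
	int pinned;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return ERR_PTR(-ENOMEM);

	pinned = pin_user_pages_fast(start, npages,
				     FOLL_FORCE | FOLL_WRITE, pages);
	if (pinned != npages) {
		/* partial pins are treated as failure, like in the patch */
		if (pinned > 0)
			unpin_user_pages(pages, pinned);
		kvfree(pages);
		return ERR_PTR(pinned < 0 ? pinned : -EFAULT);
	}
	return pages;
}

/* Sketch: release the pin, marking pages dirty since the hw wrote to them. */
static void unpin_user_range(struct page **pages, unsigned int npages)
{
	unpin_user_pages_dirty_lock(pages, npages, true);
	kvfree(pages);
}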
On 10/9/20 12:59 AM, Daniel Vetter wrote:
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Inki Dae inki.dae@samsung.com Cc: Joonyoung Shim jy0922.shim@samsung.com Cc: Seung-Woo Kim sw0312.kim@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Kukjin Kim kgene@kernel.org Cc: Krzysztof Kozlowski krzk@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org -- v2: Use unpin_user_pages_dirty_lock (John)
drivers/gpu/drm/exynos/Kconfig | 1 - drivers/gpu/drm/exynos/exynos_drm_g2d.c | 47 +++++++++++-------------- 2 files changed, 20 insertions(+), 28 deletions(-)
Looks good.
Reviewed-by: John Hubbard jhubbard@nvidia.com
thanks,
The exynos g2d interface is very unusual, but it looks like the userptr objects are persistent. Hence they need FOLL_LONGTERM.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Kukjin Kim <kgene@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
---
 drivers/gpu/drm/exynos/exynos_drm_g2d.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index ecede41af9b9..1e0c5a7f206e 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -475,7 +475,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct g2d_data *g2d,
 		goto err_free;
 	}
 
-	ret = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+	ret = pin_user_pages_fast(start, npages,
+				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 				  g2d_userptr->pages);
 	if (ret != npages) {
 		DRM_DEV_ERROR(g2d->dev,
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Omer Shpigelman <oshpigelman@habana.ai>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pawel Piskorski <ppiskorski@habana.ai>
--
v2: Use unpin_user_pages_dirty_lock (John)
---
 drivers/misc/habanalabs/Kconfig             |  1 -
 drivers/misc/habanalabs/common/habanalabs.h |  3 +-
 drivers/misc/habanalabs/common/memory.c     | 49 ++++++++-------------
 3 files changed, 20 insertions(+), 33 deletions(-)
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig index 8eb5d38c618e..2f04187f7167 100644 --- a/drivers/misc/habanalabs/Kconfig +++ b/drivers/misc/habanalabs/Kconfig @@ -6,7 +6,6 @@ config HABANA_AI tristate "HabanaAI accelerators (habanalabs)" depends on PCI && HAS_IOMEM - select FRAME_VECTOR select DMA_SHARED_BUFFER select GENERIC_ALLOCATOR select HWMON diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h index edbd627b29d2..c1b3ad613b15 100644 --- a/drivers/misc/habanalabs/common/habanalabs.h +++ b/drivers/misc/habanalabs/common/habanalabs.h @@ -881,7 +881,8 @@ struct hl_ctx_mgr { struct hl_userptr { enum vm_type_t vm_type; /* must be first */ struct list_head job_node; - struct frame_vector *vec; + struct page **pages; + unsigned int npages; struct sg_table *sgt; enum dma_data_direction dir; struct list_head debugfs_list; diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index 5ff4688683fd..327b64479f97 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, return -EFAULT; }
- userptr->vec = frame_vector_create(npages); - if (!userptr->vec) { + userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages), + GFP_KERNEL); + if (!userptr->pages) { dev_err(hdev->dev, "Failed to create frame vector\n"); return -ENOMEM; }
- rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE, - userptr->vec); + rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE, + userptr->pages);
if (rc != npages) { dev_err(hdev->dev, "Failed to map host memory, user ptr probably wrong\n"); if (rc < 0) - goto destroy_framevec; + goto destroy_pages; + npages = rc; rc = -EFAULT; - goto put_framevec; - } - - if (frame_vector_to_pages(userptr->vec) < 0) { - dev_err(hdev->dev, - "Failed to translate frame vector to pages\n"); - rc = -EFAULT; - goto put_framevec; + goto put_pages; } + userptr->npages = npages;
rc = sg_alloc_table_from_pages(userptr->sgt, - frame_vector_pages(userptr->vec), - npages, offset, size, GFP_ATOMIC); + userptr->pages, + npages, offset, size, GFP_ATOMIC); if (rc < 0) { dev_err(hdev->dev, "failed to create SG table from pages\n"); - goto put_framevec; + goto put_pages; }
return 0;
-put_framevec: - put_vaddr_frames(userptr->vec); -destroy_framevec: - frame_vector_destroy(userptr->vec); +put_pages: + unpin_user_pages(userptr->pages, npages); +destroy_pages: + kvfree(userptr->pages); return rc; }
@@ -1405,8 +1401,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size, */ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) { - struct page **pages; - hl_debugfs_remove_userptr(hdev, userptr);
if (userptr->dma_mapped) @@ -1414,15 +1408,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) userptr->sgt->nents, userptr->dir);
- pages = frame_vector_pages(userptr->vec); - if (!IS_ERR(pages)) { - int i; - - for (i = 0; i < frame_vector_count(userptr->vec); i++) - set_page_dirty_lock(pages[i]); - } - put_vaddr_frames(userptr->vec); - frame_vector_destroy(userptr->vec); + unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true); + kvfree(userptr->pages);
list_del(&userptr->job_node);
On Fri, Oct 9, 2020 at 10:59 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Thanks for the patch Daniel.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Oded Gabbay oded.gabbay@gmail.com Cc: Omer Shpigelman oshpigelman@habana.ai Cc: Ofir Bitton obitton@habana.ai Cc: Tomer Tayar ttayar@habana.ai Cc: Moti Haimovski mhaimovski@habana.ai Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Pawel Piskorski ppiskorski@habana.ai -- v2: Use unpin_user_pages_dirty_lock (John)
drivers/misc/habanalabs/Kconfig | 1 - drivers/misc/habanalabs/common/habanalabs.h | 3 +- drivers/misc/habanalabs/common/memory.c | 49 ++++++++------------- 3 files changed, 20 insertions(+), 33 deletions(-)
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig index 8eb5d38c618e..2f04187f7167 100644 --- a/drivers/misc/habanalabs/Kconfig +++ b/drivers/misc/habanalabs/Kconfig @@ -6,7 +6,6 @@ config HABANA_AI tristate "HabanaAI accelerators (habanalabs)" depends on PCI && HAS_IOMEM
select FRAME_VECTOR select DMA_SHARED_BUFFER select GENERIC_ALLOCATOR select HWMON
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h index edbd627b29d2..c1b3ad613b15 100644 --- a/drivers/misc/habanalabs/common/habanalabs.h +++ b/drivers/misc/habanalabs/common/habanalabs.h @@ -881,7 +881,8 @@ struct hl_ctx_mgr { struct hl_userptr { enum vm_type_t vm_type; /* must be first */ struct list_head job_node;
struct frame_vector *vec;
struct page **pages;
unsigned int npages;
Can you please update the kerneldoc comment section of this structure according to your changes ?
struct sg_table *sgt; enum dma_data_direction dir; struct list_head debugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index 5ff4688683fd..327b64479f97 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, return -EFAULT; }
userptr->vec = frame_vector_create(npages);
if (!userptr->vec) {
userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
GFP_KERNEL);
if (!userptr->pages) { dev_err(hdev->dev, "Failed to create frame vector\n"); return -ENOMEM; }
rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->vec);
rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->pages); if (rc != npages) { dev_err(hdev->dev, "Failed to map host memory, user ptr probably wrong\n"); if (rc < 0)
goto destroy_framevec;
goto destroy_pages;
npages = rc; rc = -EFAULT;
goto put_framevec;
}
if (frame_vector_to_pages(userptr->vec) < 0) {
dev_err(hdev->dev,
"Failed to translate frame vector to pages\n");
rc = -EFAULT;
goto put_framevec;
goto put_pages; }
userptr->npages = npages; rc = sg_alloc_table_from_pages(userptr->sgt,
frame_vector_pages(userptr->vec),
npages, offset, size, GFP_ATOMIC);
userptr->pages,
npages, offset, size, GFP_ATOMIC);
I think that because the call to kvmalloc_array() is done with GFP_KERNEL, there is no point in using GFP_ATOMIC here. And actually, this path only needs to avoid yielding when using a special debug mode. So I suggest using GFP_KERNEL here.
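[Editor's note: concretely, the suggestion amounts to the following call, matching the hunk above (a sketch, not a posted patch):]

	/* this path already pins user pages and may sleep, so a sleeping
	 * allocation is fine here (was GFP_ATOMIC) */
	rc = sg_alloc_table_from_pages(userptr->sgt, userptr->pages,
				       npages, offset, size, GFP_KERNEL);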
In the meanwhile, I'll run this patch (coupled with the next patch) in our C/I to make sure there are no regressions. Thanks, Oded
if (rc < 0) { dev_err(hdev->dev, "failed to create SG table from pages\n");
goto put_framevec;
goto put_pages; } return 0;
-put_framevec:
put_vaddr_frames(userptr->vec);
-destroy_framevec:
frame_vector_destroy(userptr->vec);
+put_pages:
unpin_user_pages(userptr->pages, npages);
+destroy_pages:
kvfree(userptr->pages); return rc;
}
@@ -1405,8 +1401,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size, */ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) {
struct page **pages;
hl_debugfs_remove_userptr(hdev, userptr); if (userptr->dma_mapped)
@@ -1414,15 +1408,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) userptr->sgt->nents, userptr->dir);
pages = frame_vector_pages(userptr->vec);
if (!IS_ERR(pages)) {
int i;
for (i = 0; i < frame_vector_count(userptr->vec); i++)
set_page_dirty_lock(pages[i]);
}
put_vaddr_frames(userptr->vec);
frame_vector_destroy(userptr->vec);
unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true);
kvfree(userptr->pages); list_del(&userptr->job_node);
-- 2.28.0
On Sat, Oct 10, 2020 at 10:27 PM Oded Gabbay oded.gabbay@gmail.com wrote:
On Fri, Oct 9, 2020 at 10:59 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Thanks for the patch Daniel.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Oded Gabbay oded.gabbay@gmail.com Cc: Omer Shpigelman oshpigelman@habana.ai Cc: Ofir Bitton obitton@habana.ai Cc: Tomer Tayar ttayar@habana.ai Cc: Moti Haimovski mhaimovski@habana.ai Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Pawel Piskorski ppiskorski@habana.ai -- v2: Use unpin_user_pages_dirty_lock (John)
drivers/misc/habanalabs/Kconfig | 1 - drivers/misc/habanalabs/common/habanalabs.h | 3 +- drivers/misc/habanalabs/common/memory.c | 49 ++++++++------------- 3 files changed, 20 insertions(+), 33 deletions(-)
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig index 8eb5d38c618e..2f04187f7167 100644 --- a/drivers/misc/habanalabs/Kconfig +++ b/drivers/misc/habanalabs/Kconfig @@ -6,7 +6,6 @@ config HABANA_AI tristate "HabanaAI accelerators (habanalabs)" depends on PCI && HAS_IOMEM
select FRAME_VECTOR select DMA_SHARED_BUFFER select GENERIC_ALLOCATOR select HWMON
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h index edbd627b29d2..c1b3ad613b15 100644 --- a/drivers/misc/habanalabs/common/habanalabs.h +++ b/drivers/misc/habanalabs/common/habanalabs.h @@ -881,7 +881,8 @@ struct hl_ctx_mgr { struct hl_userptr { enum vm_type_t vm_type; /* must be first */ struct list_head job_node;
struct frame_vector *vec;
struct page **pages;
unsigned int npages;
Can you please update the kerneldoc comment section of this structure according to your changes ?
Apologies I missed the nice kerneldoc. I'll fix that in the next round.
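[Editor's note: purely for illustration, the touched-up kerneldoc for the two new fields could look roughly like this. This is a sketch covering only the fields visible in the hunk above, with descriptions inferred from the diff, not the actual v3 change.]

/**
 * struct hl_userptr - describes a user memory allocation pinned by the driver.
 * @vm_type: type of the VM object (must stay the first field).
 * @job_node: node in the list of userptrs of a job.
 * @pages: array of pages pinned with pin_user_pages_fast().
 * @npages: number of entries in @pages.
 * @sgt: scatter-gather table used to DMA-map the pinned pages.
 * @dir: DMA direction of the mapping.
 * @debugfs_list: node in the debugfs list of userptrs.
 */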
struct sg_table *sgt; enum dma_data_direction dir; struct list_head debugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index 5ff4688683fd..327b64479f97 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, return -EFAULT; }
userptr->vec = frame_vector_create(npages);
if (!userptr->vec) {
userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
GFP_KERNEL);
if (!userptr->pages) { dev_err(hdev->dev, "Failed to create frame vector\n"); return -ENOMEM; }
rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->vec);
rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->pages); if (rc != npages) { dev_err(hdev->dev, "Failed to map host memory, user ptr probably wrong\n"); if (rc < 0)
goto destroy_framevec;
goto destroy_pages;
npages = rc; rc = -EFAULT;
goto put_framevec;
}
if (frame_vector_to_pages(userptr->vec) < 0) {
dev_err(hdev->dev,
"Failed to translate frame vector to pages\n");
rc = -EFAULT;
goto put_framevec;
goto put_pages; }
userptr->npages = npages; rc = sg_alloc_table_from_pages(userptr->sgt,
frame_vector_pages(userptr->vec),
npages, offset, size, GFP_ATOMIC);
userptr->pages,
npages, offset, size, GFP_ATOMIC);
I think that because the call to kvmalloc_array() is done with GFP_KERNEL, there is no point in using GFP_ATOMIC here. And actually, this path only needs to avoid yielding when using a special debug mode. So I suggest putting here GFP_KERNEL.
Huh, I didn't even notice the GFP_ATOMIC here. This looks indeed strange and GFP_KERNEL should be perfectly fine in a function that also calls pin_user_pages (since that one can allocate and do worse stuff like userspace pagefaults).
But since that GFP_ATOMIC is there already I'll do that in a separate patch.
In the meanwhile, I'll run this patch (coupled with the next patch) in our C/I to make sure there are no regressions.
Excellent. I'll wait with v3 until that's done, just in case you hit a snag I need to fix.
Cheers, Daniel
Thanks, Oded
if (rc < 0) { dev_err(hdev->dev, "failed to create SG table from pages\n");
goto put_framevec;
goto put_pages; } return 0;
-put_framevec:
put_vaddr_frames(userptr->vec);
-destroy_framevec:
frame_vector_destroy(userptr->vec);
+put_pages:
unpin_user_pages(userptr->pages, npages);
+destroy_pages:
kvfree(userptr->pages); return rc;
}
@@ -1405,8 +1401,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size, */ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) {
struct page **pages;
hl_debugfs_remove_userptr(hdev, userptr); if (userptr->dma_mapped)
@@ -1414,15 +1408,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) userptr->sgt->nents, userptr->dir);
pages = frame_vector_pages(userptr->vec);
if (!IS_ERR(pages)) {
int i;
for (i = 0; i < frame_vector_count(userptr->vec); i++)
set_page_dirty_lock(pages[i]);
}
put_vaddr_frames(userptr->vec);
frame_vector_destroy(userptr->vec);
unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true);
kvfree(userptr->pages); list_del(&userptr->job_node);
-- 2.28.0
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Sat, Oct 10, 2020 at 11:32 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Sat, Oct 10, 2020 at 10:27 PM Oded Gabbay oded.gabbay@gmail.com wrote:
On Fri, Oct 9, 2020 at 10:59 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Thanks for the patch Daniel.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Oded Gabbay oded.gabbay@gmail.com Cc: Omer Shpigelman oshpigelman@habana.ai Cc: Ofir Bitton obitton@habana.ai Cc: Tomer Tayar ttayar@habana.ai Cc: Moti Haimovski mhaimovski@habana.ai Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Pawel Piskorski ppiskorski@habana.ai -- v2: Use unpin_user_pages_dirty_lock (John)
drivers/misc/habanalabs/Kconfig | 1 - drivers/misc/habanalabs/common/habanalabs.h | 3 +- drivers/misc/habanalabs/common/memory.c | 49 ++++++++------------- 3 files changed, 20 insertions(+), 33 deletions(-)
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig index 8eb5d38c618e..2f04187f7167 100644 --- a/drivers/misc/habanalabs/Kconfig +++ b/drivers/misc/habanalabs/Kconfig @@ -6,7 +6,6 @@ config HABANA_AI tristate "HabanaAI accelerators (habanalabs)" depends on PCI && HAS_IOMEM
select FRAME_VECTOR select DMA_SHARED_BUFFER select GENERIC_ALLOCATOR select HWMON
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h index edbd627b29d2..c1b3ad613b15 100644 --- a/drivers/misc/habanalabs/common/habanalabs.h +++ b/drivers/misc/habanalabs/common/habanalabs.h @@ -881,7 +881,8 @@ struct hl_ctx_mgr { struct hl_userptr { enum vm_type_t vm_type; /* must be first */ struct list_head job_node;
struct frame_vector *vec;
struct page **pages;
unsigned int npages;
Can you please update the kerneldoc comment section of this structure according to your changes ?
Apologies I missed the nice kerneldoc. I'll fix that in the next round.
struct sg_table *sgt; enum dma_data_direction dir; struct list_head debugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index 5ff4688683fd..327b64479f97 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, return -EFAULT; }
userptr->vec = frame_vector_create(npages);
if (!userptr->vec) {
userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
GFP_KERNEL);
if (!userptr->pages) { dev_err(hdev->dev, "Failed to create frame vector\n"); return -ENOMEM; }
rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->vec);
rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->pages); if (rc != npages) { dev_err(hdev->dev, "Failed to map host memory, user ptr probably wrong\n"); if (rc < 0)
goto destroy_framevec;
goto destroy_pages;
npages = rc; rc = -EFAULT;
goto put_framevec;
}
if (frame_vector_to_pages(userptr->vec) < 0) {
dev_err(hdev->dev,
"Failed to translate frame vector to pages\n");
rc = -EFAULT;
goto put_framevec;
goto put_pages; }
userptr->npages = npages; rc = sg_alloc_table_from_pages(userptr->sgt,
frame_vector_pages(userptr->vec),
npages, offset, size, GFP_ATOMIC);
userptr->pages,
npages, offset, size, GFP_ATOMIC);
I think that because the call to kvmalloc_array() is done with GFP_KERNEL, there is no point in using GFP_ATOMIC here. And actually, this path only needs to avoid yielding when using a special debug mode. So I suggest putting here GFP_KERNEL.
Huh, I didn't even notice the GFP_ATOMIC here. This looks indeed strange and GFP_KERNEL should be perfectly fine in a function that also calls pin_user_pages (since that one can allocate and do worse stuff like userspace pagefaults).
But since that GFP_ATOMIC is there already I'll do that in a separate patch.
Ok I read up on your usage of GFP_ATOMIC in habanalabs, and I'm not going to touch this. But I'm pretty sure it's broken.
You seem to have some requirement of not allocating memory with blocking (see hl_cb_alloc()), and that seems to be why you allocate tons of structures with GFP_ATOMIC. There are 2 pretty tough problems with that:
- GFP_ATOMIC can fail, even when the system hasn't run out of memory yet. You _must_ have a fallback path to handle allocation failures for these (see the sketch below). A quick survey shows a ton of GFP_ATOMIC callsites and very little fallback code - I've found none, but I didn't check the failure handlers all the way up the possible callchains.
- pin_user_pages can allocate memory, so you're breaking your own "no sleeping in these paths" rules.
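[Editor's note: a generic sketch of the first point, not habanalabs code - every GFP_ATOMIC callsite needs an explicit failure path, or the allocation has to move to a sleepable context or be done up front.]

#include <linux/slab.h>

struct foo_request {
	int payload;
};

/*
 * Sketch: GFP_ATOMIC can fail long before the system is truly out of
 * memory, so the NULL case must be handled for real - back off, fail the
 * request cleanly, or fall back to a pre-allocated reserve instead of
 * just crashing later.
 */
static struct foo_request *foo_request_alloc_atomic(void)
{
	struct foo_request *req = kzalloc(sizeof(*req), GFP_ATOMIC);

	if (!req)
		return NULL;	/* callers must cope with this */

	return req;
}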
This isn't going to get fixed with a quick oneliner patch; depending on what's needed you're looking at a driver rearchitecture here :-/ Hence I'm not going to touch this in the next patch, but leave it all as-is.
Cheers, Daniel
In the meanwhile, I'll run this patch (coupled with the next patch) in our C/I to make sure there are no regressions.
Excellent. I'll wait with v3 until that's done, just in case you hit a snag I need to fix.
Cheers, Daniel
Thanks, Oded
if (rc < 0) { dev_err(hdev->dev, "failed to create SG table from pages\n");
goto put_framevec;
goto put_pages; } return 0;
-put_framevec:
put_vaddr_frames(userptr->vec);
-destroy_framevec:
frame_vector_destroy(userptr->vec);
+put_pages:
unpin_user_pages(userptr->pages, npages);
+destroy_pages:
kvfree(userptr->pages); return rc;
}
@@ -1405,8 +1401,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size, */ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) {
struct page **pages;
hl_debugfs_remove_userptr(hdev, userptr); if (userptr->dma_mapped)
@@ -1414,15 +1408,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) userptr->sgt->nents, userptr->dir);
pages = frame_vector_pages(userptr->vec);
if (!IS_ERR(pages)) {
int i;
for (i = 0; i < frame_vector_count(userptr->vec); i++)
set_page_dirty_lock(pages[i]);
}
put_vaddr_frames(userptr->vec);
frame_vector_destroy(userptr->vec);
unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true);
kvfree(userptr->pages); list_del(&userptr->job_node);
-- 2.28.0
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Sun, Oct 11, 2020 at 12:41 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Sat, Oct 10, 2020 at 11:32 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Sat, Oct 10, 2020 at 10:27 PM Oded Gabbay oded.gabbay@gmail.com wrote:
On Fri, Oct 9, 2020 at 10:59 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Thanks for the patch Daniel.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Oded Gabbay oded.gabbay@gmail.com Cc: Omer Shpigelman oshpigelman@habana.ai Cc: Ofir Bitton obitton@habana.ai Cc: Tomer Tayar ttayar@habana.ai Cc: Moti Haimovski mhaimovski@habana.ai Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Pawel Piskorski ppiskorski@habana.ai -- v2: Use unpin_user_pages_dirty_lock (John)
drivers/misc/habanalabs/Kconfig | 1 - drivers/misc/habanalabs/common/habanalabs.h | 3 +- drivers/misc/habanalabs/common/memory.c | 49 ++++++++------------- 3 files changed, 20 insertions(+), 33 deletions(-)
diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig index 8eb5d38c618e..2f04187f7167 100644 --- a/drivers/misc/habanalabs/Kconfig +++ b/drivers/misc/habanalabs/Kconfig @@ -6,7 +6,6 @@ config HABANA_AI tristate "HabanaAI accelerators (habanalabs)" depends on PCI && HAS_IOMEM
select FRAME_VECTOR select DMA_SHARED_BUFFER select GENERIC_ALLOCATOR select HWMON
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h index edbd627b29d2..c1b3ad613b15 100644 --- a/drivers/misc/habanalabs/common/habanalabs.h +++ b/drivers/misc/habanalabs/common/habanalabs.h @@ -881,7 +881,8 @@ struct hl_ctx_mgr { struct hl_userptr { enum vm_type_t vm_type; /* must be first */ struct list_head job_node;
struct frame_vector *vec;
struct page **pages;
unsigned int npages;
Can you please update the kerneldoc comment section of this structure according to your changes ?
Apologies I missed the nice kerneldoc. I'll fix that in the next round.
struct sg_table *sgt; enum dma_data_direction dir; struct list_head debugfs_list;
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c index 5ff4688683fd..327b64479f97 100644 --- a/drivers/misc/habanalabs/common/memory.c +++ b/drivers/misc/habanalabs/common/memory.c @@ -1281,45 +1281,41 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size, return -EFAULT; }
userptr->vec = frame_vector_create(npages);
if (!userptr->vec) {
userptr->pages = kvmalloc_array(npages, sizeof(*userptr->pages),
GFP_KERNEL);
if (!userptr->pages) { dev_err(hdev->dev, "Failed to create frame vector\n"); return -ENOMEM; }
rc = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->vec);
rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
userptr->pages); if (rc != npages) { dev_err(hdev->dev, "Failed to map host memory, user ptr probably wrong\n"); if (rc < 0)
goto destroy_framevec;
goto destroy_pages;
npages = rc; rc = -EFAULT;
goto put_framevec;
}
if (frame_vector_to_pages(userptr->vec) < 0) {
dev_err(hdev->dev,
"Failed to translate frame vector to pages\n");
rc = -EFAULT;
goto put_framevec;
goto put_pages; }
userptr->npages = npages; rc = sg_alloc_table_from_pages(userptr->sgt,
frame_vector_pages(userptr->vec),
npages, offset, size, GFP_ATOMIC);
userptr->pages,
npages, offset, size, GFP_ATOMIC);
I think that because the call to kvmalloc_array() is done with GFP_KERNEL, there is no point in using GFP_ATOMIC here. And actually, this path only needs to avoid yielding when using a special debug mode. So I suggest putting here GFP_KERNEL.
Huh, I didn't even notice the GFP_ATOMIC here. This looks indeed strange and GFP_KERNEL should be perfectly fine in a function that also calls pin_user_pages (since that one can allocate and do worse stuff like userspace pagefaults).
But since that GFP_ATOMIC is there already I'll do that in a separate patch.
Ok I read up on your usage of GFP_ATOMIC in habanalabs, and I'm not going to touch this. But I'm pretty sure it's broken.
You seem to have some requirement of not allocating memory with blocking (see hl_cb_alloc()), and that seems to be way you allocate tons of structures with GFP_ATOMIC. There's 2 pretty tough problems with that:
- GFP_ATOMIC can fail, even when the system hasn't run out of memory
yet. You _must_ have a fallback back to handle allocation failures for these. Quick survey shows you a ton of GFP_ATOMIC callsites, and very little fallback code - I've found none, but I didn't check the failure handlers all going up the possible callchains.
- pin_user_pages can allocate memory, so you're breaking your own "no
sleeping in these paths" rules.
This isn't going to get fixed with a quick oneliner patch, depending what's needed you're looking at a driver rearchitecture here :-/ Hence I'm not going to touch this in the next patch, but leave it all as-is.
Most of those requirements come from code that is only relevant in initial bringup and in our first ASIC (GOYA) for the first few months we had it. I'm going to remove all that code from the upstream driver as it's not needed there.
Then, I'll go and look at all other uses of GFP_ATOMIC to see how they can be improved/removed, maybe with pre-allocated stuff. Thanks for pointing this out. Oded
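[Editor's note: one common shape for the "pre-allocated stuff" approach is a mempool that is filled from a sleepable context and drawn from in the atomic paths. A minimal generic sketch, not habanalabs code:]

#include <linux/mempool.h>
#include <linux/slab.h>

struct foo_cmd {
	int payload;
};

static mempool_t *foo_cmd_pool;

/* set up from probe, where sleeping is allowed */
static int foo_init(void)
{
	foo_cmd_pool = mempool_create_kmalloc_pool(16, sizeof(struct foo_cmd));
	return foo_cmd_pool ? 0 : -ENOMEM;
}

/* atomic path: falls back to the pre-allocated reserve when kmalloc fails
 * (can still return NULL once the reserve is exhausted) */
static struct foo_cmd *foo_cmd_get_atomic(void)
{
	return mempool_alloc(foo_cmd_pool, GFP_ATOMIC);
}

static void foo_cmd_put(struct foo_cmd *cmd)
{
	mempool_free(cmd, foo_cmd_pool);
}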
Cheers, Daniel
In the meanwhile, I'll run this patch (coupled with the next patch) in our C/I to make sure there are no regressions.
Excellent. I'll wait with v3 until that's done, just in case you hit a snag I need to fix.
Cheers, Daniel
Thanks, Oded
if (rc < 0) { dev_err(hdev->dev, "failed to create SG table from pages\n");
goto put_framevec;
goto put_pages; } return 0;
-put_framevec:
put_vaddr_frames(userptr->vec);
-destroy_framevec:
frame_vector_destroy(userptr->vec);
+put_pages:
unpin_user_pages(userptr->pages, npages);
+destroy_pages:
kvfree(userptr->pages); return rc;
}
@@ -1405,8 +1401,6 @@ int hl_pin_host_memory(struct hl_device *hdev, u64 addr, u64 size, */ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) {
struct page **pages;
hl_debugfs_remove_userptr(hdev, userptr); if (userptr->dma_mapped)
@@ -1414,15 +1408,8 @@ void hl_unpin_host_memory(struct hl_device *hdev, struct hl_userptr *userptr) userptr->sgt->nents, userptr->dir);
pages = frame_vector_pages(userptr->vec);
if (!IS_ERR(pages)) {
int i;
for (i = 0; i < frame_vector_count(userptr->vec); i++)
set_page_dirty_lock(pages[i]);
}
put_vaddr_frames(userptr->vec);
frame_vector_destroy(userptr->vec);
unpin_user_pages_dirty_lock(userptr->pages, userptr->npages, true);
kvfree(userptr->pages); list_del(&userptr->job_node);
-- 2.28.0
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 10/9/20 12:59 AM, Daniel Vetter wrote:
All we need are a pages array, pin_user_pages_fast can give us that directly. Plus this avoids the entire raw pfn side of get_vaddr_frames.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Oded Gabbay oded.gabbay@gmail.com Cc: Omer Shpigelman oshpigelman@habana.ai Cc: Ofir Bitton obitton@habana.ai Cc: Tomer Tayar ttayar@habana.ai Cc: Moti Haimovski mhaimovski@habana.ai Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Pawel Piskorski ppiskorski@habana.ai -- v2: Use unpin_user_pages_dirty_lock (John)
drivers/misc/habanalabs/Kconfig | 1 - drivers/misc/habanalabs/common/habanalabs.h | 3 +- drivers/misc/habanalabs/common/memory.c | 49 ++++++++------------- 3 files changed, 20 insertions(+), 33 deletions(-)
Reviewed-by: John Hubbard jhubbard@nvidia.com
thanks,
These are persistent, not just for the duration of a dma operation.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Omer Shpigelman <oshpigelman@habana.ai>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pawel Piskorski <ppiskorski@habana.ai>
---
 drivers/misc/habanalabs/common/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
index 327b64479f97..767d3644c033 100644
--- a/drivers/misc/habanalabs/common/memory.c
+++ b/drivers/misc/habanalabs/common/memory.c
@@ -1288,7 +1288,8 @@ static int get_user_memory(struct hl_device *hdev, u64 addr, u64 size,
 		return -ENOMEM;
 	}
 
-	rc = pin_user_pages_fast(start, npages, FOLL_FORCE | FOLL_WRITE,
+	rc = pin_user_pages_fast(start, npages,
+				 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
 				 userptr->pages);
 
 	if (rc != npages) {
This is used by media/videobuf2 for persistent dma mappings, not just for a single dma operation that is then freed again, so it needs FOLL_LONGTERM.
Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to locking issues. Rework the code to pull the pup path out from the mmap_sem critical section as suggested by Jason.
By relying entirely on the vma checks in pin_user_pages and follow_pfn (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
--
v2: Streamline the code and further simplify the loop checks (Jason)
---
 mm/frame_vector.c | 50 ++++++++++++++---------------------------------
 1 file changed, 15 insertions(+), 35 deletions(-)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c index 10f82d5643b6..d44779e56313 100644 --- a/mm/frame_vector.c +++ b/mm/frame_vector.c @@ -38,7 +38,6 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, struct vm_area_struct *vma; int ret = 0; int err; - int locked;
if (nr_frames == 0) return 0; @@ -48,40 +47,25 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
start = untagged_addr(start);
- mmap_read_lock(mm); - locked = 1; - vma = find_vma_intersection(mm, start, start + 1); - if (!vma) { - ret = -EFAULT; - goto out; - } - - /* - * While get_vaddr_frames() could be used for transient (kernel - * controlled lifetime) pinning of memory pages all current - * users establish long term (userspace controlled lifetime) - * page pinning. Treat get_vaddr_frames() like - * get_user_pages_longterm() and disallow it for filesystem-dax - * mappings. - */ - if (vma_is_fsdax(vma)) { - ret = -EOPNOTSUPP; - goto out; - } - - if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) { + ret = pin_user_pages_fast(start, nr_frames, + FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM, + (struct page **)(vec->ptrs)); + if (ret > 0) { vec->got_ref = true; vec->is_pfns = false; - ret = pin_user_pages_locked(start, nr_frames, - gup_flags, (struct page **)(vec->ptrs), &locked); - goto out; + goto out_unlocked; }
+ mmap_read_lock(mm); vec->got_ref = false; vec->is_pfns = true; do { unsigned long *nums = frame_vector_pfns(vec);
+ vma = find_vma_intersection(mm, start, start + 1); + if (!vma) + break; + while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) { err = follow_pfn(vma, start, &nums[ret]); if (err) { @@ -92,17 +76,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames, start += PAGE_SIZE; ret++; } - /* - * We stop if we have enough pages or if VMA doesn't completely - * cover the tail page. - */ - if (ret >= nr_frames || start < vma->vm_end) + /* Bail out if VMA doesn't completely cover the tail page. */ + if (start < vma->vm_end) break; - vma = find_vma_intersection(mm, start, start + 1); - } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP)); + } while (ret < nr_frames); out: - if (locked) - mmap_read_unlock(mm); + mmap_read_unlock(mm); +out_unlocked: if (!ret) ret = -EFAULT; if (ret > 0)
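[Editor's note: since the flattened diff above is hard to follow, here is roughly what the reworked function looks like after this patch. This is a trimmed paraphrase for readability, not the authoritative version - the allocation-size sanity check is omitted and the now-unused gup_flags parameter is kept as in the patch; see the diff for the exact code.]

int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
		     unsigned int gup_flags, struct frame_vector *vec)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	int ret = 0;
	int err;

	if (nr_frames == 0)
		return 0;

	start = untagged_addr(start);

	/* Fast path: try to pin struct pages, without holding mmap_lock. */
	ret = pin_user_pages_fast(start, nr_frames,
				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
				  (struct page **)(vec->ptrs));
	if (ret > 0) {
		vec->got_ref = true;
		vec->is_pfns = false;
		goto out_unlocked;
	}

	/* Slow path: raw pfn walk for VM_IO / VM_PFNMAP mappings. */
	mmap_read_lock(mm);
	vec->got_ref = false;
	vec->is_pfns = true;
	do {
		unsigned long *nums = frame_vector_pfns(vec);

		vma = find_vma_intersection(mm, start, start + 1);
		if (!vma)
			break;
		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
			err = follow_pfn(vma, start, &nums[ret]);
			if (err) {
				if (!ret)
					ret = err;
				goto out;
			}
			start += PAGE_SIZE;
			ret++;
		}
		/* Bail out if VMA doesn't completely cover the tail page. */
		if (start < vma->vm_end)
			break;
	} while (ret < nr_frames);
out:
	mmap_read_unlock(mm);
out_unlocked:
	if (!ret)
		ret = -EFAULT;
	if (ret > 0)
		vec->nr_frames = ret;
	return ret;
}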
On 10/9/20 12:59 AM, Daniel Vetter wrote: ...
@@ -48,40 +47,25 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
start = untagged_addr(start);
- mmap_read_lock(mm);
- locked = 1;
- vma = find_vma_intersection(mm, start, start + 1);
- if (!vma) {
ret = -EFAULT;
goto out;
- }
- /*
* While get_vaddr_frames() could be used for transient (kernel
* controlled lifetime) pinning of memory pages all current
* users establish long term (userspace controlled lifetime)
* page pinning. Treat get_vaddr_frames() like
* get_user_pages_longterm() and disallow it for filesystem-dax
* mappings.
*/
- if (vma_is_fsdax(vma)) {
ret = -EOPNOTSUPP;
goto out;
- }
- if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
- ret = pin_user_pages_fast(start, nr_frames,
FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
(struct page **)(vec->ptrs));
- if (ret > 0) {
None of the callers that we have today will accept anything less than ret == nr_frames. And the whole partially pinned region idea turns out to be just not useful for almost everyone, from what I recall of the gup/pup call sites. So I wonder if we should just have get_vaddr_frames do the cleanup here and return -EFAULT if ret != nr_frames?
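[Editor's note: in other words, something along these lines on top of the fast-path hunk - a hypothetical sketch of the suggestion, not a posted patch.]

	ret = pin_user_pages_fast(start, nr_frames,
				  FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
				  (struct page **)(vec->ptrs));
	if (ret > 0 && ret != nr_frames) {
		/* no caller can use a partial pin: clean up and fail */
		unpin_user_pages((struct page **)(vec->ptrs), ret);
		return -EFAULT;
	}
	if (ret == nr_frames) {
		vec->got_ref = true;
		vec->is_pfns = false;
		goto out_unlocked;
	}
	/* ret <= 0: fall through to the follow_pfn slow path */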
thanks,
On Fri, Oct 16, 2020 at 9:54 AM John Hubbard jhubbard@nvidia.com wrote:
On 10/9/20 12:59 AM, Daniel Vetter wrote: ...
@@ -48,40 +47,25 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
start = untagged_addr(start);
mmap_read_lock(mm);
locked = 1;
vma = find_vma_intersection(mm, start, start + 1);
if (!vma) {
ret = -EFAULT;
goto out;
}
/*
* While get_vaddr_frames() could be used for transient (kernel
* controlled lifetime) pinning of memory pages all current
* users establish long term (userspace controlled lifetime)
* page pinning. Treat get_vaddr_frames() like
* get_user_pages_longterm() and disallow it for filesystem-dax
* mappings.
*/
if (vma_is_fsdax(vma)) {
ret = -EOPNOTSUPP;
goto out;
}
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
ret = pin_user_pages_fast(start, nr_frames,
FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
(struct page **)(vec->ptrs));
if (ret > 0) {
None of the callers that we have today will accept anything less than ret == nr_frames. And the whole partially pinned region idea turns out to be just not useful for almost everyone, from what I recall of the gup/pup call sites. So I wonder if we should just have get_vaddr_frames do the cleanup here and return -EFAULT, if ret != nr_frames ?
Yeah I noticed that the calling convention here is a bit funny. But with these frame-vector helpers now being part of drivers/media it's up to the media folks whether they want to clean that up, or leave it as is.
If this were in drm I'd say we'll have the loud warning and tainting due to CONFIG_STRICT_FOLLOW_PFN=n for 2-3 years. Then, assuming no big complaints show up, rip it all out and just directly call pup in each place that wants it (like I've done for habanalabs and exynos). -Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
videobuf2 is the only user. This also garbage collects the CONFIG_FRAME_VECTOR symbol from all over the tree (well, just one place - somehow the omap media driver still had this in its Kconfig, despite not using it).
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/media/common/videobuf2/Kconfig     |  1 -
 drivers/media/common/videobuf2/Makefile    |  1 +
 .../media/common/videobuf2}/frame_vector.c |  2 +
 drivers/media/platform/omap/Kconfig        |  1 -
 include/linux/mm.h                         | 42 -------------------
 include/media/videobuf2-core.h             | 42 +++++++++++++++++++
 mm/Kconfig                                 |  3 --
 mm/Makefile                                |  1 -
 8 files changed, 45 insertions(+), 48 deletions(-)
 rename {mm => drivers/media/common/videobuf2}/frame_vector.c (99%)
diff --git a/drivers/media/common/videobuf2/Kconfig b/drivers/media/common/videobuf2/Kconfig index edbc99ebba87..d2223a12c95f 100644 --- a/drivers/media/common/videobuf2/Kconfig +++ b/drivers/media/common/videobuf2/Kconfig @@ -9,7 +9,6 @@ config VIDEOBUF2_V4L2
config VIDEOBUF2_MEMOPS tristate - select FRAME_VECTOR
config VIDEOBUF2_DMA_CONTIG tristate diff --git a/drivers/media/common/videobuf2/Makefile b/drivers/media/common/videobuf2/Makefile index 77bebe8b202f..54306f8d096c 100644 --- a/drivers/media/common/videobuf2/Makefile +++ b/drivers/media/common/videobuf2/Makefile @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 videobuf2-common-objs := videobuf2-core.o +videobuf2-common-objs += frame_vector.o
ifeq ($(CONFIG_TRACEPOINTS),y) videobuf2-common-objs += vb2-trace.o diff --git a/mm/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c similarity index 99% rename from mm/frame_vector.c rename to drivers/media/common/videobuf2/frame_vector.c index d44779e56313..2b0b97761d15 100644 --- a/mm/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -8,6 +8,8 @@ #include <linux/pagemap.h> #include <linux/sched.h>
+#include <media/videobuf2-core.h> + /** * get_vaddr_frames() - map virtual addresses to pfns * @start: starting user address diff --git a/drivers/media/platform/omap/Kconfig b/drivers/media/platform/omap/Kconfig index f73b5893220d..de16de46c0f4 100644 --- a/drivers/media/platform/omap/Kconfig +++ b/drivers/media/platform/omap/Kconfig @@ -12,6 +12,5 @@ config VIDEO_OMAP2_VOUT depends on VIDEO_V4L2 select VIDEOBUF2_DMA_CONTIG select OMAP2_VRFB if ARCH_OMAP2 || ARCH_OMAP3 - select FRAME_VECTOR help V4L2 Display driver support for OMAP2/3 based boards. diff --git a/include/linux/mm.h b/include/linux/mm.h index 16b799a0522c..acd60fbf1a5a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1743,48 +1743,6 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, struct task_struct *task, bool bypass_rlim);
-/* Container for pinned pfns / pages */ -struct frame_vector { - unsigned int nr_allocated; /* Number of frames we have space for */ - unsigned int nr_frames; /* Number of frames stored in ptrs array */ - bool got_ref; /* Did we pin pages by getting page ref? */ - bool is_pfns; /* Does array contain pages or pfns? */ - void *ptrs[]; /* Array of pinned pfns / pages. Use - * pfns_vector_pages() or pfns_vector_pfns() - * for access */ -}; - -struct frame_vector *frame_vector_create(unsigned int nr_frames); -void frame_vector_destroy(struct frame_vector *vec); -int get_vaddr_frames(unsigned long start, unsigned int nr_pfns, - unsigned int gup_flags, struct frame_vector *vec); -void put_vaddr_frames(struct frame_vector *vec); -int frame_vector_to_pages(struct frame_vector *vec); -void frame_vector_to_pfns(struct frame_vector *vec); - -static inline unsigned int frame_vector_count(struct frame_vector *vec) -{ - return vec->nr_frames; -} - -static inline struct page **frame_vector_pages(struct frame_vector *vec) -{ - if (vec->is_pfns) { - int err = frame_vector_to_pages(vec); - - if (err) - return ERR_PTR(err); - } - return (struct page **)(vec->ptrs); -} - -static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) -{ - if (!vec->is_pfns) - frame_vector_to_pfns(vec); - return (unsigned long *)(vec->ptrs); -} - struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h index bbb3f26fbde9..a2e75ca0334f 100644 --- a/include/media/videobuf2-core.h +++ b/include/media/videobuf2-core.h @@ -1254,4 +1254,46 @@ bool vb2_request_object_is_buffer(struct media_request_object *obj); */ unsigned int vb2_request_buffer_cnt(struct media_request *req);
+/* Container for pinned pfns / pages in frame_vector.c */ +struct frame_vector { + unsigned int nr_allocated; /* Number of frames we have space for */ + unsigned int nr_frames; /* Number of frames stored in ptrs array */ + bool got_ref; /* Did we pin pages by getting page ref? */ + bool is_pfns; /* Does array contain pages or pfns? */ + void *ptrs[]; /* Array of pinned pfns / pages. Use + * pfns_vector_pages() or pfns_vector_pfns() + * for access */ +}; + +struct frame_vector *frame_vector_create(unsigned int nr_frames); +void frame_vector_destroy(struct frame_vector *vec); +int get_vaddr_frames(unsigned long start, unsigned int nr_pfns, + unsigned int gup_flags, struct frame_vector *vec); +void put_vaddr_frames(struct frame_vector *vec); +int frame_vector_to_pages(struct frame_vector *vec); +void frame_vector_to_pfns(struct frame_vector *vec); + +static inline unsigned int frame_vector_count(struct frame_vector *vec) +{ + return vec->nr_frames; +} + +static inline struct page **frame_vector_pages(struct frame_vector *vec) +{ + if (vec->is_pfns) { + int err = frame_vector_to_pages(vec); + + if (err) + return ERR_PTR(err); + } + return (struct page **)(vec->ptrs); +} + +static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) +{ + if (!vec->is_pfns) + frame_vector_to_pfns(vec); + return (unsigned long *)(vec->ptrs); +} + #endif /* _MEDIA_VIDEOBUF2_CORE_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 6c974888f86f..da6c943fe9f1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -815,9 +815,6 @@ config DEVICE_PRIVATE memory; i.e., memory that is only accessible from the device (or group of devices). You likely also want to select HMM_MIRROR.
-config FRAME_VECTOR - bool - config ARCH_USES_HIGH_VMA_FLAGS bool config ARCH_HAS_PKEYS diff --git a/mm/Makefile b/mm/Makefile index d5649f1c12c0..a025fd6c6afd 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -111,7 +111,6 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o -obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
On Fri, 9 Oct 2020 09:59:23 +0200 Daniel Vetter daniel.vetter@ffwll.ch wrote:
It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR symbol from all over the tree (well just one place, somehow omap media driver still had this in its Kconfig, despite not using it).
Reviewed-by: John Hubbard jhubbard@nvidia.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Pawel Osciak pawel@osciak.com Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Tomasz Figa tfiga@chromium.org Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Daniel Vetter daniel.vetter@ffwll.ch
drivers/media/common/videobuf2/Kconfig | 1 - drivers/media/common/videobuf2/Makefile | 1 + .../media/common/videobuf2}/frame_vector.c | 2 + drivers/media/platform/omap/Kconfig | 1 - include/linux/mm.h | 42 ------------------- include/media/videobuf2-core.h | 42 +++++++++++++++++++ mm/Kconfig | 3 -- mm/Makefile | 1 - 8 files changed, 45 insertions(+), 48 deletions(-) rename {mm => drivers/media/common/videobuf2}/frame_vector.c (99%)
diff --git a/drivers/media/common/videobuf2/Kconfig b/drivers/media/common/videobuf2/Kconfig index edbc99ebba87..d2223a12c95f 100644 --- a/drivers/media/common/videobuf2/Kconfig +++ b/drivers/media/common/videobuf2/Kconfig @@ -9,7 +9,6 @@ config VIDEOBUF2_V4L2
config VIDEOBUF2_MEMOPS tristate
- select FRAME_VECTOR
config VIDEOBUF2_DMA_CONTIG tristate diff --git a/drivers/media/common/videobuf2/Makefile b/drivers/media/common/videobuf2/Makefile index 77bebe8b202f..54306f8d096c 100644 --- a/drivers/media/common/videobuf2/Makefile +++ b/drivers/media/common/videobuf2/Makefile @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 videobuf2-common-objs := videobuf2-core.o +videobuf2-common-objs += frame_vector.o
ifeq ($(CONFIG_TRACEPOINTS),y) videobuf2-common-objs += vb2-trace.o diff --git a/mm/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c similarity index 99% rename from mm/frame_vector.c rename to drivers/media/common/videobuf2/frame_vector.c index d44779e56313..2b0b97761d15 100644 --- a/mm/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -8,6 +8,8 @@ #include <linux/pagemap.h> #include <linux/sched.h>
+#include <media/videobuf2-core.h>
See my comment below...
/**
- get_vaddr_frames() - map virtual addresses to pfns
- @start: starting user address
diff --git a/drivers/media/platform/omap/Kconfig b/drivers/media/platform/omap/Kconfig index f73b5893220d..de16de46c0f4 100644 --- a/drivers/media/platform/omap/Kconfig +++ b/drivers/media/platform/omap/Kconfig @@ -12,6 +12,5 @@ config VIDEO_OMAP2_VOUT depends on VIDEO_V4L2 select VIDEOBUF2_DMA_CONTIG select OMAP2_VRFB if ARCH_OMAP2 || ARCH_OMAP3
- select FRAME_VECTOR help V4L2 Display driver support for OMAP2/3 based boards.
diff --git a/include/linux/mm.h b/include/linux/mm.h index 16b799a0522c..acd60fbf1a5a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1743,48 +1743,6 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, struct task_struct *task, bool bypass_rlim);
-/* Container for pinned pfns / pages */ -struct frame_vector {
- unsigned int nr_allocated; /* Number of frames we have space for */
- unsigned int nr_frames; /* Number of frames stored in ptrs array */
- bool got_ref; /* Did we pin pages by getting page ref? */
- bool is_pfns; /* Does array contain pages or pfns? */
- void *ptrs[]; /* Array of pinned pfns / pages. Use
* pfns_vector_pages() or pfns_vector_pfns()
* for access */
-};
-struct frame_vector *frame_vector_create(unsigned int nr_frames); -void frame_vector_destroy(struct frame_vector *vec); -int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
unsigned int gup_flags, struct frame_vector *vec);
-void put_vaddr_frames(struct frame_vector *vec); -int frame_vector_to_pages(struct frame_vector *vec); -void frame_vector_to_pfns(struct frame_vector *vec);
-static inline unsigned int frame_vector_count(struct frame_vector *vec) -{
- return vec->nr_frames;
-}
-static inline struct page **frame_vector_pages(struct frame_vector *vec) -{
- if (vec->is_pfns) {
int err = frame_vector_to_pages(vec);
if (err)
return ERR_PTR(err);
- }
- return (struct page **)(vec->ptrs);
-}
-static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) -{
- if (!vec->is_pfns)
frame_vector_to_pfns(vec);
- return (unsigned long *)(vec->ptrs);
-}
struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h index bbb3f26fbde9..a2e75ca0334f 100644 --- a/include/media/videobuf2-core.h +++ b/include/media/videobuf2-core.h @@ -1254,4 +1254,46 @@ bool vb2_request_object_is_buffer(struct media_request_object *obj); */ unsigned int vb2_request_buffer_cnt(struct media_request *req);
+/* Container for pinned pfns / pages in frame_vector.c */ +struct frame_vector {
- unsigned int nr_allocated; /* Number of frames we have space for */
- unsigned int nr_frames; /* Number of frames stored in ptrs array */
- bool got_ref; /* Did we pin pages by getting page ref? */
- bool is_pfns; /* Does array contain pages or pfns? */
- void *ptrs[]; /* Array of pinned pfns / pages. Use
* pfns_vector_pages() or pfns_vector_pfns()
* for access */
+};
+struct frame_vector *frame_vector_create(unsigned int nr_frames); +void frame_vector_destroy(struct frame_vector *vec); +int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
unsigned int gup_flags, struct frame_vector *vec);
+void put_vaddr_frames(struct frame_vector *vec); +int frame_vector_to_pages(struct frame_vector *vec); +void frame_vector_to_pfns(struct frame_vector *vec);
+static inline unsigned int frame_vector_count(struct frame_vector *vec) +{
- return vec->nr_frames;
+}
+static inline struct page **frame_vector_pages(struct frame_vector *vec) +{
- if (vec->is_pfns) {
int err = frame_vector_to_pages(vec);
if (err)
return ERR_PTR(err);
- }
- return (struct page **)(vec->ptrs);
+}
+static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) +{
- if (!vec->is_pfns)
frame_vector_to_pfns(vec);
- return (unsigned long *)(vec->ptrs);
+}
#endif /* _MEDIA_VIDEOBUF2_CORE_H */
Please place those into a include/media/frame_vector.h file, instead of merging it directly at vb2 core header.
Then include the new header at videobuf2-core.h and at frame_vector.c.
With such changes:
Acked-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org
diff --git a/mm/Kconfig b/mm/Kconfig index 6c974888f86f..da6c943fe9f1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -815,9 +815,6 @@ config DEVICE_PRIVATE memory; i.e., memory that is only accessible from the device (or group of devices). You likely also want to select HMM_MIRROR.
-config FRAME_VECTOR
- bool
config ARCH_USES_HIGH_VMA_FLAGS bool config ARCH_HAS_PKEYS diff --git a/mm/Makefile b/mm/Makefile index d5649f1c12c0..a025fd6c6afd 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -111,7 +111,6 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o -obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
Thanks, Mauro
On Fri, Oct 09, 2020 at 12:14:17PM +0200, Mauro Carvalho Chehab wrote:
Em Fri, 9 Oct 2020 09:59:23 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
It's the only user. This also garbage collects the CONFIG_FRAME_VECTOR symbol from all over the tree (well just one place, somehow omap media driver still had this in its Kconfig, despite not using it).
Reviewed-by: John Hubbard jhubbard@nvidia.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Pawel Osciak pawel@osciak.com Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Tomasz Figa tfiga@chromium.org Cc: Mauro Carvalho Chehab mchehab@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Daniel Vetter daniel.vetter@ffwll.ch
drivers/media/common/videobuf2/Kconfig | 1 - drivers/media/common/videobuf2/Makefile | 1 + .../media/common/videobuf2}/frame_vector.c | 2 + drivers/media/platform/omap/Kconfig | 1 - include/linux/mm.h | 42 ------------------- include/media/videobuf2-core.h | 42 +++++++++++++++++++ mm/Kconfig | 3 -- mm/Makefile | 1 - 8 files changed, 45 insertions(+), 48 deletions(-) rename {mm => drivers/media/common/videobuf2}/frame_vector.c (99%)
diff --git a/drivers/media/common/videobuf2/Kconfig b/drivers/media/common/videobuf2/Kconfig index edbc99ebba87..d2223a12c95f 100644 --- a/drivers/media/common/videobuf2/Kconfig +++ b/drivers/media/common/videobuf2/Kconfig @@ -9,7 +9,6 @@ config VIDEOBUF2_V4L2
config VIDEOBUF2_MEMOPS tristate
- select FRAME_VECTOR
config VIDEOBUF2_DMA_CONTIG tristate diff --git a/drivers/media/common/videobuf2/Makefile b/drivers/media/common/videobuf2/Makefile index 77bebe8b202f..54306f8d096c 100644 --- a/drivers/media/common/videobuf2/Makefile +++ b/drivers/media/common/videobuf2/Makefile @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 videobuf2-common-objs := videobuf2-core.o +videobuf2-common-objs += frame_vector.o
ifeq ($(CONFIG_TRACEPOINTS),y) videobuf2-common-objs += vb2-trace.o diff --git a/mm/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c similarity index 99% rename from mm/frame_vector.c rename to drivers/media/common/videobuf2/frame_vector.c index d44779e56313..2b0b97761d15 100644 --- a/mm/frame_vector.c +++ b/drivers/media/common/videobuf2/frame_vector.c @@ -8,6 +8,8 @@ #include <linux/pagemap.h> #include <linux/sched.h>
+#include <media/videobuf2-core.h>
See my comment below...
/**
- get_vaddr_frames() - map virtual addresses to pfns
- @start: starting user address
diff --git a/drivers/media/platform/omap/Kconfig b/drivers/media/platform/omap/Kconfig index f73b5893220d..de16de46c0f4 100644 --- a/drivers/media/platform/omap/Kconfig +++ b/drivers/media/platform/omap/Kconfig @@ -12,6 +12,5 @@ config VIDEO_OMAP2_VOUT depends on VIDEO_V4L2 select VIDEOBUF2_DMA_CONTIG select OMAP2_VRFB if ARCH_OMAP2 || ARCH_OMAP3
- select FRAME_VECTOR help V4L2 Display driver support for OMAP2/3 based boards.
diff --git a/include/linux/mm.h b/include/linux/mm.h index 16b799a0522c..acd60fbf1a5a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1743,48 +1743,6 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, struct task_struct *task, bool bypass_rlim);
-/* Container for pinned pfns / pages */ -struct frame_vector {
- unsigned int nr_allocated; /* Number of frames we have space for */
- unsigned int nr_frames; /* Number of frames stored in ptrs array */
- bool got_ref; /* Did we pin pages by getting page ref? */
- bool is_pfns; /* Does array contain pages or pfns? */
- void *ptrs[]; /* Array of pinned pfns / pages. Use
* pfns_vector_pages() or pfns_vector_pfns()
* for access */
-};
-struct frame_vector *frame_vector_create(unsigned int nr_frames); -void frame_vector_destroy(struct frame_vector *vec); -int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
unsigned int gup_flags, struct frame_vector *vec);
-void put_vaddr_frames(struct frame_vector *vec); -int frame_vector_to_pages(struct frame_vector *vec); -void frame_vector_to_pfns(struct frame_vector *vec);
-static inline unsigned int frame_vector_count(struct frame_vector *vec) -{
- return vec->nr_frames;
-}
-static inline struct page **frame_vector_pages(struct frame_vector *vec) -{
- if (vec->is_pfns) {
int err = frame_vector_to_pages(vec);
if (err)
return ERR_PTR(err);
- }
- return (struct page **)(vec->ptrs);
-}
-static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) -{
- if (!vec->is_pfns)
frame_vector_to_pfns(vec);
- return (unsigned long *)(vec->ptrs);
-}
struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h index bbb3f26fbde9..a2e75ca0334f 100644 --- a/include/media/videobuf2-core.h +++ b/include/media/videobuf2-core.h @@ -1254,4 +1254,46 @@ bool vb2_request_object_is_buffer(struct media_request_object *obj); */ unsigned int vb2_request_buffer_cnt(struct media_request *req);
+/* Container for pinned pfns / pages in frame_vector.c */ +struct frame_vector {
- unsigned int nr_allocated; /* Number of frames we have space for */
- unsigned int nr_frames; /* Number of frames stored in ptrs array */
- bool got_ref; /* Did we pin pages by getting page ref? */
- bool is_pfns; /* Does array contain pages or pfns? */
- void *ptrs[]; /* Array of pinned pfns / pages. Use
* pfns_vector_pages() or pfns_vector_pfns()
* for access */
+};
+struct frame_vector *frame_vector_create(unsigned int nr_frames); +void frame_vector_destroy(struct frame_vector *vec); +int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
unsigned int gup_flags, struct frame_vector *vec);
+void put_vaddr_frames(struct frame_vector *vec); +int frame_vector_to_pages(struct frame_vector *vec); +void frame_vector_to_pfns(struct frame_vector *vec);
+static inline unsigned int frame_vector_count(struct frame_vector *vec) +{
- return vec->nr_frames;
+}
+static inline struct page **frame_vector_pages(struct frame_vector *vec) +{
- if (vec->is_pfns) {
int err = frame_vector_to_pages(vec);
if (err)
return ERR_PTR(err);
- }
- return (struct page **)(vec->ptrs);
+}
+static inline unsigned long *frame_vector_pfns(struct frame_vector *vec) +{
- if (!vec->is_pfns)
frame_vector_to_pfns(vec);
- return (unsigned long *)(vec->ptrs);
+}
#endif /* _MEDIA_VIDEOBUF2_CORE_H */
Please place those into a include/media/frame_vector.h file, instead of merging it directly at vb2 core header.
Then include the new header at videobuf2-core.h and at frame_vector.c.
Makes sense, I'll do that for v3.
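A minimal sketch of what that split could look like, simply moving the declarations from the hunk above into a new include/media/frame_vector.h (the actual v3 file may of course differ):

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Sketch of a possible include/media/frame_vector.h, carrying only the
 * declarations from the hunk above; the actual v3 header may differ.
 */
#ifndef _MEDIA_FRAME_VECTOR_H
#define _MEDIA_FRAME_VECTOR_H

#include <linux/err.h>
#include <linux/mm_types.h>
#include <linux/types.h>

/* Container for pinned pfns / pages in frame_vector.c */
struct frame_vector {
	unsigned int nr_allocated;	/* Number of frames we have space for */
	unsigned int nr_frames;		/* Number of frames stored in ptrs array */
	bool got_ref;			/* Did we pin pages by getting page ref? */
	bool is_pfns;			/* Does array contain pages or pfns? */
	void *ptrs[];			/* Array of pinned pfns / pages. */
};

struct frame_vector *frame_vector_create(unsigned int nr_frames);
void frame_vector_destroy(struct frame_vector *vec);
int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
		     unsigned int gup_flags, struct frame_vector *vec);
void put_vaddr_frames(struct frame_vector *vec);
int frame_vector_to_pages(struct frame_vector *vec);
void frame_vector_to_pfns(struct frame_vector *vec);

static inline unsigned int frame_vector_count(struct frame_vector *vec)
{
	return vec->nr_frames;
}

static inline struct page **frame_vector_pages(struct frame_vector *vec)
{
	if (vec->is_pfns) {
		int err = frame_vector_to_pages(vec);

		if (err)
			return ERR_PTR(err);
	}
	return (struct page **)(vec->ptrs);
}

static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
{
	if (!vec->is_pfns)
		frame_vector_to_pfns(vec);
	return (unsigned long *)(vec->ptrs);
}

#endif /* _MEDIA_FRAME_VECTOR_H */

videobuf2-core.h and frame_vector.c would then just pull in <media/frame_vector.h>.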
With such changes:
Acked-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org
Thanks for taking a look. -Daniel
diff --git a/mm/Kconfig b/mm/Kconfig index 6c974888f86f..da6c943fe9f1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -815,9 +815,6 @@ config DEVICE_PRIVATE memory; i.e., memory that is only accessible from the device (or group of devices). You likely also want to select HMM_MIRROR.
-config FRAME_VECTOR
- bool
config ARCH_USES_HIGH_VMA_FLAGS bool config ARCH_HAS_PKEYS diff --git a/mm/Makefile b/mm/Makefile index d5649f1c12c0..a025fd6c6afd 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -111,7 +111,6 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o -obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
Thanks, Mauro
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since ioremap might need to manipulate pagetables too we need to drop the pt lock and have a retry loop if we raced.
While at it, also add kerneldoc and improve the comment for the vma_ops->access function. It's for accessing, not for moving the memory from iomem to system memory, as the old comment seemed to suggest.
References: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Rik van Riel riel@redhat.com Cc: Benjamin Herrensmidt benh@kernel.crashing.org Cc: Dave Airlie airlied@linux.ie Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com -- v2: Fix inversion in the retry check (John). --- include/linux/mm.h | 3 ++- mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index acd60fbf1a5a..2a16631c1fda 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -566,7 +566,8 @@ struct vm_operations_struct { vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf);
/* called by access_process_vm when get_user_pages() fails, typically - * for use by special VMAs that can switch between memory and hardware + * for use by special VMAs. See also generic_access_phys() for a generic + * implementation useful for any iomem mapping. */ int (*access)(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write); diff --git a/mm/memory.c b/mm/memory.c index fcfc4ca36eba..f7cbc4dde0ef 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4873,28 +4873,68 @@ int follow_phys(struct vm_area_struct *vma, return ret; }
+/** + * generic_access_phys - generic implementation for iomem mmap access + * @vma: the vma to access + * @addr: userspace addres, not relative offset within @vma + * @buf: buffer to read/write + * @len: length of transfer + * @write: set to FOLL_WRITE when writing, otherwise reading + * + * This is a generic implementation for &vm_operations_struct.access for an + * iomem mapping. This callback is used by access_process_vm() when the @vma is + * not page based. + */ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, void *buf, int len, int write) { resource_size_t phys_addr; unsigned long prot = 0; void __iomem *maddr; + pte_t *ptep, pte; + spinlock_t *ptl; int offset = addr & (PAGE_SIZE-1); + int ret = -EINVAL; + + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) + return -EINVAL; + +retry: + if (follow_pte(vma->vm_mm, addr, &ptep, &ptl)) + return -EINVAL; + pte = *ptep; + pte_unmap_unlock(ptep, ptl);
- if (follow_phys(vma, addr, write, &prot, &phys_addr)) + prot = pgprot_val(pte_pgprot(pte)); + phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT; + + if ((write & FOLL_WRITE) && !pte_write(pte)) return -EINVAL;
maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot); if (!maddr) return -ENOMEM;
+ if (follow_pte(vma->vm_mm, addr, &ptep, &ptl)) + goto out_unmap; + + if (!pte_same(pte, *ptep)) { + pte_unmap_unlock(ptep, ptl); + iounmap(maddr); + + goto retry; + } + if (write) memcpy_toio(maddr + offset, buf, len); else memcpy_fromio(buf, maddr + offset, len); + ret = len; + pte_unmap_unlock(ptep, ptl); +out_unmap: iounmap(maddr);
- return len; + return ret; } EXPORT_SYMBOL_GPL(generic_access_phys); #endif
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since zpci_memcpy_from|toio seems to not do anything nefarious with locks we just need to open code get_pfn and follow_pfn and make sure we drop the locks only after we've done. The write function also needs the copy_from_user move, since we can't take userspace faults while holding the mmap sem.
Reviewed-by: Gerald Schaefer gerald.schaefer@linux.ibm.com
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org -- v2: Move VM_IO | VM_PFNMAP checks around so they keep returning EINVAL like before (Gerard) --- arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..1a6adbc68ee8 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access, - unsigned long *pfn) -{ - struct vm_area_struct *vma; - long ret; - - mmap_read_lock(current->mm); - ret = -EINVAL; - vma = find_vma(current->mm, user_addr); - if (!vma) - goto out; - ret = -EACCES; - if (!(vma->vm_flags & access)) - goto out; - ret = follow_pfn(vma, user_addr, pfn); -out: - mmap_read_unlock(current->mm); - return ret; -} - SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf; - unsigned long pfn; + struct vm_area_struct *vma; + pte_t *ptep; + spinlock_t *ptl; long ret;
if (!zpci_is_enabled()) @@ -158,7 +140,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, * We only support write access to MIO capable devices if we are on * a MIO enabled system. Otherwise we would have to check for every * address if it is a special ZPCI_ADDR and would have to do - * a get_pfn() which we don't need for MIO capable devices. Currently + * a pfn lookup which we don't need for MIO capable devices. Currently * ISM devices are the only devices without MIO support and there is no * known need for accessing these from userspace. */ @@ -176,21 +158,37 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, } else buf = local_buf;
- ret = get_pfn(mmio_addr, VM_WRITE, &pfn); + ret = -EFAULT; + if (copy_from_user(buf, user_buffer, length)) + goto out_free; + + mmap_read_lock(current->mm); + ret = -EINVAL; + vma = find_vma(current->mm, mmio_addr); + if (!vma) + goto out_unlock_mmap; + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) + goto out_unlock_mmap; + ret = -EACCES; + if (!(vma->vm_flags & VM_WRITE)) + goto out_unlock_mmap; + + ret = follow_pte_pmd(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl); if (ret) - goto out; - io_addr = (void __iomem *)((pfn << PAGE_SHIFT) | + goto out_unlock_mmap; + + io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) | (mmio_addr & ~PAGE_MASK));
- ret = -EFAULT; if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) - goto out; - - if (copy_from_user(buf, user_buffer, length)) - goto out; + goto out_unlock_pt;
ret = zpci_memcpy_toio(io_addr, buf, length); -out: +out_unlock_pt: + pte_unmap_unlock(ptep, ptl); +out_unlock_mmap: + mmap_read_unlock(current->mm); +out_free: if (buf != local_buf) kfree(buf); return ret; @@ -274,7 +272,9 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, u8 local_buf[64]; void __iomem *io_addr; void *buf; - unsigned long pfn; + struct vm_area_struct *vma; + pte_t *ptep; + spinlock_t *ptl; long ret;
if (!zpci_is_enabled()) @@ -287,7 +287,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, * We only support read access to MIO capable devices if we are on * a MIO enabled system. Otherwise we would have to check for every * address if it is a special ZPCI_ADDR and would have to do - * a get_pfn() which we don't need for MIO capable devices. Currently + * a pfn lookup which we don't need for MIO capable devices. Currently * ISM devices are the only devices without MIO support and there is no * known need for accessing these from userspace. */ @@ -306,22 +306,38 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr, buf = local_buf; }
- ret = get_pfn(mmio_addr, VM_READ, &pfn); + mmap_read_lock(current->mm); + ret = -EINVAL; + vma = find_vma(current->mm, mmio_addr); + if (!vma) + goto out_unlock_mmap; + if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) + goto out_unlock_mmap; + ret = -EACCES; + if (!(vma->vm_flags & VM_WRITE)) + goto out_unlock_mmap; + + ret = follow_pte_pmd(vma->vm_mm, mmio_addr, NULL, &ptep, NULL, &ptl); if (ret) - goto out; - io_addr = (void __iomem *)((pfn << PAGE_SHIFT) | (mmio_addr & ~PAGE_MASK)); + goto out_unlock_mmap; + + io_addr = (void __iomem *)((pte_pfn(*ptep) << PAGE_SHIFT) | + (mmio_addr & ~PAGE_MASK));
if ((unsigned long) io_addr < ZPCI_IOMAP_ADDR_BASE) { ret = -EFAULT; - goto out; + goto out_unlock_pt; } ret = zpci_memcpy_fromio(buf, io_addr, length); - if (ret) - goto out; - if (copy_to_user(user_buffer, buf, length)) + +out_unlock_pt: + pte_unmap_unlock(ptep, ptl); +out_unlock_mmap: + mmap_read_unlock(current->mm); + + if (!ret && copy_to_user(user_buffer, buf, length)) ret = -EFAULT;
-out: if (buf != local_buf) kfree(buf); return ret;
Hi Daniel,
freshly back from my vacation I've just taken a look at your patch. First thanks for this fix and the detailed commit description. Definitely makes sense to fix this and you can add my
Acked-by: Niklas Schnelle schnelle@linux.ibm.com
Content-wise it all looks sane and clear, and since Gerald did the testing I would have applied it to our tree already, but I got some trivial checkpatch violations that probably apply to the whole series. I've commented on them inline below. If you confirm, I can do the fixups when applying, or you can resend.
On 10/9/20 9:59 AM, Daniel Vetter wrote:
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to
cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
The above commit mention should use the format 'commit 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")' otherwise this results in a checkpatch ERROR.
("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since zpci_memcpy_from|toio seems to not do anything nefarious with locks we just need to open code get_pfn and follow_pfn and make sure we drop the locks only after we've done. The write function also needs
Just a typo I spotted: "we're" instead of "we've".
the copy_from_user move, since we can't take userspace faults while holding the mmap sem.
Reviewed-by: Gerald Schaefer gerald.schaefer@linux.ibm.com
There should be no empty line after the Reviewed-by tag.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Your Signed-off-by mail address does not match the one you're sending from; this yields a checkpatch warning when using git am with your mail. This is probably just a silly misconfiguration, but since Signed-offs are signatures, should I change this to "Daniel Vetter daniel.vetter@ffwll.ch", which is the one you're sending from and also the one in the MAINTAINERS file?
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com
The above Cc: line for Dan Williams is a duplicate
Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org -- v2: Move VM_IO | VM_PFNMAP checks around so they keep returning EINVAL like before (Gerard)
I think the above should go before the CC/Signed-off/Reviewed-by block.
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..1a6adbc68ee8 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
- struct vm_area_struct *vma;
- long ret;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, user_addr);
- if (!vma)
goto out;
- ret = -EACCES;
- if (!(vma->vm_flags & access))
goto out;
- ret = follow_pfn(vma, user_addr, pfn);
-out:
- mmap_read_unlock(current->mm);
- return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
- unsigned long pfn;
- struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
With checkpatch.pl --strict the above yields a "CHECK: spinlock_t definition without comment" complaint, but I think that's really okay since your commit description is very clear. Same on line 277.
... snip ...
On Mon, Oct 12, 2020 at 04:03:28PM +0200, Niklas Schnelle wrote:
Hi Daniel,
freshly back from my vacation I've just taken a look at your patch. First thanks for this fix and the detailed commit description. Definitely makes sense to fix this and you can add my
Acked-by: Niklas Schnelle schnelle@linux.ibm.com
Content-wise it all looks sane and clear, and since Gerald did the testing I would have applied it to our tree already, but I got some trivial checkpatch violations that probably apply to the whole series. I've commented on them inline below. If you confirm, I can do the fixups when applying, or you can resend.
On 10/9/20 9:59 AM, Daniel Vetter wrote:
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to
cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87
The above commit mention should use the format 'commit 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")' otherwise this results in a checkpatch ERROR.
("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea. Fix this.
Since zpci_memcpy_from|toio seems to not do anything nefarious with locks we just need to open code get_pfn and follow_pfn and make sure we drop the locks only after we've done. The write function also needs
Just a typo I spotted: "we're" instead of "we've".
the copy_from_user move, since we can't take userspace faults while holding the mmap sem.
Reviewed-by: Gerald Schaefer gerald.schaefer@linux.ibm.com
There should be no empty line after the Reviewed-by tag.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Your Signed-off-by mail address does not match the one you're sending from; this yields a checkpatch warning when using git am with your mail. This is probably just a silly misconfiguration, but since Signed-offs are signatures, should I change this to "Daniel Vetter daniel.vetter@ffwll.ch", which is the one you're sending from and also the one in the MAINTAINERS file?
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com
The above Cc: line for Dan Williams is a duplicate
Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org -- v2: Move VM_IO | VM_PFNMAP checks around so they keep returning EINVAL like before (Gerard)
I think the above should go before the CC/Signed-off/Reviewed-by block.
This is a per-subsystem bikeshed :-) drivers/gpu definitely wants it above, but most core subsystems want it below. I'll move it.
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..1a6adbc68ee8 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
- struct vm_area_struct *vma;
- long ret;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, user_addr);
- if (!vma)
goto out;
- ret = -EACCES;
- if (!(vma->vm_flags & access))
goto out;
- ret = follow_pfn(vma, user_addr, pfn);
-out:
- mmap_read_unlock(current->mm);
- return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
- unsigned long pfn;
- struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
With checkpatch.pl --strict the above yields a "CHECK: spinlock_t definition without comment" complaint, but I think that's really okay since your commit description is very clear. Same on line 277.
I think this is a false positive; checkpatch doesn't realize that SYSCALL_DEFINE3 is a function, not a structure. And in a structure I'd have added the kerneldoc or a comment.
I'll fix up all the nits you've found for the next round. Thanks for taking a look. -Daniel
... snip ...
Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org -- v2: Move VM_IO | VM_PFNMAP checks around so they keep returning EINVAL like before (Gerard)
I think the above should go before the CC/Signed-off/Reviewed-by block.
This is a per-subsystem bikeshed :-) drivers/gpu definitely wants it above, but most core subsystems want it below. I'll move it.
Today I learned, thanks! That said, most of the time I've actually not seen version-change information in the commit message itself, only in the cover letter. I really don't care, it just looked odd to me.
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..1a6adbc68ee8 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
- struct vm_area_struct *vma;
- long ret;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, user_addr);
- if (!vma)
goto out;
- ret = -EACCES;
- if (!(vma->vm_flags & access))
goto out;
- ret = follow_pfn(vma, user_addr, pfn);
-out:
- mmap_read_unlock(current->mm);
- return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
- unsigned long pfn;
- struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
With checkpatch.pl --strict the above yields a "CHECK: spinlock_t definition without comment" complaint, but I think that's really okay since your commit description is very clear. Same on line 277.
I think this is a false positive; checkpatch doesn't realize that SYSCALL_DEFINE3 is a function, not a structure. And in a structure I'd have added the kerneldoc or a comment.
Interesting, your theory sounds convincing; I too thought this was a bit too pedantic.
I'll fix up all the nits you've found for the next round. Thanks for taking a look.
You're welcome, hope I didn't sound pedantic. You've got a lot more experience actually, and this can indeed turn into bikeshedding, but since I was answering anyway and most of this was checkpatch…
-Daniel
Hi Daniel,
friendly ping. I haven't seen a new version of this patch series; as I said, I think your change for s390/pci is generally useful, so I'm curious: are you planning on sending a new version soon? If you want, you can also just send this patch with the last few nitpicks (primarily the mail address) fixed and I'll happily apply it.
Best regards, Niklas Schnelle
On 10/12/20 4:19 PM, Daniel Vetter wrote:
On Mon, Oct 12, 2020 at 04:03:28PM +0200, Niklas Schnelle wrote:
... snip ....
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com
The above Cc: line for Dan Williams is a duplicate
Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org -- v2: Move VM_IO | VM_PFNMAP checks around so they keep returning EINVAL like before (Gerard)
I think the above should go before the CC/Signed-off/Reviewed-by block.
This is a per-subsystem bikeshed :-) drivers/gpu definitely wants it above, but most core subsystems want it below. I'll move it.
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..1a6adbc68ee8 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
- struct vm_area_struct *vma;
- long ret;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, user_addr);
- if (!vma)
goto out;
- ret = -EACCES;
- if (!(vma->vm_flags & access))
goto out;
- ret = follow_pfn(vma, user_addr, pfn);
-out:
- mmap_read_unlock(current->mm);
- return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
- unsigned long pfn;
- struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
With checkpatch.pl --strict the above yields a "CHECK: spinlock_t definition without comment" complaint, but I think that's really okay since your commit description is very clear. Same on line 277.
I think this is a false positive; checkpatch doesn't realize that SYSCALL_DEFINE3 is a function, not a structure. And in a structure I'd have added the kerneldoc or a comment.
I'll fix up all the nits you've found for the next round. Thanks for taking a look. -Daniel
On Wed, Oct 21, 2020 at 09:55:57AM +0200, Niklas Schnelle wrote:
Hi Daniel,
friendly ping. I haven't seen a new version of this patch series; as I said, I think your change for s390/pci is generally useful, so I'm curious: are you planning on sending a new version soon? If you want, you can also just send this patch with the last few nitpicks (primarily the mail address) fixed and I'll happily apply it.
(I think this was stuck somewhere in moderation, only showed up just now)
I was waiting for the testing result for the habana driver from Oded, but I guess Oded was waiting for v3. Hence the delay.
Cheers, Daniel
Best regards, Niklas Schnelle
On 10/12/20 4:19 PM, Daniel Vetter wrote:
On Mon, Oct 12, 2020 at 04:03:28PM +0200, Niklas Schnelle wrote:
... snip ....
Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Dan Williams dan.j.williams@intel.com Cc: Kees Cook keescook@chromium.org Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com
The above Cc: line for Dan Williams is a duplicate
Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Niklas Schnelle schnelle@linux.ibm.com Cc: Gerald Schaefer gerald.schaefer@linux.ibm.com Cc: linux-s390@vger.kernel.org -- v2: Move VM_IO | VM_PFNMAP checks around so they keep returning EINVAL like before (Gerard)
I think the above should go before the CC/Signed-off/Reviewed-by block.
This is a per-subsystem bikeshed :-) drivers/gpu definitely wants it above, but most core subsystems want it below. I'll move it.
arch/s390/pci/pci_mmio.c | 98 +++++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 41 deletions(-)
diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c index 401cf670a243..1a6adbc68ee8 100644 --- a/arch/s390/pci/pci_mmio.c +++ b/arch/s390/pci/pci_mmio.c @@ -119,33 +119,15 @@ static inline int __memcpy_toio_inuser(void __iomem *dst, return rc; }
-static long get_pfn(unsigned long user_addr, unsigned long access,
unsigned long *pfn)
-{
- struct vm_area_struct *vma;
- long ret;
- mmap_read_lock(current->mm);
- ret = -EINVAL;
- vma = find_vma(current->mm, user_addr);
- if (!vma)
goto out;
- ret = -EACCES;
- if (!(vma->vm_flags & access))
goto out;
- ret = follow_pfn(vma, user_addr, pfn);
-out:
- mmap_read_unlock(current->mm);
- return ret;
-}
SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr, const void __user *, user_buffer, size_t, length) { u8 local_buf[64]; void __iomem *io_addr; void *buf;
- unsigned long pfn;
- struct vm_area_struct *vma;
- pte_t *ptep;
- spinlock_t *ptl;
With checkpatch.pl --strict the above yields a "CHECK: spinlock_t definition without comment" complaint, but I think that's really okay since your commit description is very clear. Same on line 277.
I think this is a false positive; checkpatch doesn't realize that SYSCALL_DEFINE3 is a function, not a structure. And in a structure I'd have added the kerneldoc or a comment.
I'll fix up all the nits you've found for the next round. Thanks for taking a look. -Daniel
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: kvm@vger.kernel.org --- include/linux/mm.h | 2 ++ mm/memory.c | 32 +++++++++++++++++++++++++++++++- mm/nommu.c | 17 +++++++++++++++++ security/Kconfig | 13 +++++++++++++ 4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2a16631c1fda..ec8c90928fc9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1653,6 +1653,8 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address, pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); +int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address, + unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, unsigned int flags, unsigned long *prot, resource_size_t *phys); int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, diff --git a/mm/memory.c b/mm/memory.c index f7cbc4dde0ef..7c7b234ffb24 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4821,7 +4821,12 @@ EXPORT_SYMBOL(follow_pte_pmd); * @address: user virtual address * @pfn: location to store found PFN * - * Only IO mappings and raw PFN mappings are allowed. + * Only IO mappings and raw PFN mappings are allowed. Note that callers must + * ensure coherency with pte updates by using a &mmu_notifier to follow updates. + * If this is not feasible, or the access to the @pfn is only very short term, + * use follow_pte_pmd() instead and hold the pagetable lock for the duration of + * the access instead. Any caller not following these requirements must use + * unsafe_follow_pfn() instead. * * Return: zero and the pfn at @pfn on success, -ve otherwise. */ @@ -4844,6 +4849,31 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL(follow_pfn);
+/** + * unsafe_follow_pfn - look up PFN at a user virtual address + * @vma: memory mapping + * @address: user virtual address + * @pfn: location to store found PFN + * + * Only IO mappings and raw PFN mappings are allowed. + * + * Returns zero and the pfn at @pfn on success, -ve otherwise. + */ +int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address, + unsigned long *pfn) +{ +#ifdef CONFIG_STRICT_FOLLOW_PFN + pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n"); + return -EINVAL; +#else + WARN_ONCE(1, "unsafe follow_pfn usage\n"); + add_taint(TAINT_USER, LOCKDEP_STILL_OK); + + return follow_pfn(vma, address, pfn); +#endif +} +EXPORT_SYMBOL(unsafe_follow_pfn); + #ifdef CONFIG_HAVE_IOREMAP_PROT int follow_phys(struct vm_area_struct *vma, unsigned long address, unsigned int flags, diff --git a/mm/nommu.c b/mm/nommu.c index 75a327149af1..3db2910f0d64 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -132,6 +132,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL(follow_pfn);
+/** + * unsafe_follow_pfn - look up PFN at a user virtual address + * @vma: memory mapping + * @address: user virtual address + * @pfn: location to store found PFN + * + * Only IO mappings and raw PFN mappings are allowed. + * + * Returns zero and the pfn at @pfn on success, -ve otherwise. + */ +int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address, + unsigned long *pfn) +{ + return follow_pfn(vma, address, pfn); +} +EXPORT_SYMBOL(unsafe_follow_pfn); + LIST_HEAD(vmap_area_list);
void vfree(const void *addr) diff --git a/security/Kconfig b/security/Kconfig index 7561f6f99f1d..48945402e103 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -230,6 +230,19 @@ config STATIC_USERMODEHELPER_PATH If you wish for all usermode helper programs to be disabled, specify an empty string here (i.e. "").
+config STRICT_FOLLOW_PFN + bool "Disable unsafe use of follow_pfn" + depends on MMU + help + Some functionality in the kernel follows userspace mappings to iomem + ranges in an unsafe matter. Examples include v4l userptr for zero-copy + buffers sharing. + + If this option is switched on, such access is rejected. Only enable + this option when you must run userspace which requires this. + + If in doubt, say Y. + source "security/selinux/Kconfig" source "security/smack/Kconfig" source "security/tomoyo/Kconfig"
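The later patches in the series then annotate callers with a change of roughly this shape; the caller below is purely hypothetical and only illustrates the annotation, it is not taken from vfio or videobuf:

/*
 * Hypothetical caller, shown only to illustrate the annotation; not taken
 * from vfio or videobuf.
 */
#include <linux/mm.h>

static int example_lookup_pfn(struct vm_area_struct *vma, unsigned long addr,
			      unsigned long *pfn)
{
	/*
	 * Before: return follow_pfn(vma, addr, pfn); with no protection
	 * against concurrent pte updates.
	 *
	 * After: same lookup, but it taints the kernel (or is rejected
	 * outright with CONFIG_STRICT_FOLLOW_PFN) so the unsafety is visible.
	 */
	return unsafe_follow_pfn(vma, addr, pfn);
}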
Hi,
Em Fri, 9 Oct 2020 09:59:26 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to
cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
NACK, as this breaks an existing userspace API on media.
While I agree that using userptr on media is something that new drivers may not support, as DMABUF is a better way of handling it, changing this for existing ones is a big no, as it may break userspace.
The right approach here would be to be able to fine-tune support for it on a per-driver basis, e.g. disabling the feature only for drivers that would use movable pages.
The media subsystem already has a way to disable USERPTR support in VB2; the right approach would be to ensure that newer drivers only set it if they won't use movable pages.
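For illustration, that per-driver knob is the vb2_queue io_modes mask; a driver that must not offer raw userptr access simply leaves VB2_USERPTR out. A hedged sketch, not from any real driver:

/*
 * Hedged sketch, not from any real driver: which I/O modes a vb2 driver
 * offers is controlled by vb2_queue.io_modes, so leaving VB2_USERPTR out of
 * the mask is the existing per-driver way to refuse userptr buffers.
 */
#include <media/videobuf2-v4l2.h>

static void example_disable_userptr(struct vb2_queue *q)
{
	q->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	/* Offer only MMAP and DMABUF; omitting VB2_USERPTR disables userptr. */
	q->io_modes = VB2_MMAP | VB2_DMABUF;
	/*
	 * q->ops, q->mem_ops, q->buf_struct_size, etc. are filled in as usual
	 * before the driver calls vb2_queue_init(q).
	 */
}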
Regards, Mauro
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: kvm@vger.kernel.org
include/linux/mm.h | 2 ++ mm/memory.c | 32 +++++++++++++++++++++++++++++++- mm/nommu.c | 17 +++++++++++++++++ security/Kconfig | 13 +++++++++++++ 4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2a16631c1fda..ec8c90928fc9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1653,6 +1653,8 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address, pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); +int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
unsigned long *pfn);
int follow_phys(struct vm_area_struct *vma, unsigned long address, unsigned int flags, unsigned long *prot, resource_size_t *phys); int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, diff --git a/mm/memory.c b/mm/memory.c index f7cbc4dde0ef..7c7b234ffb24 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4821,7 +4821,12 @@ EXPORT_SYMBOL(follow_pte_pmd);
- @address: user virtual address
- @pfn: location to store found PFN
- Only IO mappings and raw PFN mappings are allowed.
- Only IO mappings and raw PFN mappings are allowed. Note that callers must
- ensure coherency with pte updates by using a &mmu_notifier to follow updates.
- If this is not feasible, or the access to the @pfn is only very short term,
- use follow_pte_pmd() instead and hold the pagetable lock for the duration of
- the access instead. Any caller not following these requirements must use
*/
- unsafe_follow_pfn() instead.
- Return: zero and the pfn at @pfn on success, -ve otherwise.
@@ -4844,6 +4849,31 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL(follow_pfn);
+/**
- unsafe_follow_pfn - look up PFN at a user virtual address
- @vma: memory mapping
- @address: user virtual address
- @pfn: location to store found PFN
- Only IO mappings and raw PFN mappings are allowed.
- Returns zero and the pfn at @pfn on success, -ve otherwise.
- */
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
- unsigned long *pfn)
+{ +#ifdef CONFIG_STRICT_FOLLOW_PFN
- pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
- return -EINVAL;
+#else
- WARN_ONCE(1, "unsafe follow_pfn usage\n");
- add_taint(TAINT_USER, LOCKDEP_STILL_OK);
- return follow_pfn(vma, address, pfn);
+#endif +} +EXPORT_SYMBOL(unsafe_follow_pfn);
#ifdef CONFIG_HAVE_IOREMAP_PROT int follow_phys(struct vm_area_struct *vma, unsigned long address, unsigned int flags, diff --git a/mm/nommu.c b/mm/nommu.c index 75a327149af1..3db2910f0d64 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -132,6 +132,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL(follow_pfn);
+/**
- unsafe_follow_pfn - look up PFN at a user virtual address
- @vma: memory mapping
- @address: user virtual address
- @pfn: location to store found PFN
- Only IO mappings and raw PFN mappings are allowed.
- Returns zero and the pfn at @pfn on success, -ve otherwise.
- */
+int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
- unsigned long *pfn)
+{
- return follow_pfn(vma, address, pfn);
+} +EXPORT_SYMBOL(unsafe_follow_pfn);
LIST_HEAD(vmap_area_list);
void vfree(const void *addr) diff --git a/security/Kconfig b/security/Kconfig index 7561f6f99f1d..48945402e103 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -230,6 +230,19 @@ config STATIC_USERMODEHELPER_PATH If you wish for all usermode helper programs to be disabled, specify an empty string here (i.e. "").
+config STRICT_FOLLOW_PFN
- bool "Disable unsafe use of follow_pfn"
- depends on MMU
- help
Some functionality in the kernel follows userspace mappings to iomem
ranges in an unsafe matter. Examples include v4l userptr for zero-copy
buffers sharing.
If this option is switched on, such access is rejected. Only enable
this option when you must run userspace which requires this.
If in doubt, say Y.
source "security/selinux/Kconfig" source "security/smack/Kconfig" source "security/tomoyo/Kconfig"
Thanks, Mauro
On Fri, Oct 09, 2020 at 12:34:21PM +0200, Mauro Carvalho Chehab wrote:
Hi,
Em Fri, 9 Oct 2020 09:59:26 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
Way back, it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to
cma regions. This means if we miss the unmap, the pfn might contain pagecache or anon memory (well, anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
NACK, as this breaks an existing userspace API on media.
It doesn't break it. You get a big warning the thing is broken and it keeps working.
We can't leave such a huge security hole open - it impacts other subsystems, distros need to be able to run in a secure mode.
While I agree that using userptr on media is something that new drivers may not support, as DMABUF is a better way of handling it, changing this for existing ones is a big no, as it may break userspace.
media community needs to work to fix this, not pretend it is OK to keep going as-is.
Dealing with security issues is the one case where an uABI break might be acceptable.
If you want to NAK it then you need to come up with the work to do something here correctly that will support the old drivers without the kernel taint.
Unfortunately making things uncomfortable for the subsystem is the big hammer the core kernel needs to use to actually get this security work done by those responsible.
Jason
Em Fri, 9 Oct 2020 09:21:11 -0300 Jason Gunthorpe jgg@ziepe.ca escreveu:
On Fri, Oct 09, 2020 at 12:34:21PM +0200, Mauro Carvalho Chehab wrote:
Hi,
Em Fri, 9 Oct 2020 09:59:26 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
Way back it was a reasonable assumption that iomem mappings never change the pfn range they point at. But this has changed:

- gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved

- contiguous dma allocations have moved from dedicated carveouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with __GFP_MOVABLE)

- even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")

Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
NACK, as this breaks an existing userspace API on media.
It doesn't break it. You get a big warning the thing is broken and it keeps working.
We can't leave such a huge security hole open - it impacts other subsystems, distros need to be able to run in a secure mode.
Well, if distros disable it, then apps will break.
While I agree that using the userptr on media is something that new drivers may not support, as DMABUF is a better way of handling it, changing this for existing ones is a big no, as it may break userspace.
media community needs to work to fix this, not pretend it is OK to keep going as-is.
Dealing with security issues is the one case where an uABI break might be acceptable.
If you want to NAK it then you need to come up with the work to do something here correctly that will support the old drivers without the kernel taint.
Unfortunately making things uncomfortable for the subsystem is the big hammer the core kernel needs to use to actually get this security work done by those responsible.
I'm not pretending that this is ok. Just pointing out that the approach taken is NOT OK.
I'm not a mm/ expert, but, from what I understood from Daniel's patch description, this is unsafe *only if* __GFP_MOVABLE is used.
Well, no drivers inside the media subsystem use such a flag, although they may rely on some infrastructure that could be using it under the hood.
If this is the case, the proper fix seems to be adding a GFP_NOT_MOVABLE flag that would prevent the core mm code from setting __GFP_MOVABLE.
Please let's address the issue this way, instead of breaking a userspace API that has been there since 1991.
Thanks, Mauro
Em Fri, 9 Oct 2020 14:37:23 +0200 Mauro Carvalho Chehab mchehab+huawei@kernel.org escreveu:
In time: I meant to say 1998.
Thanks, Mauro
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description is that this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were created via a special mmap helper from the media subsystem.
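To make the "list of OK VMAs" idea concrete, the check would boil down to something like this purely hypothetical sketch, where media_longterm_vm_ops stands in for whatever vm_ops such a special mmap helper would install:

static const struct vm_operations_struct media_longterm_vm_ops; /* hypothetical */

/*
 * Accept a VMA for pfn lookup only if it was created by an mmap helper
 * that is known to keep the pfns valid for the mapping's lifetime.
 */
static bool vma_ok_for_userptr(struct vm_area_struct *vma)
{
	return vma->vm_ops == &media_longterm_vm_ops;
}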
Well, no drivers inside the media subsystem uses such flag, although they may rely on some infrastructure that could be using it behind the bars.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, every mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
Please let address the issue on this way, instead of broken an userspace API that it is there since 1991.
It has happened before :( It took 4 years for RDMA to undo the uAPI breakage caused by a security fix for something that was a 15-year-old bug.
Jason
On Fri, Oct 9, 2020 at 2:48 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description is that this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were creatd via a special mmap helper from the media subsystem.
Well, no drivers inside the media subsystem uses such flag, although they may rely on some infrastructure that could be using it behind the bars.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
I think a good first step would be to disable userptr of non struct page backed storage going forward for any new hw support. Even on existing drivers. dma-buf sharing has been around for long enough now that this shouldn't be a problem. Unfortunately right now this doesn't seem to exist, so the entire problem keeps getting perpetuated.
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, evey mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:

- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
- add dma-buf export support to fbdev and v4l
- roll this out everywhere we still need it.
Realistically this just isn't going to happen. And anything else just reimplements half of dma-buf, which is kinda pointless (you minimally need refcounting and some way to get at a promise of a permanent sg list for dma, plus probably the vmap for kernel cpu access).
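For the first bullet in the list above, such a helper might look roughly like the sketch below. Nothing like it exists today: dma-buf keeps its is_dma_buf_file() check private, so that part is an assumption, and this only covers mappings created directly on a dma-buf fd.

/*
 * Hypothetical sketch of vma_tryget_dma_buf(): recover the dma-buf
 * backing a VMA when the mapping was created by mmap()ing a dma-buf fd
 * directly.  Exporter-specific mmap paths would need more plumbing.
 */
struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma)
{
	struct dma_buf *dmabuf;

	if (!vma->vm_file || !is_dma_buf_file(vma->vm_file))
		return NULL;

	dmabuf = vma->vm_file->private_data;
	get_dma_buf(dmabuf);	/* caller must dma_buf_put() when done */
	return dmabuf;
}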
Please let address the issue on this way, instead of broken an userspace API that it is there since 1991.
It has happened before :( It took 4 years for RDMA to undo the uAPI breakage caused by a security fix for something that was a 15 years old bug.
Yeah we have a bunch of these on the drm side too. Some of them are really just "you have to upgrade userspace", and there's no real fix for the security nightmare without that. -Daniel
On Fri, Oct 09, 2020 at 07:52:05PM +0200, Daniel Vetter wrote:
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, evey mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:
- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
- add dma-buf export support to fbdev and v4l
- roll this out everywhere we still need it.
It seems to me there is a technical way forward to restore user compat, so it is really no different than RDMA/DRM pain we both suffered before.
Thus no justification to NAK it. If media wants things to keep working they have to do the technical path like you outline above.
Realistically this just isn't going to happen.
If your series goes ahead it will get solved. Someone will take on the huge project to either add DMA buf to the drivers people still care about, or do the work above to transparently handle in kernel.
If we allow things to keep working without consequence then nobody will do it.
The only reason we did the 4 years of work in RDMA was because Linus went in and broke the uABI for a security fix. It was hundreds of patches to fix it, so I don't have much sympathy for "it is too hard" here.
Jason
On Fri, Oct 9, 2020 at 8:01 PM Jason Gunthorpe jgg@ziepe.ca wrote:
Oh fully agreeing with you here, I just wanted to lay out that a) there is a solid plan to fix it and b) it's way too much work for me to just type it as a part of a "learn me some core mm semantics" project :-)
I was hoping that we could get away with a special marker for problematic vma, and filter those out. But after all the digging I've noticed that on anything remotely modern, there's just nothing left. Device memory management has become massively more dynamic in the past 10 years. -Daniel
Em Fri, 9 Oct 2020 19:52:05 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
On Fri, Oct 9, 2020 at 2:48 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description is that this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
I didn't check the dirty mm details, but I strongly suspect that the mm code has a way to prevent moving an mmapped page while it is still in use.
If not, then the entire movable pages concept sounds broken to me, and has to be fixed in the mm subsystem.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were creatd via a special mmap helper from the media subsystem.
I'm not sure if I'm following you on that. The media API can work with different ways of sending buffers to userspace:
- read();
- using the overlay mode. This interface is deprecated in practice, being replaced by DMABUF. Only a few older devices support it, and it depends on a special X11 helper driver to work.
- via DMABUF: https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/dmabuf.html
- via mmap, using a mmap helper: https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/mmap.html
- via mmap, using userspace-provided pointers: https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/userp.html
The existing open-source programs usually choose one or more of the above modes. If the Kernel driver returns -EINVAL when an mmapped streaming I/O mode is not supported, userspace has to select a different method.
Most userspace open source programs have fallback support: if one mmap I/O method fails, it selects another one, although this is not a mandatory requirement. I found (and fixed) a few that were relying exclusively on userptr support, but I didn't make a comprehensive check.
Also there are a number of relevant closed-source apps that we have no idea about what methods they use, like Skype and other similar videoconferencing programs. Breaking support for those, especially at a time when people are relying on them to participate in conferences and hold internal meetings, is a **very bad** idea.
So, whatever solution is taken, it should not be dumping warning messages and tainting the Kernel, but, instead, checking whether the userspace request is valid. If it is invalid, return the proper error code via the right V4L2 ioctl, in order to let userspace choose a different method. If the request is valid, refcount the pages so they are not moved while they're still in use.
-
Let me provide some background about how things work in the media subsystem. If you want to know more, how userspace-provided memory mapped pointers work is described here:
https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/userp.html#...
Basically, userspace calls either one of those ioctls:
VIDIOC_CREATE_BUFS: https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/vidioc-crea...
Which is translated into a videobuf2 call to: vb2_ioctl_create_bufs()
VIDIOC_REQBUFS https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/vidioc-reqb...
Which is translated into a videobuf2 call to: vb2_ioctl_reqbufs()
Both internally call vb2_verify_memory_type(), which is responsible for checking if the provided pointers are OK for the usage, doing all necessary page allocations, and taking care of any special requirements. This could easily gain some additional checks to verify whether the pages at the requested VMA address are movable or not, ensuring that the VMA is OK, and locking the pages in order to prevent the mm code from moving them while they are in use by the media subsystem.
Now, as I said before, I don't know the dirty details about how to lock those pages at the mm level in order to prevent it from moving the used pages. Yet, when vb2_create_framevec() is called, the check whether the VMA is OK should already have happened in vb2_verify_memory_type().
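For reference, "locking the pages" in mm terms means long-term pinning, which is what the FOLL_LONGTERM conversions elsewhere in this series rely on; a minimal sketch of the pattern (the function name and error handling are simplified assumptions):

static int pin_userptr_pages(unsigned long start, int nr_pages,
			     bool write, struct page **pages)
{
	unsigned int gup_flags = FOLL_LONGTERM | (write ? FOLL_WRITE : 0);
	int pinned;

	/*
	 * Pins the pages so they can neither be migrated nor reclaimed
	 * while the driver uses them; undone with unpin_user_pages().
	 */
	pinned = pin_user_pages_fast(start, nr_pages, gup_flags, pages);
	if (pinned == nr_pages)
		return 0;

	if (pinned > 0)
		unpin_user_pages(pages, pinned);
	return pinned < 0 ? pinned : -EFAULT;
}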
-
It should be noted that the dirty hack added by patches 09/17 and 10/17 affects *all* types of memory allocations at V4L2, as this kAPI is called by the 3 different memory models supported at the media subsystem:
drivers/media/common/videobuf2/videobuf2-vmalloc.c
drivers/media/common/videobuf2/videobuf2-dma-contig.c
drivers/media/common/videobuf2/videobuf2-dma-sg.c
In other words, with this code:
int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
	unsigned long *pfn)
{
#ifdef CONFIG_STRICT_FOLLOW_PFN
	pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
	return -EINVAL;
#else
	WARN_ONCE(1, "unsafe follow_pfn usage\n");
	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	return follow_pfn(vma, address, pfn);
#endif
}
you're unconditionally breaking the media userspace API support not only for embedded systems that could be using userptr instead of DMA_BUF, but also for *all* video devices, including USB cameras.
This is **NOT** an acceptable solution.
So, I stand my NACK to this approach.
Well, no drivers inside the media subsystem uses such flag, although they may rely on some infrastructure that could be using it behind the bars.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
I think a good first step would be to disable userptr of non struct page backed storage going forward for any new hw support. Even on existing drivers. dma-buf sharing has been around for long enough now that this shouldn't be a problem. Unfortunately right now this doesn't seem to exist, so the entire problem keeps getting perpetuated.
Well, the media uAPI does support DMABUF (both import and export):
https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/dmabuf.html
https://linuxtv.org/downloads/v4l-dvb-apis-new/userspace-api/v4l/vidioc-expb...
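For reference, exporting a vb2 buffer as a dma-buf fd from userspace only takes the documented VIDIOC_EXPBUF ioctl; a minimal sketch (capture queue assumed, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Export buffer 'index' of an already-requested MMAP queue as a dma-buf
 * fd that can then be imported by DRM, EGL, another video device, etc. */
int export_dmabuf_fd(int video_fd, unsigned int index)
{
	struct v4l2_exportbuffer expbuf;

	memset(&expbuf, 0, sizeof(expbuf));
	expbuf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	expbuf.index = index;
	expbuf.flags = O_RDWR | O_CLOEXEC;

	if (ioctl(video_fd, VIDIOC_EXPBUF, &expbuf) < 0)
		return -1;

	return expbuf.fd;
}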
And I fully agree that newer programs should use DMABUF when sharing buffers with DRM subsystem, but that's not my main concern.
What I do want is to not break userspace support nor to taint the Kernel due to a valid uAPI call.
A valid uAPI call should check if the parameters passed through it are valid. If they are, it should handle them. Otherwise, it should return -EINVAL, without tainting the Kernel or printing warning messages.
The approach taken by patches 09/17 and 10/17 doesn't do that. Instead, they just unconditionally break the media uAPI.
What should be done, instead, is to drop patch 10/17, and work on a way for the code inside vb2_create_framevec() to ensure that, if USERPTR is used, the memory pages will be properly locked while the driver is using them, returning -EINVAL only if there's no way to proceed, without tainting the Kernel.
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, evey mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:
- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
Not sure how a userspace buffer could be mapped to use it, especially since the buffer may not even be imported/exported from the DRM subsystem; it could simply be allocated via glibc calloc()/malloc().
- add dma-buf export support to fbdev and v4l
V4L has support for it already.
- roll this out everywhere we still need it.
That's where things are hard. This is not like DRM, where the APIs are called via some open source libraries that are also managed by DRM upstream developers.
In the case of the media subsystem, while we added a libv4l some time ago, not all userspace apps use it, as a large part of them existed before the addition of the libraries. Also, we're currently trying to deprecate libv4l, at least for embedded systems, in favor of libcamera.
On media, there are lots of closed source apps that use the media API directly. Even talking about open source ones, there are lots of different apps, including not only media-specific apps, but also generic ones, like web browsers, which don't use the libraries we wrote.
A userspace API breakage would take *huge* effort and a lot of time to propagate everywhere. It will cause lots of trouble everywhere.
Realistically this just isn't going to happen.
Why not? Any app developer could already use DMA-BUF if required, as the upstream support was added several years ago.
And anything else just reimplements half of dma-buf,
It is just the opposite: those uAPIs came years before dma-buf. In other words, it was dma-buf that re-implemented it ;-)
Now, I agree with you that dma-buf is way cooler than the past alternatives.
-
It sounds to me that you're considering only one use case of USERPTR: passing a buffer created from DRM. As far as I'm aware, only embedded userspace applications actually do that.
Yet, there are a number of other applications that do something like the userptr_capture() function on this code:
https://git.linuxtv.org/v4l-utils.git/tree/contrib/test/v4l2grab.c
E.g., using glibc allocation functions like calloc() to allocate memory, passing the user-allocated data to the Kernel via something like this:
	struct v4l2_requestbuffers req;
	struct v4l2_buffer buf;
	int n_buffers = 2;

	req.count = 2;
	req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
	req.memory = V4L2_MEMORY_USERPTR;
	if (ioctl(fd, VIDIOC_REQBUFS, &req))
		return errno;

	for (i = 0; i < req.count; ++i) {
		buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
		buf.memory = V4L2_MEMORY_USERPTR;
		buf.index = i;
		buf.m.userptr = (unsigned long)buffers[i].start;
		buf.length = buffers[i].length;
		if (ioctl(fd, VIDIOC_QBUF, &buf))
			return errno;
	}

	if (ioctl(fd, VIDIOC_STREAMON, &req.type))
		return errno;
/* some capture loop */
ioctl(fd, VIDIOC_STREAMOFF, &req.type);
I can't possibly see *any* security issues with the above code.
As I said before, VIDIOC_REQBUFS should be checking if the userspace buffers are OK and ensuring that their refcounts are incremented, in order to prevent mm from moving the pages used there, dropping the refcounts when VIDIOC_STREAMOFF - or close(fd) - is called.
which is kinda pointless (you need minimally refcounting and some way to get at a promise of a permanent sg list for dma. Plus probably the vmap for kernel cpu access.
Yeah, refcounting needs to happen.
Thanks, Mauro
Hi Mauro,
You might want to read the patches more carefully, because what you're demanding is what my patches do. Short summary:
- if STRICT_FOLLOW_PFN is set:
  a) normal memory is handled as-is (i.e. your example works) through the addition of FOLL_LONGTERM. This is the "pin the pages correctly" approach you're demanding
  b) for non-page memory (zerocopy sharing before dma-buf was upstreamed is the only use-case for this) it is correctly rejected with -EINVAL
- if you do have blobby userspace which requires the zero-copy using userptr to work, and doesn't have any of the fallbacks implemented that you describe, this would be a regression. That's why STRICT_FOLLOW_PFN can be unset. And yes there's a real security issue in this usage, Marek also confirmed that the removal of the vma copy code a few years ago essentially broke even the weak assumptions that made the code work 10+ years ago when it was merged.
so tl;dr: Everything you described will keep working even with the new flag set, and everything you demand must be implemented _is_ implemented in this patch series.
Also please keep in mind that we are _not_ talking about the general userptr support that was merged ~20 years ago. This patch series here is _only_ about the zerocopy userptr support merged with 50ac952d2263 ("[media] videobuf2-dma-sg: Support io userptr operations on io memory") in 2013.
Why this hack was merged in 2013 when we merged dma-buf almost 2 years before that I have no idea about. Imo that patch simply should never have landed, and instead dma-buf support prioritized.
Cheers, Daniel
Em Sat, 10 Oct 2020 12:53:49 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
Hi Mauro,
You might want to read the patches more carefully, because what you're demanding is what my patches do. Short summary:
- if STRICT_FOLLOW_PFN is set:
a) normal memory is handled as-is (i.e. your example works) through the addition of FOLL_LONGTERM. This is the "pin the pages correctly" approach you're demanding b) for non-page memory (zerocopy sharing before dma-buf was upstreamed is the only use-case for this) it is correctly rejected with -EINVAL
- if you do have blobby userspace which requires the zero-copy using
userptr to work, and doesn't have any of the fallbacks implemented that you describe, this would be a regression. That's why STRICT_FOLLOW_PFN can be unset. And yes there's a real security issue in this usage, Marek also confirmed that the removal of the vma copy code a few years ago essentially broke even the weak assumptions that made the code work 10+ years ago when it was merged.
so tdlr; Everything you described will keep working even with the new flag set, and everything you demand must be implemented _is_ implemented in this patch series.
Also please keep in mind that we are _not_ talking about the general userptr support that was merge ~20 years ago. This patch series here is _only_ about the zerocpy userptr support merged with 50ac952d2263 ("[media] videobuf2-dma-sg: Support io userptr operations on io memory") in 2013.
Ok, now it is making more sense. Please update the comments for patch 10/17 to describe the above.
We need some time to test this though, in order to check that no regressions were added (except the ones due to changeset 50ac952d2263).
Why this hack was merged in 2013 when we merged dma-buf almost 2 years before that I have no idea about. Imo that patch simply should never have landed, and instead dma-buf support prioritized.
If I recall correctly, we didn't have any DMABUF support in the media subsystem back in 2013.
It took some time for DMA-BUF to arrive in media, as this was not a top priority. Also, there aren't many developers who understand the memory model well enough to implement DMA-BUF support and touch the VB2 code, which is quite complex, as it supports lots of different I/O methods, plus works with vmalloc, DMA contig and DMA scatter/gather.
Changes there should be carefully tested against different drivers, in order to avoid regressions.
Thanks, Mauro
On Sat, Oct 10, 2020 at 1:39 PM Mauro Carvalho Chehab mchehab+huawei@kernel.org wrote:
Em Sat, 10 Oct 2020 12:53:49 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
Hi Mauro,
You might want to read the patches more carefully, because what you're demanding is what my patches do. Short summary:
- if STRICT_FOLLOW_PFN is set:
a) normal memory is handled as-is (i.e. your example works) through the addition of FOLL_LONGTERM. This is the "pin the pages correctly" approach you're demanding b) for non-page memory (zerocopy sharing before dma-buf was upstreamed is the only use-case for this) it is correctly rejected with -EINVAL
- if you do have blobby userspace which requires the zero-copy using
userptr to work, and doesn't have any of the fallbacks implemented that you describe, this would be a regression. That's why STRICT_FOLLOW_PFN can be unset. And yes there's a real security issue in this usage, Marek also confirmed that the removal of the vma copy code a few years ago essentially broke even the weak assumptions that made the code work 10+ years ago when it was merged.
so tdlr; Everything you described will keep working even with the new flag set, and everything you demand must be implemented _is_ implemented in this patch series.
Also please keep in mind that we are _not_ talking about the general userptr support that was merge ~20 years ago. This patch series here is _only_ about the zerocpy userptr support merged with 50ac952d2263 ("[media] videobuf2-dma-sg: Support io userptr operations on io memory") in 2013.
Ok, now it is making more sense. Please update the comments for patch 10/17 to describe the above.
Will do.
We need some time to test this though, in order to check if no regressions were added (except the ones due to changeset 50ac952d2263).
Yeah testing of the previous patches to switch to FOLL_LONGTERM would be really good. I also need that for habanalabs and ideally exynos too. All the userptr for normal memory should keep working, and with FOLL_LONGTERM it should actually work better, since with that it should now correctly interact with pagecache and fs code, not just with anon memory from malloc.
Thanks, Daniel
Hi Daniel,
On Fri, Oct 9, 2020 at 7:52 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Fri, Oct 9, 2020 at 2:48 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description is that this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were creatd via a special mmap helper from the media subsystem.
Well, no drivers inside the media subsystem uses such flag, although they may rely on some infrastructure that could be using it behind the bars.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
I think a good first step would be to disable userptr of non struct page backed storage going forward for any new hw support. Even on existing drivers. dma-buf sharing has been around for long enough now that this shouldn't be a problem. Unfortunately right now this doesn't seem to exist, so the entire problem keeps getting perpetuated.
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, evey mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:
- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
- add dma-buf export support to fbdev and v4l
I assume you mean V4L2 and not the obsolete V4L that is emulated in the userspace by libv4l. If so, every video device that uses videobuf2 gets DMA-buf export for free and there is nothing needed to enable it.
We probably still have a few legacy drivers using videobuf (non-2), but IMHO those should be safe to put behind some disabled-by-default Kconfig symbol or even completely drop, as the legacy framework has been deprecated for many years already.
- roll this out everywhere we still need it.
Realistically this just isn't going to happen. And anything else just reimplements half of dma-buf, which is kinda pointless (you need minimally refcounting and some way to get at a promise of a permanent sg list for dma. Plus probably the vmap for kernel cpu access.
Please let address the issue on this way, instead of broken an userspace API that it is there since 1991.
It has happened before :( It took 4 years for RDMA to undo the uAPI breakage caused by a security fix for something that was a 15 years old bug.
Yeah we have a bunch of these on the drm side too. Some of them are really just "you have to upgrade userspace", and there's no real fix for the security nightmare without that.
I think we need to phase out such userspace indeed. The Kconfig symbol allows enabling the unsafe functionality for anyone who still needs it, so I think it's not entirely a breakage.
Best regards, Tomasz
Hi Tomasz,
On Sat, Oct 10, 2020 at 07:22:48PM +0200, Tomasz Figa wrote:
On Fri, Oct 9, 2020 at 7:52 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Fri, Oct 9, 2020 at 2:48 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description is that this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were creatd via a special mmap helper from the media subsystem.
Well, no drivers inside the media subsystem uses such flag, although they may rely on some infrastructure that could be using it behind the bars.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
I think a good first step would be to disable userptr of non struct page backed storage going forward for any new hw support. Even on existing drivers. dma-buf sharing has been around for long enough now that this shouldn't be a problem. Unfortunately right now this doesn't seem to exist, so the entire problem keeps getting perpetuated.
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, evey mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:
- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
- add dma-buf export support to fbdev and v4l
I assume you mean V4L2 and not the obsolete V4L that is emulated in the userspace by libv4l. If so, every video device that uses videobuf2 gets DMA-buf export for free and there is nothing needed to enable it.
We probably still have a few legacy drivers using videobuf (non-2), but IMHO those should be safe to put behind some disabled-by-default Kconfig symbol or even completely drop, as the legacy framework has been deprecated for many years already.
There are 8 drivers left, and they support a very large number of devices. I expect unhappy users if distros stop shipping them. On the other hand, videobuf has been deprecated for a loooooooong time, so there has been plenty of time to convert the remaining drivers to videobuf2. If nobody can do it, then we'll have to drop support for these devices given the security issues.
We have moved media drivers to staging in the past when there wasn't enough maintenance effort, we could do so here too.
On Sat, Oct 10, 2020 at 11:36 PM Laurent Pinchart laurent.pinchart@ideasonboard.com wrote:
Hi Tomasz,
On Sat, Oct 10, 2020 at 07:22:48PM +0200, Tomasz Figa wrote:
On Fri, Oct 9, 2020 at 7:52 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Fri, Oct 9, 2020 at 2:48 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description is that this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were creatd via a special mmap helper from the media subsystem.
Well, no drivers inside the media subsystem uses such flag, although they may rely on some infrastructure that could be using it behind the bars.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
I think a good first step would be to disable userptr of non struct page backed storage going forward for any new hw support. Even on existing drivers. dma-buf sharing has been around for long enough now that this shouldn't be a problem. Unfortunately right now this doesn't seem to exist, so the entire problem keeps getting perpetuated.
If this is the case, the proper fix seems to have a GFP_NOT_MOVABLE flag that it would be denying the core mm code to set __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, evey mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:
- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
- add dma-buf export support to fbdev and v4l
I assume you mean V4L2 and not the obsolete V4L that is emulated in the userspace by libv4l. If so, every video device that uses videobuf2 gets DMA-buf export for free and there is nothing needed to enable it.
Yeah. And I missed that v4l2 added dma-buf export too.
We probably still have a few legacy drivers using videobuf (non-2), but IMHO those should be safe to put behind some disabled-by-default Kconfig symbol or even completely drop, as the legacy framework has been deprecated for many years already.
There's 8 drivers left, and they support a very large number of devices. I expect unhappy users distros stop shipping them. On the other hand, videobuf has been deprecated for a loooooooong time, so there has been plenty of time to convert the remaining drivers to videobuf2. If nobody can do it, then we'll have to drop support for these devices given the security issues.
Again, the issue here is _only_ with follow_pfn. For videobuf1 this means videobuf-dma-contig.c userptr support is broken. Unlike videobuf2 it means it's broken for all usage (not just zero-copy userptr), because videobuf-dma-contig.c lacks the pin_user_pages path. But that would be easy to add if this poses a problem I think - we just need to carry over the pin_user_pages_fast logic from videobuf2, no driver changes required. But of course I don't think we should do that before someone reports the regression, since videobuf1 userptr is doubly deprecated :-)
Everything else keeps working with videobuf1 with my patch series. So depending upon which videobuf1 implementations these 8 drivers use, you might not even have any real breakage there.
We have moved media drivers to staging in the past when there wasn't enough maintenance effort, we could do so here too.
I'm not breaking the world with this, it's really a very minimal use-case. At least as far as I understand the entire media subsystem here. -Daniel
Em Sat, 10 Oct 2020 23:50:27 +0200 Daniel Vetter daniel.vetter@ffwll.ch escreveu:
On Sat, Oct 10, 2020 at 11:36 PM Laurent Pinchart laurent.pinchart@ideasonboard.com wrote:
We probably still have a few legacy drivers using videobuf (non-2), but IMHO those should be safe to put behind some disabled-by-default Kconfig symbol or even completely drop, as the legacy framework has been deprecated for many years already.
There's 8 drivers left, and they support a very large number of devices. I expect unhappy users distros stop shipping them. On the other hand, videobuf has been deprecated for a loooooooong time, so there has been plenty of time to convert the remaining drivers to videobuf2. If nobody can do it, then we'll have to drop support for these devices given the security issues.
Again, the issue here is _only_ with follow_pfn. For videobuf1 this means videbuf-dma-contig.c userptr support is broken. Unlike videobuf2 it means it's broken for all usage (not just zero-copy userptr), because videbuf-dma-contig.c lacks the pin_user_pages path.
Well, follow_pfn() is used only by videobuf-dma-contig.c. If this is the only part of VB1 that will have userptr broken, then there's just one driver that might be affected: davinci.
Yet, taking a deeper look:
$ git grep include drivers/media/platform/davinci/|grep -i videobuf
drivers/media/platform/davinci/vpif_capture.h:#include <media/videobuf2-dma-contig.h>
drivers/media/platform/davinci/vpif_display.h:#include <media/videobuf2-dma-contig.h>
It sounds to me that it was already converted to VB2, but some VB1 symbols were not converted at its Kconfig.
It sounds to me that there are other drivers with some VB1 leftovers in Kconfig, as those are the only ones using VB1 these days:
$ for i in $(git grep media/videobuf drivers |grep -v videobuf2 |grep -v v4l2-core|cut -d: -f1); do dirname $i; done|sort|uniq
drivers/media/pci/bt8xx
drivers/media/pci/cx18
drivers/media/platform
drivers/media/usb/tm6000
drivers/media/usb/zr364xx
drivers/staging/media/atomisp/pci
But that would be easy to add if this poses a problem I think - we just need to carry over the pin_user_pages_fast logic from videbuf2, no driver changes required. But of course I don't think we should do that before someone reports the regression, since videobuf1 userptr is doubly deprecated :-)
I think otherwise. Keeping a broken component in the Kernel is a bad idea.
Yet, from my quick search above, it sounds to me that it is time for us to retire the VB1 DMA contig support as a whole, as there's no client for it anymore.
I'll work on some patches cleaning up the VB1 leftovers in Kconfig and removing videobuf-dma-contig.c for good, if there's no hidden dependency on it.
Thanks, Mauro
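For reference, carrying the pin_user_pages_fast() logic over to a videobuf1-style userptr path would look roughly like the sketch below. This is only an illustration of the idea discussed above; the helper name and error handling are made up and not part of any posted patch.

/* Hypothetical helper: pin a userptr range with pin_user_pages_fast()
 * instead of walking ptes with follow_pfn(). Only works for memory
 * that is backed by struct pages. */
static int vb1_pin_user_buffer(unsigned long start, unsigned long size,
			       struct page ***pages_out,
			       unsigned int *nr_pages_out)
{
	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
	struct page **pages;
	int ret;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* FOLL_LONGTERM, since the buffer stays pinned across many ioctls */
	ret = pin_user_pages_fast(start, nr_pages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (ret != nr_pages) {
		if (ret > 0)
			unpin_user_pages(pages, ret);
		kvfree(pages);
		return ret < 0 ? ret : -EFAULT;
	}

	*pages_out = pages;
	*nr_pages_out = nr_pages;
	return 0;
}

The release path would then be unpin_user_pages_dirty_lock() plus kvfree(), like videobuf2 already does.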
On Sun, 11 Oct 2020 08:27:41 +0200, Mauro Carvalho Chehab mchehab+huawei@kernel.org wrote:
On Sat, 10 Oct 2020 23:50:27 +0200, Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Sat, Oct 10, 2020 at 11:36 PM Laurent Pinchart laurent.pinchart@ideasonboard.com wrote:
We probably still have a few legacy drivers using videobuf (non-2), but IMHO those should be safe to put behind some disabled-by-default Kconfig symbol or even completely drop, as the legacy framework has been deprecated for many years already.
There's 8 drivers left, and they support a very large number of devices. I expect unhappy users if distros stop shipping them. On the other hand, videobuf has been deprecated for a loooooooong time, so there has been plenty of time to convert the remaining drivers to videobuf2. If nobody can do it, then we'll have to drop support for these devices given the security issues.
Again, the issue here is _only_ with follow_pfn. For videobuf1 this means videobuf-dma-contig.c userptr support is broken. Unlike videobuf2 it's broken for all usage (not just zero-copy userptr), because videobuf-dma-contig.c lacks the pin_user_pages path.
Well, follow_pfn() is used only by videobuf-dma-contig.c. If this is the only part of VB1 that will have userptr broken, then there's just one driver that might be affected: davinci.
Yet, taking a deeper look:
$ git grep include drivers/media/platform/davinci/ | grep -i videobuf
drivers/media/platform/davinci/vpif_capture.h:#include <media/videobuf2-dma-contig.h>
drivers/media/platform/davinci/vpif_display.h:#include <media/videobuf2-dma-contig.h>
It sounds to me that it was already converted to VB2, but some VB1 symbols were not converted in its Kconfig.
It sounds to me that there are other drivers with some VB1 leftovers in Kconfig, as those are the only ones using VB1 these days:
$ for i in $(git grep media/videobuf drivers | grep -v videobuf2 | grep -v v4l2-core | cut -d: -f1); do dirname $i; done | sort | uniq
drivers/media/pci/bt8xx
drivers/media/pci/cx18
drivers/media/platform
drivers/media/usb/tm6000
drivers/media/usb/zr364xx
drivers/staging/media/atomisp/pci
This is incomplete. There are two drivers that include videobuf indirectly:
include/media/davinci/vpfe_capture.h
include/media/drv-intf/saa7146_vv.h
I double-checked that DaVinci still uses VB1. There are actually two clients for videobuf-dma-contig.c: davinci and fsl-viu.
Those two will be affected, if we don't add pin_user_pages_fast() support to VB1 or convert them to VB2.
But that would be easy to add if this poses a problem I think - we just need to carry over the pin_user_pages_fast logic from videobuf2, no driver changes required. But of course I don't think we should do that before someone reports the regression, since videobuf1 userptr is doubly deprecated :-)
I think otherwise. Keeping a broken component in the kernel is a bad idea.
Yet, from my quick search above, it sounds to me that it is time for us to retire the VB1 DMA contig support as a whole, as there's no client for it anymore.
I'll work on some patches cleaning up the VB1 leftovers in Kconfig and removing videobuf-dma-contig.c for good, if there's no hidden dependency on it.
Thanks, Mauro
Thanks, Mauro
Hi Daniel,
On Fri, Oct 09, 2020 at 07:52:05PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:48 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description, this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
The only way to do something at a VMA level is to have a list of OK VMAs, eg because they were created via a special mmap helper from the media subsystem.
Well, no drivers inside the media subsystem use such a flag, although they may rely on some infrastructure that could be using it behind the scenes.
It doesn't matter, nothing prevents the user from calling media APIs on mmaps it gets from other subsystems.
I think a good first step would be to disable userptr of non struct page backed storage going forward for any new hw support. Even on existing drivers. dma-buf sharing has been around for long enough now that this shouldn't be a problem. Unfortunately right now this doesn't seem to exist, so the entire problem keeps getting perpetuated.
On the V4L2 side, I think we should disable USERPTR for any new driver, period. That's what I've been recommended when reviewing patches for several years already. It's a deprecated API, it should be phased out, which starts by not allowing any new use case.
If this is the case, the proper fix seems to be having a GFP_NOT_MOVABLE flag that would deny the core mm code from setting __GFP_MOVABLE.
We can't tell from the VMA these kinds of details..
It has to go the other direction, every mmap that might be used as a userptr here has to be found and the VMA specially created to allow its use. At least that is a kernel only change, but will need people with the HW to do this work.
I think the only reasonable way to keep this working is:
- add a struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma);
- add dma-buf export support to fbdev and v4l
- roll this out everywhere we still need it.
Realistically this just isn't going to happen. And anything else just reimplements half of dma-buf, which is kinda pointless (you minimally need refcounting and some way to get at a promise of a permanent sg list for dma, plus probably the vmap for kernel cpu access).
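For illustration only, the first bullet above could look something like the sketch below. The vm_ops->get_dma_buf hook is purely hypothetical - no such callback exists today, which is exactly the missing piece.

/* Hypothetical core helper: ask the exporter backing a special mapping
 * for a refcounted dma_buf, so importers never have to touch raw pfns. */
struct dma_buf *vma_tryget_dma_buf(struct vm_area_struct *vma)
{
	/* only special mappings are interesting, normal memory can be pinned */
	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
		return NULL;

	/* ->get_dma_buf is a made-up new vm_operations_struct hook */
	if (!vma->vm_ops || !vma->vm_ops->get_dma_buf)
		return NULL;

	/* returns a reference, caller must call dma_buf_put() when done */
	return vma->vm_ops->get_dma_buf(vma);
}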
Please let's address the issue this way, instead of breaking a userspace API that has been there since 1991.
It has happened before :( It took 4 years for RDMA to undo the uAPI breakage caused by a security fix for what was a 15-year-old bug.
Yeah we have a bunch of these on the drm side too. Some of them are really just "you have to upgrade userspace", and there's no real fix for the security nightmare without that.
Hi Jason,
On 09.10.2020 14:48, Jason Gunthorpe wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description, this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
I'm trying to follow this thread, but I really wonder what you mean by CMA movable mappings? If a buffer has been allocated from CMA and used for DMA, it won't be moved in memory. It will stay at the same physical memory address all the time until freed by the owner. It's just a matter of proper usage count tracking to delay freeing if it is still used somewhere.
Best regards
On Mon, Oct 12, 2020 at 12:47 PM Marek Szyprowski m.szyprowski@samsung.com wrote:
Hi Jason,
On 09.10.2020 14:48, Jason Gunthorpe wrote:
On Fri, Oct 09, 2020 at 02:37:23PM +0200, Mauro Carvalho Chehab wrote:
I'm not a mm/ expert, but, from what I understood from Daniel's patch description, this is unsafe *only if* __GFP_MOVABLE is used.
No, it is unconditionally unsafe. The CMA movable mappings are specific VMAs that will have bad issues here, but there are other types too.
I'm trying to follow this thread, but I really wonder what you mean by CMA movable mappings? If a buffer has been allocated from CMA and used for DMA, it won't be moved in memory. It will stay at the same physical memory address all the time until freed by the owner. It's just a matter of proper usage count tracking to delay freeing if it is still used somewhere.
Yup. The problem is that this usage count tracking doesn't exist. And drivers could at least in theory treat CMA like vram and swap buffers in&out of it, so just refcounting the userspace vma isn't enough. In practice, right now, it might be enough for CMA drivers though (but there's more that's possible here). -Daniel
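As a sketch of the missing usage count tracking (names are made up, nothing here is from a posted patch), a driver exporting a CMA buffer through mmap would need roughly this to keep the backing storage alive for as long as any mapping of it exists:

struct my_cma_buffer {
	struct kref ref;
	/* dma_addr_t paddr, void *vaddr, size, ... */
};

static void my_cma_buffer_release(struct kref *ref)
{
	struct my_cma_buffer *buf = container_of(ref, struct my_cma_buffer, ref);

	/* the actual dma_free_*() of the CMA allocation goes here */
	kfree(buf);
}

/* take/drop a buffer reference per vma, so munmap()/exit() cannot free
 * the pages while some other path still looks at their pfns */
static void my_buf_vm_open(struct vm_area_struct *vma)
{
	struct my_cma_buffer *buf = vma->vm_private_data;

	kref_get(&buf->ref);
}

static void my_buf_vm_close(struct vm_area_struct *vma)
{
	struct my_cma_buffer *buf = vma->vm_private_data;

	kref_put(&buf->ref, my_cma_buffer_release);
}

static const struct vm_operations_struct my_buf_vm_ops = {
	.open  = my_buf_vm_open,
	.close = my_buf_vm_close,
};

Even with that in place, anything that cached a pfn from such a vma would still need to be told when the driver decides to move the buffer, which is what the dma-buf and mmu_notifier discussion above is about.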
Hi Mauro,
On Fri, Oct 9, 2020 at 2:37 PM Mauro Carvalho Chehab mchehab+huawei@kernel.org wrote:
On Fri, 9 Oct 2020 09:21:11 -0300, Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 12:34:21PM +0200, Mauro Carvalho Chehab wrote:
Hi,
On Fri, 9 Oct 2020 09:59:26 +0200, Daniel Vetter daniel.vetter@ffwll.ch wrote:
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed:
- gpu drivers dynamically manage their memory nowadays, invalidating
ptes with unmap_mapping_range when buffers get moved
- contiguous dma allocations have moved from dedicated carveouts to
cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with __GFP_MOVABLE)
- even /dev/mem now invalidates mappings when the kernel requests that
iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region")
Accessing pfns obtained from ptes without holding all the locks is therefore no longer a good idea.
Unfortunately there's some users where this is not fixable (like v4l userptr of iomem mappings) or involves a pile of work (vfio type1 iommu). For now annotate these as unsafe and splat appropriately.
This patch adds an unsafe_follow_pfn, which later patches will then roll out to all appropriate places.
NACK, as this breaks an existing userspace API on media.
It doesn't break it. You get a big warning the thing is broken and it keeps working.
We can't leave such a huge security hole open - it impacts other subsystems, distros need to be able to run in a secure mode.
Well, if distros disable it, then apps will break.
Do we have any information on userspace that actually needs this functionality?
Note that we're _not_ talking here about the complete USERPTR functionality, but rather just the very corner case of carveout memory not backed by struct pages.
Given that the current in-tree ways of reserving carveout memory, such as shared-dma-pool, actually give memory backed by struct pages, do we even have a source of such legacy memory in the kernel today?
I think that given that this is a very niche functionality, we could have it disabled by default for security reasons and if someone _really_ (i.e. there is no replacement) needs it, they probably need to use a custom kernel build anyway for their exotic hardware setup (with PFN-backed carveout memory), so they can enable it.
While I agree that using userptr on media is something that new drivers may not support, as DMABUF is a better way of handling it, changing this for existing ones is a big no, as it may break userspace.
The media community needs to work to fix this, not pretend it is OK to keep going as-is.
Dealing with security issues is the one case where an uABI break might be acceptable.
If you want to NAK it then you need to come up with the work to do something here correctly that will support the old drivers without the kernel taint.
Unfortunately making things uncomfortable for the subsystem is the big hammer the core kernel needs to use to actually get this security work done by those responsible.
I'm not pretending that this is ok. Just pointing that the approach taken is NOT OK.
I'm not a mm/ expert, but, from what I understood from Daniel's patch description, this is unsafe *only if* __GFP_MOVABLE is used.
Well, no drivers inside the media subsystem use such a flag, although they may rely on some infrastructure that could be using it behind the scenes.
If this is the case, the proper fix seems to be having a GFP_NOT_MOVABLE flag that would deny the core mm code from setting __GFP_MOVABLE.
Please let's address the issue this way, instead of breaking a userspace API that has been there since 1991.
Note that USERPTR as a whole generally has been considered deprecated in V4L2 for many years and people have been actively discouraged to use it. And, still, we're just talking here about the very rare corner case, not the whole USERPTR API.
Best regards, Tomasz
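To make the shape of the helper under discussion concrete (the actual patch body is not quoted in this part of the thread, so treat this as an illustrative sketch): unsafe_follow_pfn() is essentially follow_pfn() plus a warning, a taint and a Kconfig-controlled off switch, roughly:

int unsafe_follow_pfn(struct vm_area_struct *vma, unsigned long address,
		      unsigned long *pfn)
{
#ifdef CONFIG_STRICT_FOLLOW_PFN
	pr_info("unsafe follow_pfn usage rejected, see CONFIG_STRICT_FOLLOW_PFN\n");
	return -EINVAL;
#else
	WARN_ONCE(1, "unsafe follow_pfn usage\n");
	add_taint(TAINT_USER, LOCKDEP_STILL_OK);

	return follow_pfn(vma, address, pfn);
#endif
}

With CONFIG_STRICT_FOLLOW_PFN disabled the legacy users keep working and just splat; with it enabled they are rejected outright, which is the distro "secure mode" discussed above.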
The media model assumes that buffers are all preallocated, so that when a media pipeline is running we never miss a deadline because the buffers aren't allocated or available.
This means we cannot fix the v4l follow_pfn usage through mmu_notifier, without breaking how this all works. The only real fix is to deprecate userptr support for VM_IO | VM_PFNMAP mappings and tell everyone to cut over to dma-buf memory sharing for zerocopy.
userptr for normal memory will keep working as-is.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Pawel Osciak pawel@osciak.com Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Tomasz Figa tfiga@chromium.org Cc: Laurent Dufour ldufour@linux.ibm.com Cc: Vlastimil Babka vbabka@suse.cz Cc: Daniel Jordan daniel.m.jordan@oracle.com Cc: Michel Lespinasse walken@google.com --- drivers/media/common/videobuf2/frame_vector.c | 2 +- drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index 2b0b97761d15..a1b85fe9e7c1 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -69,7 +69,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 			break;
 
 		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
-			err = follow_pfn(vma, start, &nums[ret]);
+			err = unsafe_follow_pfn(vma, start, &nums[ret]);
 			if (err) {
 				if (ret == 0)
 					ret = err;
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index 52312ce2ba05..821c4a76ab96 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -183,7 +183,7 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	user_address = untagged_baddr;
 
 	while (pages_done < (mem->size >> PAGE_SHIFT)) {
-		ret = follow_pfn(vma, user_address, &this_pfn);
+		ret = unsafe_follow_pfn(vma, user_address, &this_pfn);
 		if (ret)
 			break;
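The end state this patch steers towards, once zero-copy users have moved over to dma-buf, would be to simply refuse such mappings in the userptr setup path. A minimal sketch of that check (illustrative only, not part of this series):

/* Sketch: userptr buffers must be backed by struct pages; raw pfn
 * mappings (VM_IO | VM_PFNMAP) should be imported as dma-buf instead. */
static int userptr_check_vma(struct vm_area_struct *vma)
{
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return -EINVAL;

	return 0;
}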
On Fri, 9 Oct 2020 09:59:27 +0200, Daniel Vetter daniel.vetter@ffwll.ch wrote:
The media model assumes that buffers are all preallocated, so that when a media pipeline is running we never miss a deadline because the buffers aren't allocated or available.
This means we cannot fix the v4l follow_pfn usage through mmu_notifier, without breaking how this all works. The only real fix is to deprecate userptr support for VM_IO | VM_PFNMAP mappings and tell everyone to cut over to dma-buf memory sharing for zerocopy.
userptr for normal memory will keep working as-is.
I won't repeat here the discussions for patch 09/17, but just to be clear about this one:
NACK.
We need a better alternative to avoid breaking existing media applications.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Pawel Osciak pawel@osciak.com Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Kyungmin Park kyungmin.park@samsung.com Cc: Tomasz Figa tfiga@chromium.org Cc: Laurent Dufour ldufour@linux.ibm.com Cc: Vlastimil Babka vbabka@suse.cz Cc: Daniel Jordan daniel.m.jordan@oracle.com Cc: Michel Lespinasse walken@google.com
drivers/media/common/videobuf2/frame_vector.c | 2 +- drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index 2b0b97761d15..a1b85fe9e7c1 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -69,7 +69,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 			break;
 
 		while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
-			err = follow_pfn(vma, start, &nums[ret]);
+			err = unsafe_follow_pfn(vma, start, &nums[ret]);
 			if (err) {
 				if (ret == 0)
 					ret = err;
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index 52312ce2ba05..821c4a76ab96 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -183,7 +183,7 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	user_address = untagged_baddr;
 
 	while (pages_done < (mem->size >> PAGE_SHIFT)) {
-		ret = follow_pfn(vma, user_address, &this_pfn);
+		ret = unsafe_follow_pfn(vma, user_address, &this_pfn);
 		if (ret)
 			break;
Thanks, Mauro
The code seems to stuff these pfns into iommu pts (or something like that, I didn't follow), but there's no mmu_notifier to ensure that access is synchronized with pte updates.
Hence mark these as unsafe. This means that with CONFIG_STRICT_FOLLOW_PFN, these will be rejected.
Real fix is to wire up an mmu_notifier ... somehow. Probably means any invalidate is a fatal fault for this vfio device, but then this shouldn't ever happen if userspace is reasonable.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Alex Williamson alex.williamson@redhat.com Cc: Cornelia Huck cohuck@redhat.com Cc: kvm@vger.kernel.org --- drivers/vfio/vfio_iommu_type1.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5fbf0c1f7433..a4d53f3d0a35 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -421,7 +421,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 {
 	int ret;
 
-	ret = follow_pfn(vma, vaddr, pfn);
+	ret = unsafe_follow_pfn(vma, vaddr, pfn);
 	if (ret) {
 		bool unlocked = false;
 
@@ -435,7 +435,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 		if (ret)
 			return ret;
 
-		ret = follow_pfn(vma, vaddr, pfn);
+		ret = unsafe_follow_pfn(vma, vaddr, pfn);
 	}
 
 	return ret;
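For reference, the "real fix" mentioned in the commit message would mean registering an mmu_notifier on the owning mm and treating any invalidation of a tracked range as fatal for the device. A bare-bones sketch under that assumption (all names here are made up, none of this is in the series):

#include <linux/mmu_notifier.h>

struct my_vfio_dma {
	struct mmu_notifier mn;
	unsigned long vaddr;	/* userspace va backing the iommu mapping */
	size_t size;
};

static void my_vfio_stop_device(struct my_vfio_dma *dma); /* hypothetical */

static int my_vfio_invalidate_range_start(struct mmu_notifier *mn,
					  const struct mmu_notifier_range *range)
{
	struct my_vfio_dma *dma = container_of(mn, struct my_vfio_dma, mn);

	/* A pte behind one of our iommu mappings is going away, so the
	 * pfn we programmed into the iommu is no longer valid. */
	if (range->start < dma->vaddr + dma->size && range->end > dma->vaddr)
		my_vfio_stop_device(dma);

	return 0;
}

static const struct mmu_notifier_ops my_vfio_mn_ops = {
	.invalidate_range_start = my_vfio_invalidate_range_start,
};

The notifier would get its ops set and be registered with mmu_notifier_register(&dma->mn, mm) when the mapping is created; as the commit message says, well-behaved userspace should never actually hit this path.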
There are three ways to access PCI BARs from userspace: /dev/mem, sysfs files, and the old proc interface. Two of them check against iomem_is_exclusive(), proc never did. And with CONFIG_IO_STRICT_DEVMEM, this starts to matter, since we don't want random userspace having access to PCI BARs while a driver is loaded and using it.
Fix this by adding the same iomem_is_exclusive() check we already have on the sysfs side in pci_mmap_resource().
References: 90a545e98126 ("restrict /dev/mem to idle io memory ranges") Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: linux-pci@vger.kernel.org -- v2: Improve commit message (Bjorn) --- drivers/pci/proc.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index d35186b01d98..3a2f90beb4cb 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -274,6 +274,11 @@ static int proc_bus_pci_mmap(struct file *file, struct vm_area_struct *vma)
 		else
 			return -EINVAL;
 	}
+
+	if (dev->resource[i].flags & IORESOURCE_MEM &&
+	    iomem_is_exclusive(dev->resource[i].start))
+		return -EINVAL;
+
 	ret = pci_mmap_page_range(dev, i, vma,
 				  fpriv->mmap_state, write_combine);
 	if (ret < 0)
When we care about pagecache maintenance, we need to make sure that both f_mapping and i_mapping point at the right mapping.
But for iomem mappings we only care about the virtual/pte side of things, so f_mapping is enough. Also setting inode->i_mapping was confusing me as a driver maintainer, since in e.g. drivers/gpu we don't do that. Per Dan this seems to be copypasta from places which do care about pagecache consistency, but not needed. Hence remove it for slightly less confusion.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org --- drivers/char/mem.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c index abd4ffdc8cde..5502f56f3655 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -864,7 +864,6 @@ static int open_port(struct inode *inode, struct file *filp) * revocations when drivers want to take over a /dev/mem mapped * range. */ - inode->i_mapping = devmem_inode->i_mapping; filp->f_mapping = inode->i_mapping;
return 0;
We want all iomem mmaps to consistently revoke ptes when the kernel takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the pci bar mmaps available through procfs and sysfs, which currently do not revoke mappings.
To prepare for this, move the code from the /dev/kmem driver to kernel/resource.c.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Arnd Bergmann arnd@arndb.de Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: David Hildenbrand david@redhat.com Cc: "Rafael J. Wysocki" rafael.j.wysocki@intel.com --- drivers/char/mem.c | 85 +------------------------------------ include/linux/ioport.h | 6 +-- kernel/resource.c | 95 +++++++++++++++++++++++++++++++++++++++++- 3 files changed, 96 insertions(+), 90 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c index 5502f56f3655..53338aad8d28 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -31,9 +31,6 @@ #include <linux/uio.h> #include <linux/uaccess.h> #include <linux/security.h> -#include <linux/pseudo_fs.h> -#include <uapi/linux/magic.h> -#include <linux/mount.h>
#ifdef CONFIG_IA64 # include <linux/efi.h> @@ -809,42 +806,6 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig) return ret; }
-static struct inode *devmem_inode; - -#ifdef CONFIG_IO_STRICT_DEVMEM -void revoke_devmem(struct resource *res) -{ - /* pairs with smp_store_release() in devmem_init_inode() */ - struct inode *inode = smp_load_acquire(&devmem_inode); - - /* - * Check that the initialization has completed. Losing the race - * is ok because it means drivers are claiming resources before - * the fs_initcall level of init and prevent /dev/mem from - * establishing mappings. - */ - if (!inode) - return; - - /* - * The expectation is that the driver has successfully marked - * the resource busy by this point, so devmem_is_allowed() - * should start returning false, however for performance this - * does not iterate the entire resource range. - */ - if (devmem_is_allowed(PHYS_PFN(res->start)) && - devmem_is_allowed(PHYS_PFN(res->end))) { - /* - * *cringe* iomem=relaxed says "go ahead, what's the - * worst that can happen?" - */ - return; - } - - unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1); -} -#endif - static int open_port(struct inode *inode, struct file *filp) { int rc; @@ -864,7 +825,7 @@ static int open_port(struct inode *inode, struct file *filp) * revocations when drivers want to take over a /dev/mem mapped * range. */ - filp->f_mapping = inode->i_mapping; + filp->f_mapping = iomem_get_mapping();
return 0; } @@ -995,48 +956,6 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
static struct class *mem_class;
-static int devmem_fs_init_fs_context(struct fs_context *fc) -{ - return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM; -} - -static struct file_system_type devmem_fs_type = { - .name = "devmem", - .owner = THIS_MODULE, - .init_fs_context = devmem_fs_init_fs_context, - .kill_sb = kill_anon_super, -}; - -static int devmem_init_inode(void) -{ - static struct vfsmount *devmem_vfs_mount; - static int devmem_fs_cnt; - struct inode *inode; - int rc; - - rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt); - if (rc < 0) { - pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc); - return rc; - } - - inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb); - if (IS_ERR(inode)) { - rc = PTR_ERR(inode); - pr_err("Cannot allocate inode for /dev/mem: %d\n", rc); - simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt); - return rc; - } - - /* - * Publish /dev/mem initialized. - * Pairs with smp_load_acquire() in revoke_devmem(). - */ - smp_store_release(&devmem_inode, inode); - - return 0; -} - static int __init chr_dev_init(void) { int minor; @@ -1058,8 +977,6 @@ static int __init chr_dev_init(void) */ if ((minor == DEVPORT_MINOR) && !arch_has_dev_port()) continue; - if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0) - continue;
device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor), NULL, devlist[minor].name); diff --git a/include/linux/ioport.h b/include/linux/ioport.h index 6c2b06fe8beb..8ffb61b36606 100644 --- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -302,11 +302,7 @@ struct resource *devm_request_free_mem_region(struct device *dev, struct resource *request_free_mem_region(struct resource *base, unsigned long size, const char *name);
-#ifdef CONFIG_IO_STRICT_DEVMEM -void revoke_devmem(struct resource *res); -#else -static inline void revoke_devmem(struct resource *res) { }; -#endif +extern struct address_space *iomem_get_mapping(void);
#endif /* __ASSEMBLY__ */ #endif /* _LINUX_IOPORT_H */ diff --git a/kernel/resource.c b/kernel/resource.c index 841737bbda9e..22153fdec4f5 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -18,12 +18,15 @@ #include <linux/spinlock.h> #include <linux/fs.h> #include <linux/proc_fs.h> +#include <linux/pseudo_fs.h> #include <linux/sched.h> #include <linux/seq_file.h> #include <linux/device.h> #include <linux/pfn.h> #include <linux/mm.h> +#include <linux/mount.h> #include <linux/resource_ext.h> +#include <uapi/linux/magic.h> #include <asm/io.h>
@@ -1112,6 +1115,52 @@ resource_size_t resource_alignment(struct resource *res)
static DECLARE_WAIT_QUEUE_HEAD(muxed_resource_wait);
+static struct inode *iomem_inode; + +#ifdef CONFIG_IO_STRICT_DEVMEM +static void revoke_iomem(struct resource *res) +{ + /* pairs with smp_store_release() in iomem_init_inode() */ + struct inode *inode = smp_load_acquire(&iomem_inode); + + /* + * Check that the initialization has completed. Losing the race + * is ok because it means drivers are claiming resources before + * the fs_initcall level of init and prevent /dev/mem from + * establishing mappings. + */ + if (!inode) + return; + + /* + * The expectation is that the driver has successfully marked + * the resource busy by this point, so devmem_is_allowed() + * should start returning false, however for performance this + * does not iterate the entire resource range. + */ + if (devmem_is_allowed(PHYS_PFN(res->start)) && + devmem_is_allowed(PHYS_PFN(res->end))) { + /* + * *cringe* iomem=relaxed says "go ahead, what's the + * worst that can happen?" + */ + return; + } + + unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1); +} +struct address_space *iomem_get_mapping(void) +{ + return iomem_inode->i_mapping; +} +#else +static void revoke_iomem(struct resource *res) {} +struct address_space *iomem_get_mapping(void) +{ + return NULL; +} +#endif + /** * __request_region - create a new busy resource region * @parent: parent resource descriptor @@ -1179,7 +1228,7 @@ struct resource * __request_region(struct resource *parent, write_unlock(&resource_lock);
if (res && orig_parent == &iomem_resource) - revoke_devmem(res); + revoke_iomem(res);
return res; } @@ -1713,4 +1762,48 @@ static int __init strict_iomem(char *str) return 1; }
+static int iomem_fs_init_fs_context(struct fs_context *fc) +{ + return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM; +} + +static struct file_system_type iomem_fs_type = { + .name = "iomem", + .owner = THIS_MODULE, + .init_fs_context = iomem_fs_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static int __init iomem_init_inode(void) +{ + static struct vfsmount *iomem_vfs_mount; + static int iomem_fs_cnt; + struct inode *inode; + int rc; + + rc = simple_pin_fs(&iomem_fs_type, &iomem_vfs_mount, &iomem_fs_cnt); + if (rc < 0) { + pr_err("Cannot mount iomem pseudo filesystem: %d\n", rc); + return rc; + } + + inode = alloc_anon_inode(iomem_vfs_mount->mnt_sb); + if (IS_ERR(inode)) { + rc = PTR_ERR(inode); + pr_err("Cannot allocate inode for iomem: %d\n", rc); + simple_release_fs(&iomem_vfs_mount, &iomem_fs_cnt); + return rc; + } + + /* + * Publish /dev/mem initialized. + * Pairs with smp_load_acquire() in revoke_iomem(). + */ + smp_store_release(&iomem_inode, inode); + + return 0; +} + +fs_initcall(iomem_init_inode); + __setup("iomem=", strict_iomem);
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
We want all iomem mmaps to consistently revoke ptes when the kernel takes over and CONFIG_IO_STRICT_DEVMEM is enabled. This includes the pci bar mmaps available through procfs and sysfs, which currently do not revoke mappings.
To prepare for this, move the code from the /dev/kmem driver to kernel/resource.c.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Arnd Bergmann arnd@arndb.de Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: David Hildenbrand david@redhat.com Cc: "Rafael J. Wysocki" rafael.j.wysocki@intel.com
Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void)
+{
+	return iomem_inode->i_mapping;
This should pair an acquire with the release below
+	/*
+	 * Publish /dev/mem initialized.
+	 * Pairs with smp_load_acquire() in revoke_iomem().
+	 */
+	smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Jason
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void) +{
return iomem_inode->i_mapping;
This should pair an acquire with the release below
/*
* Publish /dev/mem initialized.
* Pairs with smp_load_acquire() in revoke_iomem().
*/
smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
The reasons for the annotations is that iomem requests can happen fairly early, way before fs_initcalls happen. That means revoke_iomem needs to check for that and bail out if we race - nothing bad can happen since userspace isn't running at this point anyway. And apparently it needs to be a full acquire fence since we don't just write a value, but need a barrier for the struct stuff.
Now iomem_get_mapping otoh can only be called after userspace is up & running, so way after all the fs_initcalls are guaranteed to have fininshed. Hence we don't really need anything there. But I expect the kernel race checker thing to complain, plus that then gives me a good spot to explain why we can't race and don't have to check for a NULL iomem_inode.
I'll add that in v3. -Daniel
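Concretely, the v3 annotation being discussed would just pair the existing release with an acquire on the reader side, something like this (a sketch of the direction, not the final patch):

struct address_space *iomem_get_mapping(void)
{
	/*
	 * Pairs with smp_store_release() in iomem_init_inode(). This is
	 * only reachable from file ops, i.e. well after all fs_initcalls
	 * have completed, so the inode is guaranteed to be published.
	 */
	return smp_load_acquire(&iomem_inode)->i_mapping;
}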
On Fri, Oct 09, 2020 at 04:24:45PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void) +{
return iomem_inode->i_mapping;
This should pair an acquire with the release below
/*
* Publish /dev/mem initialized.
* Pairs with smp_load_acquire() in revoke_iomem().
*/
smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
Oh yikes, so revoke_iomem can run concurrently during early boot, tricky.
The reasons for the annotations is that iomem requests can happen fairly early, way before fs_initcalls happen. That means revoke_iomem needs to check for that and bail out if we race - nothing bad can happen since userspace isn't running at this point anyway. And apparently it needs to be a full acquire fence since we don't just write a value, but need a barrier for the struct stuff.
Yes, if that is what is happening it release/acquire is needed.
Jason
On Fri, Oct 9, 2020 at 7:32 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 04:24:45PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void) +{
return iomem_inode->i_mapping;
This should pair an acquire with the release below
/*
* Publish /dev/mem initialized.
* Pairs with smp_load_acquire() in revoke_iomem().
*/
smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
Oh yikes, so revoke_iomem can run concurrently during early boot, tricky.
It runs early because request_mem_region() can run before fs_initcall. Rather than add an unnecessary lock just arrange for the revoke to be skipped before the inode is initialized. The expectation is that any early resource reservations will block future userspace mapping attempts.
On Fri, Oct 09, 2020 at 11:28:54AM -0700, Dan Williams wrote:
On Fri, Oct 9, 2020 at 7:32 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 04:24:45PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void) +{
return iomem_inode->i_mapping;
This should pair an acquire with the release below
/*
* Publish /dev/mem initialized.
* Pairs with smp_load_acquire() in revoke_iomem().
*/
smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
Oh yikes, so revoke_iomem can run concurrently during early boot, tricky.
It runs early because request_mem_region() can run before fs_initcall. Rather than add an unnecessary lock just arrange for the revoke to be skipped before the inode is initialized. The expectation is that any early resource reservations will block future userspace mapping attempts.
Actually, on this point a simple WRITE_ONCE/READ_ONCE pairing is OK, Paul once explained that the pointer chase on the READ_ONCE side is required to be like an acquire - this is why rcu_dereference is just READ_ONCE
Jason
On Thu, Oct 15, 2020 at 2:09 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 11:28:54AM -0700, Dan Williams wrote:
On Fri, Oct 9, 2020 at 7:32 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 04:24:45PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void) +{
return iomem_inode->i_mapping;
This should pair an acquire with the release below
/*
* Publish /dev/mem initialized.
* Pairs with smp_load_acquire() in revoke_iomem().
*/
smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
Oh yikes, so revoke_iomem can run concurrently during early boot, tricky.
It runs early because request_mem_region() can run before fs_initcall. Rather than add an unnecessary lock just arrange for the revoke to be skipped before the inode is initialized. The expectation is that any early resource reservations will block future userspace mapping attempts.
Actually, on this point a simple WRITE_ONCE/READ_ONCE pairing is OK, Paul once explained that the pointer chase on the READ_ONCE side is required to be like an acquire - this is why rcu_dereference is just READ_ONCE
Indeed this changed with the smp_read_barrier_depends() removal a year ago. Before that READ_ONCE and rcu_dereference were not actually the same. I guess I'll throw a patch on top to switch that over too. -Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Oct 15, 2020 at 9:52 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Thu, Oct 15, 2020 at 2:09 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 11:28:54AM -0700, Dan Williams wrote:
On Fri, Oct 9, 2020 at 7:32 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 04:24:45PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
> +struct address_space *iomem_get_mapping(void) > +{ > + return iomem_inode->i_mapping;
This should pair an acquire with the release below
> + /* > + * Publish /dev/mem initialized. > + * Pairs with smp_load_acquire() in revoke_iomem(). > + */ > + smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
Oh yikes, so revoke_iomem can run concurrently during early boot, tricky.
It runs early because request_mem_region() can run before fs_initcall. Rather than add an unnecessary lock just arrange for the revoke to be skipped before the inode is initialized. The expectation is that any early resource reservations will block future userspace mapping attempts.
Actually, on this point a simple WRITE_ONCE/READ_ONCE pairing is OK, Paul once explained that the pointer chase on the READ_ONCE side is required to be like an acquire - this is why rcu_dereference is just READ_ONCE
Indeed this changed with the smp_read_barrier_depends() removal a year ago. Before that READ_ONCE and rcu_dereference were not actually the same. I guess I'll throw a patch on top to switch that over too.
Actually, 2019 only landed the clean-up; the READ_ONCE change landed in 2017 already:
commit 76ebbe78f7390aee075a7f3768af197ded1bdfbb
Author: Will Deacon will@kernel.org
Date:   Tue Oct 24 11:22:47 2017 +0100

    locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
Cheers, Daniel
On Thu, Oct 15, 2020 at 2:09 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 11:28:54AM -0700, Dan Williams wrote:
On Fri, Oct 9, 2020 at 7:32 AM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 04:24:45PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 2:31 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Fri, Oct 09, 2020 at 09:59:31AM +0200, Daniel Vetter wrote:
+struct address_space *iomem_get_mapping(void) +{
return iomem_inode->i_mapping;
This should pair an acquire with the release below
/*
* Publish /dev/mem initialized.
* Pairs with smp_load_acquire() in revoke_iomem().
*/
smp_store_release(&iomem_inode, inode);
However, this seems abnormal, initcalls rarely do this kind of stuff with global data..
The kernel crashes if this fs_initcall is raced with iomem_get_mapping() due to the unconditional dereference, so I think it can be safely switched to a simple assignment.
Ah yes I checked this all, but forgot to correctly annotate the iomem_get_mapping access. For reference, see b34e7e298d7a ("/dev/mem: Add missing memory barriers for devmem_inode").
Oh yikes, so revoke_iomem can run concurrently during early boot, tricky.
It runs early because request_mem_region() can run before fs_initcall. Rather than add an unnecessary lock just arrange for the revoke to be skipped before the inode is initialized. The expectation is that any early resource reservations will block future userspace mapping attempts.
Actually, on this point a simple WRITE_ONCE/READ_ONCE pairing is OK, Paul once explained that the pointer chase on the READ_ONCE side is required to be like an acquire - this is why rcu_dereference is just READ_ONCE
Hm so WRITE_ONCE doesn't have any barriers, and we'd need that for updating the pointer. That would leave things rather inconsistent, so I think I'll just leave it as-is for symmetry reasons. None of this code matters for performance anyway, so micro-optimizing barriers seems a bit silly. -Daniel
We want to be able to revoke pci mmaps so that the same access rules applies as for /dev/kmem. Revoke support for devmem was added in 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region").
The simplest way to achieve this is by having the same filp->f_mapping for all mappings, so that unmap_mapping_range can find them all, no matter through which file they've been created. Since this must be set at open time we need sysfs support for this.
Add an optional mapping parameter to bin_attr, which is only consulted when there's also an mmap callback, since without mmap support allowing to adjust the ->f_mapping makes no sense.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: linux-pci@vger.kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Rafael J. Wysocki" rafael@kernel.org Cc: Christian Brauner christian.brauner@ubuntu.com Cc: "David S. Miller" davem@davemloft.net Cc: Michael Ellerman mpe@ellerman.id.au Cc: Sourabh Jain sourabhjain@linux.ibm.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Mauro Carvalho Chehab mchehab+huawei@kernel.org Cc: Nayna Jain nayna@linux.ibm.com --- fs/sysfs/file.c | 11 +++++++++++ include/linux/sysfs.h | 2 ++ 2 files changed, 13 insertions(+)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c index eb6897ab78e7..9d8ccdb000e3 100644 --- a/fs/sysfs/file.c +++ b/fs/sysfs/file.c @@ -169,6 +169,16 @@ static int sysfs_kf_bin_mmap(struct kernfs_open_file *of, return battr->mmap(of->file, kobj, battr, vma); }
+static int sysfs_kf_bin_open(struct kernfs_open_file *of) +{ + struct bin_attribute *battr = of->kn->priv; + + if (battr->mapping) + of->file->f_mapping = battr->mapping; + + return 0; +} + void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr) { struct kernfs_node *kn = kobj->sd, *tmp; @@ -240,6 +250,7 @@ static const struct kernfs_ops sysfs_bin_kfops_mmap = { .read = sysfs_kf_bin_read, .write = sysfs_kf_bin_write, .mmap = sysfs_kf_bin_mmap, + .open = sysfs_kf_bin_open, };
int sysfs_add_file_mode_ns(struct kernfs_node *parent, diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h index 34e84122f635..a17a474d1601 100644 --- a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -164,11 +164,13 @@ __ATTRIBUTE_GROUPS(_name)
struct file; struct vm_area_struct; +struct address_space;
struct bin_attribute { struct attribute attr; size_t size; void *private; + struct address_space *mapping; ssize_t (*read)(struct file *, struct kobject *, struct bin_attribute *, char *, loff_t, size_t); ssize_t (*write)(struct file *, struct kobject *, struct bin_attribute *,
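A minimal usage sketch from a driver's point of view, assuming the iomem_get_mapping() helper introduced earlier in this series; the attribute and mmap handler here are made up:

static int my_bar_mmap(struct file *file, struct kobject *kobj,
		       struct bin_attribute *attr, struct vm_area_struct *vma)
{
	/* io_remap_pfn_range() of the BAR would go here */
	return -ENOSYS;
}

static struct bin_attribute my_bar_attr = {
	.attr = { .name = "my_bar", .mode = 0600 },
	.size = 4096,
	.mmap = my_bar_mmap,
};

static int my_create_bar_file(struct kobject *kobj)
{
	/* Link every mmap of this attribute into the shared iomem
	 * address_space, so unmap_mapping_range() can revoke them. */
	my_bar_attr.mapping = iomem_get_mapping();

	return sysfs_create_bin_file(kobj, &my_bar_attr);
}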
On Fri, Oct 09, 2020 at 09:59:32AM +0200, Daniel Vetter wrote:
We want to be able to revoke pci mmaps so that the same access rules applies as for /dev/kmem. Revoke support for devmem was added in 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region").
The simplest way to achieve this is by having the same filp->f_mapping for all mappings, so that unmap_mapping_range can find them all, no matter through which file they've been created. Since this must be set at open time we need sysfs support for this.
Add an optional mapping parameter to bin_attr, which is only consulted when there's also an mmap callback, since without mmap support allowing to adjust the ->f_mapping makes no sense.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: linux-pci@vger.kernel.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Rafael J. Wysocki" rafael@kernel.org Cc: Christian Brauner christian.brauner@ubuntu.com Cc: "David S. Miller" davem@davemloft.net Cc: Michael Ellerman mpe@ellerman.id.au Cc: Sourabh Jain sourabhjain@linux.ibm.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Mauro Carvalho Chehab mchehab+huawei@kernel.org Cc: Nayna Jain nayna@linux.ibm.com
fs/sysfs/file.c | 11 +++++++++++ include/linux/sysfs.h | 2 ++ 2 files changed, 13 insertions(+)
Reviewed-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Since 3234ac664a87 ("/dev/mem: Revoke mappings when a driver claims the region") /dev/mem zaps ptes when the kernel requests exclusive access to an iomem region. And with CONFIG_IO_STRICT_DEVMEM, this is the default for all driver uses.
Except there are two more ways to access PCI BARs: sysfs and proc mmap support. Let's plug that hole.
For revoke_devmem() to work we need to link our vma into the same address_space, with consistent vma->vm_pgoff. ->pgoff is already adjusted, because that's how (io_)remap_pfn_range works, but for the mapping we need to adjust vma->vm_file->f_mapping. The cleanest way is to adjust this at ->open time:
- for sysfs this is easy, now that binary attributes support this. We just set bin_attr->mapping when mmap is supported
- for procfs it's a bit more tricky, since procfs pci access has only one file per device, and access to a specific resource first needs to be set up with some ioctl calls. But mmap is only supported for the same resources as sysfs exposes with mmap support, and otherwise rejected, so we can set the mapping unconditionally at open time without harm.
A special consideration is for arch_can_pci_mmap_io() - we need to make sure that the ->f_mapping doesn't alias between ioport and iomem space. There's only 2 ways in-tree to support mmap of ioports: generic pci mmap (ARCH_GENERIC_PCI_MMAP_RESOURCE), and sparc as the single architecture hand-rolling. Both approaches support ioport mmap through a special pfn range and not through magic pte attributes. Aliasing is therefore not a problem.
The only difference in access checks left is that sysfs PCI mmap does not check for CAP_RAWIO. I'm not really sure whether that should be added or not.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: linux-pci@vger.kernel.org -- v2: - Totally new approach: Adjust filp->f_mapping at open time. Note that this now works on all architectures, not just those support ARCH_GENERIC_PCI_MMAP_RESOURCE --- drivers/pci/pci-sysfs.c | 4 ++++ drivers/pci/proc.c | 1 + 2 files changed, 5 insertions(+)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 6d78df981d41..cee38fcb4a86 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -928,6 +928,7 @@ void pci_create_legacy_files(struct pci_bus *b)
 	b->legacy_io->read = pci_read_legacy_io;
 	b->legacy_io->write = pci_write_legacy_io;
 	b->legacy_io->mmap = pci_mmap_legacy_io;
+	b->legacy_io->mapping = iomem_get_mapping();
 	pci_adjust_legacy_attr(b, pci_mmap_io);
 	error = device_create_bin_file(&b->dev, b->legacy_io);
 	if (error)
@@ -940,6 +941,7 @@ void pci_create_legacy_files(struct pci_bus *b)
 	b->legacy_mem->size = 1024*1024;
 	b->legacy_mem->attr.mode = 0600;
 	b->legacy_mem->mmap = pci_mmap_legacy_mem;
+	b->legacy_io->mapping = iomem_get_mapping();
 	pci_adjust_legacy_attr(b, pci_mmap_mem);
 	error = device_create_bin_file(&b->dev, b->legacy_mem);
 	if (error)
@@ -1155,6 +1157,8 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
 			res_attr->mmap = pci_mmap_resource_uc;
 		}
 	}
+	if (res_attr->mmap)
+		res_attr->mapping = iomem_get_mapping();
 	res_attr->attr.name = res_attr_name;
 	res_attr->attr.mode = 0600;
 	res_attr->size = pci_resource_len(pdev, num);
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index 3a2f90beb4cb..9bab07302bbf 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -298,6 +298,7 @@ static int proc_bus_pci_open(struct inode *inode, struct file *file)
 	fpriv->write_combine = 0;
 
 	file->private_data = fpriv;
+	file->f_mapping = iomem_get_mapping();
return 0; }
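On the CAP_RAWIO question raised in the commit message: if that check is wanted on the sysfs side too, it would amount to mirroring what proc_bus_pci_mmap() already does, roughly this near the top of pci_mmap_resource() (sketch only, not part of this series):

	/* procfs requires CAP_SYS_RAWIO for BAR mmap; do the same here */
	if (!capable(CAP_SYS_RAWIO))
		return -EPERM;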
When trying to test my CONFIG_IO_STRICT_DEVMEM changes I realized they do nothing for i915, because i915 doesn't request any regions, like pretty much all drm pci drivers. I guess these are some very old remnants from the userspace modesetting days, when we wanted to co-exist with the fbdev driver, which usually requested these resources.
But it makes me wonder why the pci subsystem doesn't just request the resource automatically when we map a bar and a pci driver is bound?
Knowledge about which pci bars we need is kludged together from intel_uncore.c and intel_gtt.c in i915 and intel-gtt.c over in the fake agp driver.
Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kees Cook keescook@chromium.org Cc: Dan Williams dan.j.williams@intel.com Cc: Andrew Morton akpm@linux-foundation.org Cc: John Hubbard jhubbard@nvidia.com Cc: Jérôme Glisse jglisse@redhat.com Cc: Jan Kara jack@suse.cz Cc: Dan Williams dan.j.williams@intel.com Cc: linux-mm@kvack.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-samsung-soc@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: linux-pci@vger.kernel.org --- drivers/gpu/drm/i915/intel_uncore.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 54e201fdeba4..ce39049d8919 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1692,10 +1692,13 @@ static int uncore_mmio_setup(struct intel_uncore *uncore)
 	struct pci_dev *pdev = i915->drm.pdev;
 	int mmio_bar;
 	int mmio_size;
+	int bar_selection;
+	int ret;
 
 	mmio_bar = IS_GEN(i915, 2) ? 1 : 0;
+	bar_selection = BIT (2) | BIT(mmio_bar);
 	/*
-	 * Before gen4, the registers and the GTT are behind different BARs.
+	 * On gen3 the registers and the GTT are behind different BARs.
 	 * However, from gen4 onwards, the registers and the GTT are shared
 	 * in the same BAR, so we want to restrict this ioremap from
 	 * clobbering the GTT which we want ioremap_wc instead. Fortunately,
@@ -1703,6 +1706,8 @@ static int uncore_mmio_setup(struct intel_uncore *uncore)
 	 * generations up to Ironlake.
 	 * For dgfx chips register range is expanded to 4MB.
 	 */
+	if (INTEL_GEN(i915) == 3)
+		bar_selection |= BIT(3);
 	if (INTEL_GEN(i915) < 5)
 		mmio_size = 512 * 1024;
 	else if (IS_DGFX(i915))
@@ -1710,8 +1715,15 @@ static int uncore_mmio_setup(struct intel_uncore *uncore)
 	else
 		mmio_size = 2 * 1024 * 1024;
 
+	ret = pci_request_selected_regions(pdev, bar_selection, "i915");
+	if (ret < 0) {
+		drm_err(&i915->drm, "failed to request pci bars\n");
+		return ret;
+	}
+
 	uncore->regs = pci_iomap(pdev, mmio_bar, mmio_size);
 	if (uncore->regs == NULL) {
+		pci_release_selected_regions(pdev, bar_selection);
 		drm_err(&i915->drm, "failed to map registers\n");
 		return -EIO;
 	}
@@ -1721,9 +1733,18 @@ static int uncore_mmio_setup(struct intel_uncore *uncore)
 
 static void uncore_mmio_cleanup(struct intel_uncore *uncore)
 {
-	struct pci_dev *pdev = uncore->i915->drm.pdev;
+	struct drm_i915_private *i915 = uncore->i915;
+	struct pci_dev *pdev = i915->drm.pdev;
+	int mmio_bar;
+	int bar_selection;
+
+	mmio_bar = IS_GEN(i915, 2) ? 1 : 0;
+	bar_selection = BIT (2) | BIT(mmio_bar);
+	if (INTEL_GEN(i915) == 3)
+		bar_selection |= BIT(3);
 
 	pci_iounmap(pdev, uncore->regs);
+	pci_release_selected_regions(pdev, bar_selection);
 }
 
 void intel_uncore_init_early(struct intel_uncore *uncore,
On Fri, Oct 09, 2020 at 09:59:34AM +0200, Daniel Vetter wrote:
int bar_selection;
Signed bitmasks always make me uneasy. But looks like that's what it is in the pci api. So meh.
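For reference, the relevant prototypes do take the selection as a plain int bitmask (from include/linux/pci.h; parameter names added here for clarity):

int pci_request_selected_regions(struct pci_dev *pdev, int bars, const char *name);
void pci_release_selected_regions(struct pci_dev *pdev, int bars);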
int ret;
mmio_bar = IS_GEN(i915, 2) ? 1 : 0;
bar_selection = BIT (2) | BIT(mmio_bar);
^ spurious space
That's also not correct for gen2 I think.
gen2: 0 = GMADR 1 = MMADR 2 = IOBAR
gen3: 0 = MMADR 1 = IOBAR 2 = GMADR 3 = GTTADR
gen4+: 0+1 = GTTMMADR 2+3 = GMADR 4 = IOBAR
Maybe we should just have an explicit list of bars like that in a comment?
I'd also suggest sucking this bitmask calculation into a small helper so you can reuse it for the release.
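Something like the sketch below, perhaps (the helper name is made up, and the mask calculation is simply lifted from the patch, so the gen2 question above still stands):

/*
 * BARs we claim for register/GTT access, per generation (see the list above):
 *   gen2:  0 = GMADR,      1 = MMADR,  2 = IOBAR
 *   gen3:  0 = MMADR,      1 = IOBAR,  2 = GMADR, 3 = GTTADR
 *   gen4+: 0+1 = GTTMMADR, 2+3 = GMADR, 4 = IOBAR
 */
static int intel_uncore_bar_mask(struct drm_i915_private *i915)
{
	int mmio_bar = IS_GEN(i915, 2) ? 1 : 0;
	int bars = BIT(2) | BIT(mmio_bar);

	if (INTEL_GEN(i915) == 3)
		bars |= BIT(3);

	return bars;
}

uncore_mmio_setup() and uncore_mmio_cleanup() could then both call this instead of duplicating the calculation.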
On Fri, Oct 9, 2020 at 11:47 AM Ville Syrjälä ville.syrjala@linux.intel.com wrote:
On Fri, Oct 09, 2020 at 09:59:34AM +0200, Daniel Vetter wrote:
int bar_selection;
Signed bitmasks always make me uneasy. But looks like that's what it is in the pci api. So meh.
Yeah it's surprising.
int ret; mmio_bar = IS_GEN(i915, 2) ? 1 : 0;
bar_selection = BIT (2) | BIT(mmio_bar);
^
spurious space
That's also not correct for gen2 I think.
gen2: 0 = GMADR 1 = MMADR 2 = IOBAR
gen3: 0 = MMADR 1 = IOBAR 2 = GMADR 3 = GTTADR
gen4+: 0+1 = GTTMMADR 2+3 = GMADR 4 = IOBAR
Maybe we should just have an explicit list of bars like that in a comment?
I'd also suggest sucking this bitmask calculation into a small helper so you can reuse it for the release.
tbh I just hacked this up for testing. Given how almost no other drm driver does this, I'm wondering whether we should or not.
Also the only reason why I didn't just use the pci_request_regions helper is to avoid the vga ioport range, since that's managed by vgaarbiter.
So I think if we go for this for real we should:
- register the vga ioport range in the vgaarbiter
- have a pci_request_iomem_regions helper that grabs all mem bars (rough sketch below)
- roll that out to all drm pci drivers
Or something like that. The other complication is when we resize the iobar. So not really sure what to do here. -Daniel
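To make the second item concrete, here is a rough sketch of what such a helper could look like if it lived next to pci_request_selected_regions() in the PCI core (name and exact semantics are made up here; if memory serves, pci_select_bars(pdev, IORESOURCE_MEM) and pci_request_mem_regions() already get fairly close to this):

/*
 * Sketch: request all memory BARs of a device and leave IO BARs alone.
 */
static inline int pci_request_iomem_regions(struct pci_dev *pdev,
					    const char *name)
{
	int bars = 0;
	int i;

	for (i = 0; i < PCI_STD_NUM_BARS; i++)
		if (pci_resource_flags(pdev, i) & IORESOURCE_MEM)
			bars |= BIT(i);

	return pci_request_selected_regions(pdev, bars, name);
}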
On Fri, Oct 09, 2020 at 12:01:39PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 11:47 AM Ville Syrjälä ville.syrjala@linux.intel.com wrote:
On Fri, Oct 09, 2020 at 09:59:34AM +0200, Daniel Vetter wrote:
tbh I just hacked this up for testing. Given how almost no other drm driver does this, I'm wondering whether we should or not.
Also the only reason why I didn't just use the pci_request_regions helper is to avoid the vga ioport range, since that's managed by vgaarbiter.
VGA io range isn't part of any bar. Or do you mean just the io decode enable bit in the pci command register? That should be just a matter of pci_enable_device() vs. pci_enable_device_mem() I think. So nothing to do with which bars we've requested IIRC.
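I.e. in the probe path the difference is just which decode enable bits get set, roughly like this (sketch only, the probe function is hypothetical):

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/*
	 * pci_enable_device_mem() only turns on memory space decode in
	 * PCI_COMMAND; the io enable bit stays off, independent of which
	 * BARs were requested.
	 */
	int ret = pci_enable_device_mem(pdev);

	if (ret)
		return ret;

	/* ... rest of probe ... */
	return 0;
}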
So I think if we go for this for real we should:
- register the vga ioport range in the vgaarbiter
- have a pci_request_iomem_regions helper that grabs all mem bars
- roll that out to all drm pci drivers
Or something like that. The other complication is when we resize the iobar. So not really sure what to do here.
We resize it?
On Fri, Oct 9, 2020 at 12:42 PM Ville Syrjälä ville.syrjala@linux.intel.com wrote:
On Fri, Oct 09, 2020 at 12:01:39PM +0200, Daniel Vetter wrote:
On Fri, Oct 9, 2020 at 11:47 AM Ville Syrjälä ville.syrjala@linux.intel.com wrote:
On Fri, Oct 09, 2020 at 09:59:34AM +0200, Daniel Vetter wrote:
tbh I just hacked this up for testing. Given how almost no other drm driver does this, I'm wondering whether we should or not.
Also the only reason why I didn't just use the pci_request_regions helper is to avoid the vga ioport range, since that's managed by vgaarbiter.
VGA io range isn't part of any bar. Or do you mean just the io decode enable bit in the pci command register? That should be just a matter of pci_enable_device() vs. pci_enable_device_mem() I think. So nothing to do with which bars we've requested IIRC.
So I think if we go for this for real we should:
- register the vga ioport range in the vgaarbiter
- have a pci_request_iomem_regions helper that grabs all mem bars
- roll that out to all drm pci drivers
Or something like that. The other complication is when we resize the iobar. So not really sure what to do here.
We resize it?
By default I thought firmware puts everything (well, squeezes it all) into the lower 32bit range. Maybe they stopped doing that. So when we want the full bar (for discrete at least) we need to resize it and put it somewhere in the 64bit range above the end of system memory.

So I guess the correct phrasing is "we will resize it" :-) -Daniel
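For context, resizing a BAR from a driver roughly follows the pattern below (a sketch modeled on the resizable-BAR helpers in the PCI core; the BAR index 2 and the function name are assumptions made for illustration). Sizes are passed in the rebar encoding, log2(bytes) - 20, so 8 == 256MB and 15 == 32GB:

static int example_resize_lmem_bar(struct pci_dev *pdev, int rebar_size)
{
	int ret;

	/* The BAR must not be mapped or claimed while it is being moved. */
	pci_release_resource(pdev, 2);

	ret = pci_resize_resource(pdev, 2, rebar_size);
	if (ret)
		return ret;

	/* Let the PCI core find a new home for the (now larger) BAR. */
	pci_assign_unassigned_bus_resources(pdev->bus);

	return 0;
}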