Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
index 8e28270..1b031eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
@@ -851,7 +851,7 @@ static int gmc_v6_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		dev_warn(adev->dev, "amdgpu: No coherent DMA available.\n");
 	}
-
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);
 	r = gmc_v6_0_init_microcode(adev);
 	if (r) {
 		dev_err(adev->dev, "Failed to load mc firmware!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
index 86e9d682..0a4b2cc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
@@ -999,6 +999,7 @@ static int gmc_v7_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		pr_warn("amdgpu: No coherent DMA available\n");
 	}
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);

 	r = gmc_v7_0_init_microcode(adev);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 9a813d8..b171529 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1096,6 +1096,7 @@ static int gmc_v8_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		pr_warn("amdgpu: No coherent DMA available\n");
 	}
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);

 	r = gmc_v8_0_init_microcode(adev);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 3b7e7af..36e658ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -855,6 +855,7 @@ static int gmc_v9_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		printk(KERN_WARNING "amdgpu: No coherent DMA available.\n");
 	}
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);

 	r = gmc_v9_0_mc_init(adev);
 	if (r)
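For context, the fragile pattern the commit message describes looks roughly like the sketch below; map_pages_fragile() is a hypothetical helper for illustration, not the actual amdgpu code:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int map_pages_fragile(struct device *dev, struct page **pages,
			     unsigned int n_pages, struct sg_table *sgt)
{
	int r;

	/* One scatterlist entry per physically contiguous run of pages. */
	r = sg_alloc_table_from_pages(sgt, pages, n_pages, 0,
				      (unsigned long)n_pages << PAGE_SHIFT,
				      GFP_KERNEL);
	if (r)
		return r;

	r = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	/*
	 * BUG: with an IOMMU, dma_map_sg() may merge entries and return
	 * fewer than sgt->nents, so this rejects perfectly valid mappings
	 * (cleanup elided for brevity).
	 */
	if (r != sgt->nents)
		return -EIO;
	return 0;
}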
On Tue, Apr 10, 2018 at 04:59:55PM -0400, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Sinan, I guess Christian's suggestion is to add it in the amdgpu_device_init() function, like here:
8<-------
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0e798b3..9b96771 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2339,6 +2339,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	/* init the mode config */
 	drm_mode_config_init(adev->ddev);

+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);
+
 	r = amdgpu_device_ip_init(adev);
 	if (r) {
 		/* failed in exclusive mode due to timeout */
8<-------
After that, we don't need to do it in each generation.
Thanks, Ray
On 11.04.2018 at 08:26, Huang Rui wrote:
On Tue, Apr 10, 2018 at 04:59:55PM -0400, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Sinan, I guess Christian's suggestion is to add it in the amdgpu_device_init() function, like here:
Yes, exactly. Looks like Sinan just tried to place the call next to pci_set_dma_mask()/pci_set_consistent_dma_mask().
Not necessarily a bad idea, but in this case not optimal.
Sinan, please refine your patch as Rui suggested and resend.
Thanks, Christian.
8<-------
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0e798b3..9b96771 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2339,6 +2339,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	/* init the mode config */
 	drm_mode_config_init(adev->ddev);

+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);
+
 	r = amdgpu_device_ip_init(adev);
 	if (r) {
 		/* failed in exclusive mode due to timeout */
8<-------
After that, we don't need to do it in each generation.
Thanks, Ray
On 10/04/18 21:59, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
So why not fix said code? It's clearly not a real hardware limitation, and the map_sg() APIs have potentially returned fewer than nents since forever, so there's really no excuse.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Disagree; this is a dodgy hack, since you'll now end up passing scatterlists into dma_map_sg() which already violate max_seg_size to begin with, and I think a conscientious DMA API implementation would be within its rights to fail the mapping for that reason (I know arm64 happens not to, but that was a deliberate design decision to make my life easier at the time).
As a short-term fix, at least do something like what i915 does and constrain the table allocation to the desired segment size as well, so things remain self-consistent. But still never claim that faking a hardware constraint as a workaround for a driver shortcoming is "the right thing to do" ;)
Robin.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 1 +
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
index 8e28270..1b031eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
@@ -851,7 +851,7 @@ static int gmc_v6_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		dev_warn(adev->dev, "amdgpu: No coherent DMA available.\n");
 	}
-
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);
 	r = gmc_v6_0_init_microcode(adev);
 	if (r) {
 		dev_err(adev->dev, "Failed to load mc firmware!\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
index 86e9d682..0a4b2cc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
@@ -999,6 +999,7 @@ static int gmc_v7_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		pr_warn("amdgpu: No coherent DMA available\n");
 	}
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);

 	r = gmc_v7_0_init_microcode(adev);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index 9a813d8..b171529 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1096,6 +1096,7 @@ static int gmc_v8_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		pr_warn("amdgpu: No coherent DMA available\n");
 	}
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);

 	r = gmc_v8_0_init_microcode(adev);
 	if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 3b7e7af..36e658ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -855,6 +855,7 @@ static int gmc_v9_0_sw_init(void *handle)
 		pci_set_consistent_dma_mask(adev->pdev, DMA_BIT_MASK(32));
 		printk(KERN_WARNING "amdgpu: No coherent DMA available.\n");
 	}
+	dma_set_max_seg_size(adev->dev, PAGE_SIZE);

 	r = gmc_v9_0_mc_init(adev);
 	if (r)
On 4/11/2018 8:03 AM, Robin Murphy wrote:
On 10/04/18 21:59, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
So why not fix said code? It's clearly not a real hardware limitation, and the map_sg() APIs have potentially returned fewer than nents since forever, so there's really no excuse.
Sure, I'll take a better fix if there is one.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Disagree; this is a dodgy hack, since you'll now end up passing scatterlists into dma_map_sg() which already violate max_seg_size to begin with, and I think a conscientious DMA API implementation would be within its rights to fail the mapping for that reason (I know arm64 happens not to, but that was a deliberate design decision to make my life easier at the time).
As a short-term fix, at least do something like what i915 does and constrain the table allocation to the desired segment size as well, so things remain self-consistent. But still never claim that faking a hardware constraint as a workaround for a driver shortcoming is "the right thing to do" ;)
You are asking for something like this from here, right?
https://elixir.bootlin.com/linux/v4.16.1/source/drivers/gpu/drm/i915/i915_ge...
ret = sg_alloc_table(st, obj->mm.pages->nents, GFP_KERNEL);
if (ret)
	goto err_free;

src = obj->mm.pages->sgl;
dst = st->sgl;
for (i = 0; i < obj->mm.pages->nents; i++) {
	sg_set_page(dst, sg_page(src), src->length, 0);
	dst = sg_next(dst);
	src = sg_next(src);
}
This seems to allocate the scatter gather list and fill it in manually before passing it to dma_map_sg(). I'll give it a try.
Just double checking.
Robin.
Signed-off-by: Sinan Kaya okaya@codeaurora.org
On 11/04/18 15:33, Sinan Kaya wrote:
On 4/11/2018 8:03 AM, Robin Murphy wrote:
On 10/04/18 21:59, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
So why not fix said code? It's clearly not a real hardware limitation, and the map_sg() APIs have potentially returned fewer than nents since forever, so there's really no excuse.
Sure, I'll take a better fix if there is one.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Disagree; this is a dodgy hack, since you'll now end up passing scatterlists into dma_map_sg() which already violate max_seg_size to begin with, and I think a conscientious DMA API implementation would be within its rights to fail the mapping for that reason (I know arm64 happens not to, but that was a deliberate design decision to make my life easier at the time).
As a short-term fix, at least do something like what i915 does and constrain the table allocation to the desired segment size as well, so things remain self-consistent. But still never claim that faking a hardware constraint as a workaround for a driver shortcoming is "the right thing to do" ;)
You are asking for something like this from here, right?
https://elixir.bootlin.com/linux/v4.16.1/source/drivers/gpu/drm/i915/i915_ge...
I just looked for callers of __sg_alloc_table_from_pages() with a limit other than SCATTERLIST_MAX_SEGMENT, and found __i915_gem_userptr_alloc_pages(), which at first glance appears to be doing something vaguely similar to amdgpu_ttm_tt_pin_userptr().
Robin.
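Concretely, the i915-style approach Robin points at amounts to building the table with the same segment limit the device advertises, so the list handed to dma_map_sg() never violates it. A minimal sketch, assuming a hypothetical alloc_sgt_capped() helper loosely modelled on __i915_gem_userptr_alloc_pages():

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int alloc_sgt_capped(struct device *dev, struct page **pages,
			    unsigned int n_pages, struct sg_table *sgt)
{
	/* Honour the device's advertised DMA segment limit... */
	unsigned int max_seg = dma_get_max_seg_size(dev);

	/* ...so no single scatterlist entry ever exceeds it. */
	return __sg_alloc_table_from_pages(sgt, pages, n_pages, 0,
					   (unsigned long)n_pages << PAGE_SHIFT,
					   max_seg, GFP_KERNEL);
}

This keeps the table self-consistent with max_seg_size before mapping, which is the "constrain the table allocation" half of the suggestion.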
On Wed, Apr 11, 2018 at 01:03:59PM +0100, Robin Murphy wrote:
On 10/04/18 21:59, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
So why not fix said code? It's clearly not a real hardware limitation, and the map_sg() APIs have potentially returned fewer than nents since forever, so there's really no excuse.
Yes, relying on dma_map_sg returning the same number of entries as were passed to it is completely bogus.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Disagree; this is a dodgy hack, since you'll now end up passing scatterlists into dma_map_sg() which already violate max_seg_size to begin with, and I think a conscientious DMA API implementation would be within its rights to fail the mapping for that reason (I know arm64 happens not to, but that was a deliberate design decision to make my life easier at the time).
Agreed.
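To make Christoph's point concrete, the robust pattern is to consume the count that dma_map_sg() actually returns rather than the nents that went in. A minimal sketch; map_and_walk() is a hypothetical illustration, not code from any of the drivers discussed:

#include <linux/dma-mapping.h>
#include <linux/printk.h>
#include <linux/scatterlist.h>

static int map_and_walk(struct device *dev, struct sg_table *sgt)
{
	struct scatterlist *sg;
	int mapped, i;

	/* dma_map_sg() returns the number of mapped (possibly merged)
	 * segments, which may be less than sgt->nents; 0 means failure. */
	mapped = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
	if (!mapped)
		return -ENOMEM;

	/* Walk only the merged DMA segments, not the original page list. */
	for_each_sg(sgt->sgl, sg, mapped, i)
		pr_debug("seg %d: addr %pad len %u\n",
			 i, &sg_dma_address(sg), sg_dma_len(sg));

	return 0;
}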
On 12.04.2018 at 08:26, Christoph Hellwig wrote:
On Wed, Apr 11, 2018 at 01:03:59PM +0100, Robin Murphy wrote:
On 10/04/18 21:59, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
So why not fix said code? It's clearly not a real hardware limitation, and the map_sg() APIs have potentially returned fewer than nents since forever, so there's really no excuse.
Yes, relying on dma_map_sg returning the same number of entries as were passed to it is completely bogus.
I agree that the common DRM functions should be able to deal with both, but we should let the driver side decide if it wants multiple addresses combined or not.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Disagree; this is a dodgy hack, since you'll now end up passing scatterlists into dma_map_sg() which already violate max_seg_size to begin with, and I think a conscientious DMA API implementation would be within its rights to fail the mapping for that reason (I know arm64 happens not to, but that was a deliberate design decision to make my life easier at the time).
Agreed.
Sounds like my understanding of max_seg_size is incorrect. What exactly should it describe?
Thanks, Christian.
On 12/04/18 10:42, Christian König wrote:
On 12.04.2018 at 08:26, Christoph Hellwig wrote:
On Wed, Apr 11, 2018 at 01:03:59PM +0100, Robin Murphy wrote:
On 10/04/18 21:59, Sinan Kaya wrote:
Code expects to observe the same number of buffers returned from dma_map_sg() as it passed to sg_alloc_table_from_pages(). This doesn't hold true universally, especially on systems with an IOMMU.
So why not fix said code? It's clearly not a real hardware limitation, and the map_sg() APIs have potentially returned fewer than nents since forever, so there's really no excuse.
Yes, relying on dma_map_sg returning the same number of entries as were passed to it is completely bogus.
I agree that the common DRM functions should be able to deal with both, but we should let the driver side decide if it wants multiple addresses combined or not.
The IOMMU driver tries to combine buffers into a single DMA address as much as it can. The right thing is to tell the DMA layer how much combining the IOMMU can do.
Disagree; this is a dodgy hack, since you'll now end up passing scatterlists into dma_map_sg() which already violate max_seg_size to begin with, and I think a conscientious DMA API implementation would be within its rights to fail the mapping for that reason (I know arm64 happens not to, but that was a deliberate design decision to make my life easier at the time).
Agreed.
Sounds like my understanding of max_seg_size is incorrect. What exactly should it describe?
The segment size and boundary mask are there to represent a device's hardware scatter-gather capabilities - a lot of things have limits on the size and alignment of the data pointed to by a single descriptor (e.g. an xHCI TRB, where the data buffer must not cross a 64KB boundary) - although they can also be relevant to non-scatter-gather devices if they just have limits on how big/aligned a single DMA transfer can be.
Robin.
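To illustrate what Robin describes, a driver whose DMA engine genuinely has such limits would declare them as in the sketch below. The device and its constraint values are made up for the example (a hypothetical engine that moves at most 64KB per descriptor and cannot cross a 64KB boundary, much like the xHCI TRB case above):

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

static int example_declare_dma_limits(struct device *dev)
{
	int ret;

	/* Each scatter-gather descriptor moves at most 64KB... */
	ret = dma_set_max_seg_size(dev, SZ_64K);
	if (ret)
		return ret;

	/* ...and a single segment must not cross a 64KB boundary. */
	return dma_set_seg_boundary(dev, SZ_64K - 1);
}

These values describe the hardware itself, which is why using them to encode a software assumption, as in the patch under discussion, is what Robin objects to.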