Hello Thierry,
Commit [1] introduced a severe GPU performance regression on Tegra20 and Tegra30.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
Interestingly, the performance is okay on Tegra30 if CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make a difference for Tegra20.
I told you about this problem on the #tegra IRC some time ago and you asked me to report it in a trackable form, so finally here it is.
You can reproduce the problem by running [2] like this: `grate/texture-filter -f -s`. It should produce over 100 FPS at a 720p display resolution, but currently it's ~11 FPS.
[2] https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter...
Previously I was seeing some memory errors coming from Host1x DMA, but I don't see any errors at all right now.
I don't see anything done horribly wrong in the offending commit.
Unfortunately I haven't been able to dedicate enough time to sit down and debug the problem thoroughly yet. Please let me know if you find a solution; I'll be happy to test it. Thanks in advance!
On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
[snip]
I suspect that the problem here is that we're now using the DMA API, which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall that that code doesn't coalesce entries in the SG table, so we may end up calling iommu_map() a lot of times, and miss out on much of the advantages that the ->iotlb_sync_map() gives us on Tegra20.
At the same time dma_map_sg() will flush caches, which we didn't do before. This we should be able to improve by passing the attribute DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache maintenance isn't needed.
And while thinking about it, one other difference is that with the DMA API we actually map/unmap the buffers for every submission. This is because the DMA API semantics require that buffers be mapped/unmapped every time you use them. Previously we would basically only map each buffer once (at allocation time) and only have to deal with cache maintenance, so the overhead per submission was drastically lower.
If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we may want to restore explicit IOMMU usage, at least on anything prior to Tegra124 where we're unlikely to ever use different IOMMU domains anyway (because they are such a scarce resource).
Thierry
13.12.2019 18:10, Thierry Reding пишет:
On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
[snip]
I suspect that the problem here is that we're now using the DMA API, which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall that that code doesn't coalesce entries in the SG table, so we may end up calling iommu_map() a lot of times, and miss out on much of the advantages that the ->iotlb_sync_map() gives us on Tegra20.
At the same time dma_map_sg() will flush caches, which we didn't do before. This we should be able to improve by passing the attribute DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache maintenance isn't needed.
And while thinking about it, one other difference is that with the DMA API we actually map/unmap the buffers for every submission. This is because the DMA API semantics require that buffers be mapped/unmapped every time you use them. Previously we would basically only map each buffer once (at allocation time) and only have to deal with cache maintenance, so the overhead per submission was drastically lower.
If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we may want to restore explicit IOMMU usage, at least on anything prior to Tegra124 where we're unlikely to ever use different IOMMU domains anyway (because they are such a scarce resource).
Tegra20 doesn't use an IOMMU in a vanilla upstream kernel (yet), so I don't think that's the root of the problem. Disabling the IOMMU for Tegra30 also didn't help (IIRC).
The offending patch shouldn't change anything with regard to the DMA API, unless I'm missing something. Strange..
Please keep me up-to-date!
13.12.2019 18:35, Dmitry Osipenko пишет:
13.12.2019 18:10, Thierry Reding пишет:
On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
[snip]
Hello Thierry,
I took another look at the problem, and here is what I found:
1) The "Optionally attach clients to the IOMMU" patch is wrong because:
1. host1x_drm_probe() is invoked *before* any of the host1x_client_iommu_attach() calls happen, so there is no way on earth 'use_explicit_iommu' could ever be true.
2. Not attaching DRM clients to the IOMMU if Host1x isn't attached is wrong, because Host1x is never attached in the case of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also makes no sense for T20/30, which do not support LPAE.
[1] https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L2...
2) Because of the above problems, the DRM clients are erroneously not getting attached to the IOMMU at all, and thus CMA is getting used for the BO allocations. Here come the problems introduced by the "gpu: host1x: Support DMA mapping of buffers" patch, which makes the DMA API perform CPU cache maintenance on each job submission, and apparently this is super bad for performance. This also makes no sense in comparison to the case of enabled IOMMU, where cache maintenance isn't performed at all (as it should be).
Please let me know if you're going to fix the problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2; it doesn't cover the case of imported buffers (which should be statically mapped, IIUC):
@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 	 * If we've manually mapped the buffer object through the IOMMU, make
 	 * sure to return the IOVA address of our mapping.
 	 */
-	if (phys && obj->mm) {
+	if (phys && (obj->mm || obj->vaddr)) {
 		*phys = obj->iova;
 		return NULL;
 	}

diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 25ca54de8fc5..69adfd66196b 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 	for (i = 0; i < job->num_relocs; i++) {
 		struct host1x_reloc *reloc = &job->relocs[i];
-		dma_addr_t phys_addr, *phys;
+		dma_addr_t phys_addr;
 		struct sg_table *sgt;

 		reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}

-		if (client->group)
-			phys = &phys_addr;
-		else
-			phys = NULL;
-
-		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
@@ -184,7 +179,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}

-		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
@@ -214,7 +209,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			job->unpins[job->num_unpins].size = gather_size;
 			phys_addr = iova_dma_addr(&host->iova, alloc);
-		} else {
+		} else if (sgt) {
 			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (!err) {
20.01.2020 05:53, Dmitry Osipenko пишет:
13.12.2019 18:35, Dmitry Osipenko пишет:
13.12.2019 18:10, Thierry Reding пишет:
On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
[snip]
Hello Thierry,
I took another look at the problem, and here is what I found:
The "Optionally attach clients to the IOMMU" patch is wrong because:
host1x_drm_probe() is invoked *before* any of the host1x_client_iommu_attach() happens, so there is no way on earth the 'use_explicit_iommu' could ever be true.
Not attaching DRM clients to the IOMMU if Host1x isn't attached is wrong, because Host1x is never attached in the case of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also makes no sense for T20/30, which do not support LPAE.
[1] https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L2...
- Because of the above problems, the DRM clients are erroneously not getting attached to the IOMMU at all and thus CMA is getting used for the BO allocations. Here come the problems introduced by the "gpu: host1x: Support DMA mapping of buffers" patch, which makes the DMA API perform CPU cache maintenance on each job submission, and apparently this is super bad for performance. This also makes no sense in comparison to the case of enabled IOMMU, where cache maintenance isn't performed at all (as it should be).
Please let me know if you're going to fix the problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2; it doesn't cover the case of imported buffers (which should be statically mapped, IIUC):
...
v5.5 has now been released with the GPU driver unusable. Thierry, could you please let me know whether you're planning to do something about it? Should I help?
On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
13.12.2019 18:35, Dmitry Osipenko пишет:
13.12.2019 18:10, Thierry Reding пишет:
On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
[snip]
Hello Thierry,
I took another look at the problem, and here is what I found:
The "Optionally attach clients to the IOMMU" patch is wrong because:
- host1x_drm_probe() is invoked *before* any of the host1x_client_iommu_attach() happens, so there is no way on earth the 'use_explicit_iommu' could ever be true.
That's not correct. host1x_client_iommu_attach() happens during host1x_device_init(), which is called during host1x_drm_probe(). The idea is that host1x_drm_probe() sets up the minimum IOMMU so that all devices can attach, if they want to. If any of them connect (because they aren't already attached via something like the DMA/IOMMU glue) then the tegra->use_explicit_iommu is set to true, in which case the IOMMU domain setup for explicit IOMMU API usage is completed. If, on the other hand, none of the clients have a need for the explicit IOMMU domain, there's no need to set it up and host1x_drm_probe() will just discard it.
2. Not attaching DRM clients to the IOMMU if Host1x isn't attached is wrong, because Host1x is never attached in the case of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also makes no sense for T20/30, which do not support LPAE.
It's not at all wrong. Take for example the case of Tegra124 and Tegra210, where host1x and its clients can address 34 bits. In those cases, allocating individual pages via shmem has a high probability of hitting physical addresses beyond the 32-bit range, which means that the host1x cannot access them unless it is also attached to an IOMMU where physical addresses >= 4 GiB can be translated into < 4 GiB virtual addresses. This is a very real problem that I was running into when testing on Tegra124 and Tegra210.
But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We should be able to remedy the situation on Tegra20 and Tegra30 by adding another check, based on the DMA mask. Something like the below should work:
--- >8 ---
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index aa9e49f04988..bd268028fb3d 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1037,23 +1037,9 @@ void tegra_drm_free(struct tegra_drm *tegra, size_t size, void *virt,
 	free_pages((unsigned long)virt, get_order(size));
 }

-static int host1x_drm_probe(struct host1x_device *dev)
+static bool host1x_drm_wants_iommu(struct host1x_device *dev)
 {
-	struct drm_driver *driver = &tegra_drm_driver;
 	struct iommu_domain *domain;
-	struct tegra_drm *tegra;
-	struct drm_device *drm;
-	int err;
-
-	drm = drm_dev_alloc(driver, &dev->dev);
-	if (IS_ERR(drm))
-		return PTR_ERR(drm);
-
-	tegra = kzalloc(sizeof(*tegra), GFP_KERNEL);
-	if (!tegra) {
-		err = -ENOMEM;
-		goto put;
-	}

 	/*
 	 * If the Tegra DRM clients are backed by an IOMMU, push buffers are
@@ -1082,9 +1068,38 @@ static int host1x_drm_probe(struct host1x_device *dev)
 	 * up the device tree appropriately. This is considered an problem
 	 * of integration, so care must be taken for the DT to be consistent.
 	 */
-	domain = iommu_get_domain_for_dev(drm->dev->parent);
+	domain = iommu_get_domain_for_dev(dev->dev.parent);
+
+	/*
+	 * Tegra20 and Tegra30 don't support addressing memory beyond the
+	 * 32-bit boundary, so the regular GATHER opcodes will always be
+	 * sufficient and whether or not the host1x is attached to an IOMMU
+	 * doesn't matter.
+	 */
+	if (!domain && dma_get_mask(dev->dev.parent) <= DMA_BIT_MASK(32))
+		return true;
+
+	return domain != NULL;
+}
+
+static int host1x_drm_probe(struct host1x_device *dev)
+{
+	struct drm_driver *driver = &tegra_drm_driver;
+	struct tegra_drm *tegra;
+	struct drm_device *drm;
+	int err;
+
+	drm = drm_dev_alloc(driver, &dev->dev);
+	if (IS_ERR(drm))
+		return PTR_ERR(drm);
+
+	tegra = kzalloc(sizeof(*tegra), GFP_KERNEL);
+	if (!tegra) {
+		err = -ENOMEM;
+		goto put;
+	}

-	if (domain && iommu_present(&platform_bus_type)) {
+	if (host1x_drm_wants_iommu(dev) && iommu_present(&platform_bus_type)) {
 		tegra->domain = iommu_domain_alloc(&platform_bus_type);
 		if (!tegra->domain) {
 			err = -ENOMEM;
--- >8 ---
[1] https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L2...
- Because of the above problems, the DRM clients are erroneously not getting attached to the IOMMU at all and thus CMA is getting used for the BO allocations. Here come the problems introduced by the "gpu: host1x: Support DMA mapping of buffers" patch, which makes the DMA API perform CPU cache maintenance on each job submission, and apparently this is super bad for performance. This also makes no sense in comparison to the case of enabled IOMMU, where cache maintenance isn't performed at all (as it should be).
It actually does make a lot of sense. Very strictly speaking, we were violating the DMA API prior to the above patch because we were not DMA mapping the buffers at all. Whenever you pass a buffer to hardware you need to map it for the device. At that point, the kernel does not know whether or not the buffer is dirty, so it has to perform a cache flush. Similarly, when the hardware is done with a buffer, we need to unmap it so that the CPU can access it again. This typically requires a cache invalidate.
That things even worked to begin with was more by accident than by design.
So yes, this is different from what we were doing before, but it's actually the right thing to do. That said, I'm sure we can find ways to optimize this. For example, as part of the DMA API conversion series I added the possibility to set direction flags for relocation buffers. In cases where it is known that a certain buffer will only be used for reading, we should be able to avoid at least the cache invalidate operation after a job is done, since the hardware won't have modified the contents (when using an SMMU this can even be enforced). It's slightly trickier to avoid cache flushes. For buffers that are only going to be written, there's no need to flush the cache because the CPU's changes can be assumed to be overwritten by the hardware anyway. However, we still need to make sure that we invalidate the caches in that case to ensure subsequent cache flushes don't overwrite data already written by hardware.
One other potential optimization I can imagine is to add flags to make cache maintenance optional on buffers when we know it's safe to do so. I'm not sure we can always know, so this is going to require further thought.
Please let me know if you're going to fix the problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2; it doesn't cover the case of imported buffers (which should be statically mapped, IIUC):
@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 	 * If we've manually mapped the buffer object through the IOMMU, make
 	 * sure to return the IOVA address of our mapping.
 	 */
-	if (phys && obj->mm) {
+	if (phys && (obj->mm || obj->vaddr)) {
 		*phys = obj->iova;
This doesn't work for the case where we use the DMA API for mapping. Or at least it isn't going to work in the general case. The reason is because obj->iova is only valid for whatever the device was that mapped or allocated the buffer, which in this case is the host1x device, which isn't even a real device, so it won't work. The only case where it does work is if we're not behind an IOMMU, so obj->iova will actually be the physical address.
So what this basically ends up doing is avoiding dma_map_*() all the time, which I guess is what you're trying to achieve. But it also gives you the wrong I/O virtual address in any case where an IOMMU is involved. Also, as discussed above, avoiding cache maintenance isn't correct.
Thierry
[snip]
29.01.2020 15:39, Thierry Reding пишет:
On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
13.12.2019 18:35, Dmitry Osipenko пишет:
...
Hello Thierry,
I took another look at the problem, and here is what I found:
The "Optionally attach clients to the IOMMU" patch is wrong because:
- host1x_drm_probe() is invoked *before* any of the host1x_client_iommu_attach() happens, so there is no way on earth the 'use_explicit_iommu' could ever be true.
That's not correct. host1x_client_iommu_attach() happens during host1x_device_init(), which is called during host1x_drm_probe().
Looks like I previously got confused by accident, my bad.
The idea is that host1x_drm_probe() sets up the minimum IOMMU so that all devices can attach, if they want to. If any of them connect (because they aren't already attached via something like the DMA/IOMMU glue) then the tegra->use_explicit_iommu is set to true, in which case the IOMMU domain setup for explicit IOMMU API usage is completed. If, on the other hand, none of the clients have a need for the explicit IOMMU domain, there's no need to set it up and host1x_drm_probe() will just discard it.
This matches my understanding of what you wanted to achieve, thanks.
2. Not attaching DRM clients to the IOMMU if Host1x isn't attached is wrong, because Host1x is never attached in the case of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also makes no sense for T20/30, which do not support LPAE.
It's not at all wrong. Take for example the case of Tegra124 and Tegra210, where host1x and its clients can address 34 bits. In those cases, allocating individual pages via shmem has a high probability of hitting physical addresses beyond the 32-bit range, which means that the host1x cannot access them unless it is also attached to an IOMMU where physical addresses >= 4 GiB can be translated into < 4 GiB virtual addresses. This is a very real problem that I was running into when testing on Tegra124 and Tegra210.
Why not set the DMA mask to 32 bits if the IOMMU is unavailable?
I'm a bit puzzled by the actual need to support the case where Host1x is backed by an IOMMU and the clients are not. How could we ever end up with this scenario in the upstream kernel?
What about the reverse scenario? You won't be able to patch the cmdstream properly for >32-bit addresses.
The root of the problem is that the Tegra DRM UAPI doesn't support 64-bit addresses, so you can't use "wide" opcodes and can't patch the cmdstream.
Perhaps it's better not to add any new things or quirks to Host1x / Tegra DRM for now. The drivers need a serious cleanup, otherwise the mess will only continue to grow. Don't you think so?
But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We should be able to remedy the situation on Tegra20 and Tegra30 by adding another check, based on the DMA mask. Something like the below should work:
--- >8 ---
[snip]
--- >8 ---
This works, thanks.
[1] https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L2...
- Because of the above problems, the DRM clients are erroneously not getting attached to the IOMMU at all and thus CMA is getting used for the BO allocations. Here come the problems introduced by the "gpu: host1x: Support DMA mapping of buffers" patch, which makes the DMA API perform CPU cache maintenance on each job submission, and apparently this is super bad for performance. This also makes no sense in comparison to the case of enabled IOMMU, where cache maintenance isn't performed at all (as it should be).
It actually does make a lot of sense. Very strictly speaking, we were violating the DMA API prior to the above patch because we were not DMA mapping the buffers at all. Whenever you pass a buffer to hardware you need to map it for the device. At that point, the kernel does not know whether or not the buffer is dirty, so it has to perform a cache flush. Similarly, when the hardware is done with a buffer, we need to unmap it so that the CPU can access it again. This typically requires a cache invalidate.
That things even worked to begin with was more by accident than by design.
So yes, this is different from what we were doing before, but it's actually the right thing to do. That said, I'm sure we can find ways to optimize this. For example, as part of the DMA API conversion series I added the possibility to set direction flags for relocation buffers. In cases where it is known that a certain buffer will only be used for reading, we should be able to avoid at least the cache invalidate operation after a job is done, since the hardware won't have modified the contents (when using an SMMU this can even be enforced). It's slightly trickier to avoid cache flushes. For buffers that are only going to be written, there's no need to flush the cache because the CPU's changes can be assumed to be overwritten by the hardware anyway. However, we still need to make sure that we invalidate the caches in that case to ensure subsequent cache flushes don't overwrite data already written by hardware.
One other potential optimization I can imagine is to add flags to make cache maintenance optional on buffers when we know it's safe to do so. I'm not sure we can always know, so this is going to require further thought.
Doesn't sound good to me.. this is not going to be good for GPU drivers. All cache maintenance should be in control of userspace, the userspace should be telling kernel driver when it needs to get CPU access and when to finish the access. DMABUF has generic UAPI for the synchronizations, although a mature GPU driver may need more than that.
Today Tegra DRM driver supports only write-combined BO allocations, and thus, we don't need to do more than to flush CPU buffers before executing HW job.
Please let me know if you're going to fix the problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2, it doesn't cover case of imported buffers (which should be statically mapped, IIUC):
@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo, * If we've manually mapped the buffer object through the IOMMU, make * sure to return the IOVA address of our mapping. */
if (phys && obj->mm) {
if (phys && (obj->mm || obj->vaddr)) { *phys = obj->iova;
This doesn't work for the case where we use the DMA API for mapping. Or at least it isn't going to work in the general case.
Right, looks like I'll need to update my memory about the DMA API usage.
The reason is because obj->iova is only valid for whatever the device was that mapped or allocated the buffer, which in this case is the host1x device, which isn't even a real device, so it won't work. The only case where it does work is if we're not behind an IOMMU, so obj->iova will actually be the physical address.
But why do you need to dynamically map/unmap the statically-allocated buffers on each job submission, could you please explain what is the point? Perhaps it's a temporary workaround just to get a minimum of things working for the case of implicit IOMMU?
All buffers should be statically allocated and statically mapped, and when there is a need to sync an already mapped buffer, the dma_sync_* API should be used.
Like I said above, the syncing should be done by userspace for the buffers that are in control of userspace.
So what this basically ends up doing is avoid dma_map_*() all the time, which I guess is what you're trying to achieve. But it also gives you the wrong I/O virtual address in any case where an IOMMU is involved. Also, as discussed above, avoiding cache maintenance isn't correct.
Alright, then right now we need to bypass the dma_map_*() in a case of a non-implicit IOMMU, in order to bring back the good old behavior (at least temporary, until there will be a more comprehensive solution).
What do you think about this variant:
--- >8 --- diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c index 1237df157e05..555a6424e9fa 100644 --- a/drivers/gpu/drm/tegra/gem.c +++ b/drivers/gpu/drm/tegra/gem.c @@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo, dma_addr_t *phys) { struct tegra_bo *obj = host1x_to_tegra_bo(bo); + struct tegra_drm *tegra = obj->gem.dev->dev_private; struct sg_table *sgt; int err;
- /* - * If we've manually mapped the buffer object through the IOMMU, make - * sure to return the IOVA address of our mapping. - */ - if (phys && obj->mm) { - *phys = obj->iova; - return NULL; + if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) { + /* if IOMMU isn't available, return IOVA=PHYS of the mapping */ + if (obj->vaddr) { + *phys = obj->iova; + return NULL; + } + + /* + * If we've manually mapped the buffer object through the + * IOMMU, make sure to return the IOVA address of our mapping. + */ + if (obj->mm) { + *phys = obj->iova; + return NULL; + } }
/* diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c index 6d1872ddef17..91304b9034fa 100644 --- a/drivers/gpu/drm/tegra/plane.c +++ b/drivers/gpu/drm/tegra/plane.c @@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
for (i = 0; i < state->base.fb->format->num_planes; i++) { struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i); + struct sg_table *sgt;
- if (!dc->client.group) { - struct sg_table *sgt; - - sgt = host1x_bo_pin(dc->dev, &bo->base, NULL); - if (IS_ERR(sgt)) { - err = PTR_ERR(sgt); - goto unpin; - } + sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]); + if (IS_ERR(sgt)) { + err = PTR_ERR(sgt); + goto unpin; + }
+ if (sgt) { err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE); if (err == 0) { @@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
state->iova[i] = sg_dma_address(sgt->sgl); state->sgt[i] = sgt; - } else { - state->iova[i] = bo->iova; } }
@@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state) struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i); struct sg_table *sgt = state->sgt[i];
- dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE); - host1x_bo_unpin(dc->dev, &bo->base, sgt); + if (sgt) { + dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, + DMA_TO_DEVICE); + host1x_bo_unpin(dc->dev, &bo->base, sgt); + }
state->iova[i] = DMA_MAPPING_ERROR; state->sgt[i] = NULL; diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c index 60b2fedd0061..538c0f0b40a4 100644 --- a/drivers/gpu/host1x/job.c +++ b/drivers/gpu/host1x/job.c @@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
for (i = 0; i < job->num_relocs; i++) { struct host1x_reloc *reloc = &job->relocs[i]; - dma_addr_t phys_addr, *phys; + dma_addr_t phys_addr; struct sg_table *sgt;
reloc->target.bo = host1x_bo_get(reloc->target.bo); @@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job) goto unpin; }
- if (client->group) - phys = &phys_addr; - else - phys = NULL; - - sgt = host1x_bo_pin(dev, reloc->target.bo, phys); + sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr); if (IS_ERR(sgt)) { err = PTR_ERR(sgt); goto unpin; @@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job) job->num_unpins++; }
+ /* + * In a case of enabled firewall, all user gathers are copied into + * the secured job->gather_copy_mapped. + */ + if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL)) + return 0; + for (i = 0; i < job->num_gathers; i++) { struct host1x_job_gather *g = &job->gathers[i]; size_t gather_size = 0; @@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job) goto unpin; }
- sgt = host1x_bo_pin(host->dev, g->bo, NULL); + sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr); if (IS_ERR(sgt)) { err = PTR_ERR(sgt); goto unpin; }
- if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) { + if (host->domain) { for_each_sg(sgt->sgl, sg, sgt->nents, j) gather_size += sg->length; gather_size = iova_align(&host->iova, gather_size); @@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
job->unpins[job->num_unpins].size = gather_size; phys_addr = iova_dma_addr(&host->iova, alloc); - } else { + } else if (sgt) { err = dma_map_sg(host->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE); if (!err) {
--- >8 ---
On Thu, Jan 30, 2020 at 07:36:36AM +0300, Dmitry Osipenko wrote:
On 29.01.2020 15:39, Thierry Reding wrote:
On Mon, Jan 20, 2020 at 05:53:03AM +0300, Dmitry Osipenko wrote:
On 13.12.2019 18:35, Dmitry Osipenko wrote:
...
Hello Thierry,
I took another look at the problem and here what was found:
The "Optionally attach clients to the IOMMU" patch is wrong because:
- host1x_drm_probe() is invoked *before* any of the host1x_client_iommu_attach() happens, so there is no way on earth the 'use_explicit_iommu' could ever be true.
That's not correct. host1x_client_iommu_attach() happens during host1x_device_init(), which is called during host1x_drm_probe().
Looks like I previously got confused by accident, my bad.
The idea is that host1x_drm_probe() sets up the minimum IOMMU so that all devices can attach, if they want to. If any of them connect (because they aren't already attached via something like the DMA/IOMMU glue) then the tegra->use_explicit_iommu is set to true, in which case the IOMMU domain setup for explicit IOMMU API usage is completed. If, on the other hand, none of the clients have a need for the explicit IOMMU domain, there's no need to set it up and host1x_drm_probe() will just discard it.
This matches my understanding of what you wanted to achieve, thanks.
2. Not attaching DRM clients to the IOMMU if Host1x isn't attached is wrong, because Host1x is never attached in the case of CONFIG_TEGRA_HOST1X_FIREWALL=y [1], and this also makes no sense for T20/30, which do not support LPAE.
It's not at all wrong. Take for example the case of Tegra124 and Tegra210, where host1x and its clients can address 34 bits. In those cases, allocating individual pages via shmem has a high probability of hitting physical addresses beyond the 32-bit range, which means that host1x cannot access them unless it is also attached to an IOMMU where physical addresses >= 4 GiB can be translated into < 4 GiB virtual addresses. This is a very real problem that I was running into when testing on Tegra124 and Tegra210.
Why not set the DMA mask to 32 bits if the IOMMU is unavailable?
We already do that. If you look at host1x_iommu_init() in drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is available and the host1x doesn't support wide GATHER opcodes, then we limit the DMA Mask to 32 bits.
But that's not enough, see below.
I'm a bit puzzled by the actual need to support the case where Host1x is backed by an IOMMU while the clients are not. How could we ever end up with this scenario in the upstream kernel?
That's not what we're doing here. The fundamental problem is that we have a couple of generations where the hardware is mismatched in that clients support 34-bit addresses while host1x can only use 32-bit addresses in the GATHER opcode. The only way to get around this mismatch is by using an IOMMU.
However, with an IOMMU enabled for clients, we can run into the case where sparse pages would be allocated via shmem and end up beyond the 32-bit boundary. If the host1x is not attached to an IOMMU, there's no way for it to access these pages with standard GATHER opcodes.
This is what used to happen prior to this change when the host1x firewall was enabled. Since we were not attaching it to an IOMMU in that case, we would end up with sparse buffers allocated from pages that the host1x couldn't address.
What about the reverse scenario? You won't be able to patch the cmdstream properly for >32-bit addresses.
I don't think that scenario exists. I'm not aware of a Tegra device that has system memory outside of the CPU-addressable region.
The root of the problem is that the Tegra DRM UAPI doesn't support 64-bit addresses, so you can't use "wide" opcodes and can't patch the cmdstream.
There's nothing in the UAPI that deals with addresses directly. We only pass around handles and these are resolved to buffer objects in the kernel where the address of the buffers can be 32-bit or 64-bit.
And we do in fact support wide opcodes and patch command streams just fine on 64-bit systems.
I mean, it's not like I've been doing this just for the fun of it. There are actual configurations where this is needed in order for it to work.
Perhaps it is better not to add any new things or quirks to Host1x / Tegra DRM for now. The drivers need a serious clean-up, otherwise the mess will only continue to grow. Don't you think so?
This isn't anything new or a quirk. This is bug fixes to ensure that the driver works in (now hopefully) all configurations. Previously it was a matter of getting the configuration just right in order for it to work. All the work I did here (starting with the wide opcode support and then the DMA API and IOMMU work) was to make sure it would safely work in any setup.
And I do consider these changes to also be cleanups and incremental improvements of what the state was before. Again, I don't consider a rewrite a serious cleanup.
I'm fully aware that the driver has been collecting dust for a while and it isn't perfect. But it's also not overly messy. It's perhaps a bit more complex than your average driver, but it's also some pretty complex hardware.
But I agree that this shouldn't be necessary on Tegra20 and Tegra30. We should be able to remedy the situation on Tegra20 and Tegra30 by adding another check, based on the DMA mask. Something like the below should work:
--- >8 ---
[snip]
--- >8 ---
This works, thanks.
Great, I'll send this out then.
[1] https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L2...
- Because of the above problems, the DRM clients are erroneously not getting attached to the IOMMU at all, and thus CMA is used for the BO allocations. Here come the problems introduced by the "gpu: host1x: Support DMA mapping of buffers" patch, which makes the DMA API perform CPU cache maintenance on each job submission, and apparently this is super bad for performance. This also makes no sense in comparison to the case of an enabled IOMMU, where no cache maintenance is performed at all (as it should be).
It actually does make a lot of sense. Very strictly speaking, we were violating the DMA API prior to the above patch because we were not DMA-mapping the buffers at all. Whenever you pass a buffer to hardware you need to map it for the device. At that point, the kernel does not know whether or not the buffer is dirty, so it has to perform a cache flush. Similarly, when the hardware is done with a buffer, we need to unmap it so that the CPU can access it again. This typically requires a cache invalidate.
That things even worked to begin with was more by accident than by design.
So yes, this is different from what we were doing before, but it's actually the right thing to do. That said, I'm sure we can find ways to optimize this. For example, as part of the DMA API conversion series I added the possibility to set direction flags for relocation buffers. In cases where it is known that a certain buffer will only be used for reading, we should be able to avoid at least the cache invalidate operation after a job is done, since the hardware won't have modified the contents (when using an SMMU this can even be enforced). It's slightly trickier to avoid cache flushes. For buffers that are only going to be written, there's no need to flush the cache because the CPU's changes can be assumed to be overwritten by the hardware anyway. However, we still need to invalidate the caches in that case to ensure that subsequent writebacks of stale cache lines don't overwrite data already written by the hardware.
One other potential optimization I can imagine is to add flags to make cache maintenance optional on buffers when we know it's safe to do so. I'm not sure we can always know, so this is going to require further thought.
Doesn't sound good to me. This is not going to work well for GPU drivers. All cache maintenance should be under the control of userspace: userspace should tell the kernel driver when it needs to get CPU access and when it's finished with that access. DMA-BUF has a generic UAPI for these synchronizations, although a mature GPU driver may need more than that.
I agree. But that's not something that we can do at this point. We don't have a way of passing information such as this to the driver, so the driver has to assume that caches are dirty for all buffers, otherwise it will not be able to guarantee that random cache flushes won't corrupt the data that it's passing to the hardware.
So yes, when we do have a way of explicitly flushing the caches for buffers, then we can add a mechanism to pass that information to the kernel so that it can optimize. But until then we just can't be sure. And I prefer a kernel driver that gives me slow and reliable, rather than fast but unpredictable results.
Today the Tegra DRM driver supports only write-combined BO allocations, and thus we don't need to do more than flush the CPU write buffers before executing a HW job.
That's only true when you allocate with the DMA API. When you allocate from a shmem mapping you don't get write-combined memory, so we do have to perform cache maintenance on the pages.
Please let me know if you're going to fix these problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2; it doesn't cover the case of imported buffers (which should be statically mapped, IIUC):
@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 	 * If we've manually mapped the buffer object through the IOMMU, make
 	 * sure to return the IOVA address of our mapping.
 	 */
-	if (phys && obj->mm) {
+	if (phys && (obj->mm || obj->vaddr)) {
 		*phys = obj->iova;
This doesn't work for the case where we use the DMA API for mapping. Or at least it isn't going to work in the general case.
Right, looks like I'll need to update my memory about the DMA API usage.
The reason is because obj->iova is only valid for whatever the device was that mapped or allocated the buffer, which in this case is the host1x device, which isn't even a real device, so it won't work. The only case where it does work is if we're not behind an IOMMU, so obj->iova will actually be the physical address.
But why do you need to dynamically map/unmap the statically-allocated buffers on each job submission? Could you please explain what the point is? Perhaps it's a temporary workaround just to get a minimum of things working for the case of an implicit IOMMU?
It's primarily because we don't really know if a buffer has been mapped for a specific device. We always map at allocation time for the Tegra DRM parent device (which isn't a real device) but before it's used by any other host1x client, it has to be mapped for that device as well. That's important in case any of these devices have different IOMMU domains.
Actually, given that the device isn't a real device, the DMA handle returned from dma_alloc_wc() is actually a physical address and not valid for any device behind an IOMMU. So even in the case where we share an IOMMU domain among multiple devices, the mapping created by dma_alloc_wc() is useless for them.
Because of the above and probably a bunch of other reasons, it's also a requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA API will start printing a bunch of errors if you violate those rules, and they typically indicate that what you're doing may not work. That doesn't mean it can't work, but it usually only does so accidentally.
All buffers should be statically allocated and statically mapped, and when there is a need to sync an already mapped buffer, the dma_sync_* API should be used.
That's my understanding as well. However, it's slightly more complicated than that. Once you move away from the assumption that a mapping for a buffer is the same for all devices, you can no longer do that. For example, while the statically allocated buffer may be mapped for the Tegra DRM device (again, it's not a real device), it's not mapped for any of the clients yet. So before a client can use it, its driver has to go and map the buffer for the device. The logical point to do that is during host1x_job_pin(). Once the client no longer has a use for the buffer, it should also unmap the buffer again, because it will otherwise occupy IOVA space unnecessarily. The logical point to do that is during host1x_job_unpin().
host1x_job_unpin() is also the point at which the job releases its reference to the buffer, so the backing memory could go away at any point after that, which means that the IOVA mapping could point at invalid memory if we didn't unmap the buffer.
I'm not aware of an easy way to optimize this while at the same time making sure that everything is still consistent. I suppose one way to do this would be to keep a cache of buffer objects and their mappings for each device and avoid mapping/unmapping them for every job. The problem with that is that we also don't want to hold on to buffer objects indefinitely because that will potentially cause a lot of memory and IOVA space to be used.
Like I said above, the syncing should be done by userspace for the buffers that are in control of userspace.
Yes, I agree. We already have an implementation of the .begin_cpu_access and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add something like that for native buffers. Alternatively, I suppose userspace could be required to flush/invalidate using the DMA-BUF, but that potentially has the drawback of having to export a DMA-BUF for every single buffer.
A better option may be to add a driver-specific IOCTL to do cache maintenance. I think other drivers already do that.
So what this basically ends up doing is avoid dma_map_*() all the time, which I guess is what you're trying to achieve. But it also gives you the wrong I/O virtual address in any case where an IOMMU is involved. Also, as discussed above, avoiding cache maintenance isn't correct.
Alright, then right now we need to bypass dma_map_*() in the case of a non-implicit IOMMU, in order to bring back the good old behaviour (at least temporarily, until there is a more comprehensive solution).
But that's not the good old behaviour. You're still going to side-step the cache maintenance and violate the DMA API semantics.
Thierry
What do you think about this variant:
--- >8 ---
diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 1237df157e05..555a6424e9fa 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 			     dma_addr_t *phys)
 {
 	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
+	struct tegra_drm *tegra = obj->gem.dev->dev_private;
 	struct sg_table *sgt;
 	int err;
 
-	/*
-	 * If we've manually mapped the buffer object through the IOMMU, make
-	 * sure to return the IOVA address of our mapping.
-	 */
-	if (phys && obj->mm) {
-		*phys = obj->iova;
-		return NULL;
+	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
+		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
+		if (obj->vaddr) {
+			*phys = obj->iova;
+			return NULL;
+		}
+
+		/*
+		 * If we've manually mapped the buffer object through the
+		 * IOMMU, make sure to return the IOVA address of our mapping.
+		 */
+		if (obj->mm) {
+			*phys = obj->iova;
+			return NULL;
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
index 6d1872ddef17..91304b9034fa 100644
--- a/drivers/gpu/drm/tegra/plane.c
+++ b/drivers/gpu/drm/tegra/plane.c
@@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 
 	for (i = 0; i < state->base.fb->format->num_planes; i++) {
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
+		struct sg_table *sgt;
 
-		if (!dc->client.group) {
-			struct sg_table *sgt;
-
-			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
-			if (IS_ERR(sgt)) {
-				err = PTR_ERR(sgt);
-				goto unpin;
-			}
+		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
+		if (IS_ERR(sgt)) {
+			err = PTR_ERR(sgt);
+			goto unpin;
+		}
 
+		if (sgt) {
 			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (err == 0) {
@@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 
 			state->iova[i] = sg_dma_address(sgt->sgl);
 			state->sgt[i] = sgt;
-		} else {
-			state->iova[i] = bo->iova;
 		}
 	}
 
@@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
 		struct sg_table *sgt = state->sgt[i];
 
-		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
-		host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		if (sgt) {
+			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
+				     DMA_TO_DEVICE);
+			host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		}
 
 		state->iova[i] = DMA_MAPPING_ERROR;
 		state->sgt[i] = NULL;
diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 60b2fedd0061..538c0f0b40a4 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 
 	for (i = 0; i < job->num_relocs; i++) {
 		struct host1x_reloc *reloc = &job->relocs[i];
-		dma_addr_t phys_addr, *phys;
+		dma_addr_t phys_addr;
 		struct sg_table *sgt;
 
 		reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}
 
-		if (client->group)
-			phys = &phys_addr;
-		else
-			phys = NULL;
-
-		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
@@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 		job->num_unpins++;
 	}
 
+	/*
+	 * In a case of enabled firewall, all user gathers are copied into
+	 * the secured job->gather_copy_mapped.
+	 */
+	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
+		return 0;
+
 	for (i = 0; i < job->num_gathers; i++) {
 		struct host1x_job_gather *g = &job->gathers[i];
 		size_t gather_size = 0;
@@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}
 
-		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
 		}
 
-		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
+		if (host->domain) {
 			for_each_sg(sgt->sgl, sg, sgt->nents, j)
 				gather_size += sg->length;
 			gather_size = iova_align(&host->iova, gather_size);
@@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 
 			job->unpins[job->num_unpins].size = gather_size;
 			phys_addr = iova_dma_addr(&host->iova, alloc);
-		} else {
+		} else if (sgt) {
 			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (!err) {
--- >8 ---
On 30.01.2020 15:08, Thierry Reding wrote:
...
Why not set the DMA mask to 32 bits if the IOMMU is unavailable?
We already do that. If you look at host1x_iommu_init() in drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is available and the host1x doesn't support wide GATHER opcodes, then we limit the DMA Mask to 32 bits.
But that's not enough, see below.
I'm a bit puzzled by the actual need to support the case where Host1x is backed by an IOMMU while the clients are not. How could we ever end up with this scenario in the upstream kernel?
That's not what we're doing here. The fundamental problem is that we have a couple of generations where the hardware is mismatched in that clients support 34-bit addresses while host1x can only use 32-bit addresses in the GATHER opcode. The only way to get around this mismatch is by using an IOMMU.
Isn't it possible to limit the DMA mask to 32 bits for both the clients and Host1x?
diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 6a995db51d6d..4253dd53ea2e 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -190,12 +190,28 @@ static void host1x_subdev_unregister(struct host1x_device *device,
  */
 int host1x_device_init(struct host1x_device *device)
 {
+	struct host1x *host = dev_get_drvdata(device->dev.parent);
+	struct iommu_domain *domain = iommu_get_domain_for_dev(host->dev);
 	struct host1x_client *client;
+	u64 mask;
 	int err;
 
 	mutex_lock(&device->clients_lock);
 
 	list_for_each_entry(client, &device->clients, list) {
+		mask = dma_get_mask(client->dev);
+
+		if (!domain && !host->info->has_wide_gather &&
+		    mask > DMA_BIT_MASK(32)) {
+			err = dma_coerce_mask_and_coherent(client->dev,
+							   DMA_BIT_MASK(32));
+			if (err < 0) {
+				dev_err(&device->dev,
+					"failed to set DMA mask: %d\n", err);
+				goto teardown;
+			}
+		}
+
 		if (client->ops && client->ops->init) {
 			err = client->ops->init(client);
 			if (err < 0) {
However, with an IOMMU enabled for clients, we can run into the case where sparse pages would be allocated via shmem and end up beyond the 32-bit boundary. If the host1x is not attached to an IOMMU, there's no way for it to access these pages with standard GATHER opcodes.
This is what used to happen prior to this change when the host1x firewall was enabled.
I assume you mean the *disabled* firewall, because gathers are copied into Host1x's DMA buffer when the firewall is enabled.
Since we were not attaching it to an IOMMU in that case, we would end up with sparse buffers allocated from pages that the host1x couldn't address.
Could you please clarify what you mean by shmem? If you mean get_pages(), then one option would be to simply enforce the Host1x firewall so that gathers are always copied, and thus we could remove the TEGRA_HOST1X_FIREWALL Kconfig option.
The other option could be *not* to use get_pages() whenever it can't work.
It also should be worthwhile to simply restrict the kernel's configuration by disabling all functionality that requires an IOMMU when an IOMMU is not available. Just error out with an informative message in that case, telling the user that the kernel/device-tree configuration is wrong.
Please notice that the grate drivers do not work with the firewall disabled anyway, because it's simply impractical/unusable from a userspace perspective to track the liveness of the gather BO.
Also, take a look at the grate-kernel's WIP patch where Host1x is separated from Tegra DRM and Host1x maintains the gather buffer allocations by itself. In the case of the legacy UAPI, the gather commands are copied from the DRM GEM to Host1x's BO, similarly to what we now have upstream for the case of the enabled firewall, but in a more optimized manner (using a pre-allocated pool, etc.). I'd like to see the upstream driver doing the same if we don't end up dropping the legacy staging UAPI entirely.
What about the reverse scenario? You won't be able to patch the cmdstream properly for >32-bit addresses.
I don't think that scenario exists. I'm not aware of a Tegra device that has system memory outside of the CPU-addressable region.
The root of the problem is that the Tegra DRM UAPI doesn't support 64-bit addresses, so you can't use "wide" opcodes and can't patch the cmdstream.
There's nothing in the UAPI that deals with addresses directly. We only pass around handles and these are resolved to buffer objects in the kernel where the address of the buffers can be 32-bit or 64-bit.
If a buffer address is 64-bit, then the client's address registers take two 32-bit hardware registers, and thus the cmdstream should reserve two words for each patched handle.
The Host1x driver doesn't support patching of 64-bit addresses, and there is no way to tell userspace to reserve two words per address using the current UAPI.
And we do in fact support wide opcodes and patch command streams just fine on 64-bit systems.
I mean, it's not like I've been doing this just for the fun of it. There are actual configurations where this is needed in order for it to work.
Perhaps it is better not to add any new things or quirks to Host1x / Tegra DRM for now. The drivers need a serious clean-up, otherwise the mess will only continue to grow. Don't you think so?
This isn't anything new or a quirk. This is bug fixes to ensure that the driver works in (now hopefully) all configurations. Previously it was a matter of getting the configuration just right in order for it to work. All the work I did here (starting with the wide opcode support and then the DMA API and IOMMU work) was to make sure it would safely work in any setup.
And I do consider these changes to also be cleanups and incremental improvements of what the state was before. Again, I don't consider a rewrite a serious cleanup.
I'm fully aware that the driver has been collecting dust for a while and it isn't perfect. But it's also not overly messy. It's perhaps a bit more complex than your average driver, but it's also some pretty complex hardware.
...
That doesn't sound good to me; it's not going to be good for GPU drivers. All cache maintenance should be under userspace's control: userspace should tell the kernel driver when it needs CPU access and when it has finished that access. DMA-BUF has a generic UAPI for these synchronizations, although a mature GPU driver may need more than that.
I agree. But that's not something that we can do at this point. We don't have a way of passing information such as this to the driver, so the driver has to assume that caches are dirty for all buffers, otherwise it will not be able to guarantee that random cache flushes won't corrupt the data that it's passing to the hardware.
So yes, when we do have a way of explicitly flushing the caches for buffers, then we can add a mechanism to pass that information to the kernel so that it can optimize. But until then we just can't be sure.
It worked fine for the last 7 years on T20/30 and continues to work if the offending patches are reverted, no problems here.
And I prefer a kernel driver that gives me slow and reliable, rather than fast but unpredictable results.
Your changes introduced a regression on Tegra20/30 without solving any problem. That is unacceptable; it's not an improvement, please don't do it :)
Apparently some problem exists only for later Tegra generations, but I still can't tell what it is from your description.
Please give a concrete real-world example of the trouble you're trying to solve.
Today the Tegra DRM driver supports only write-combined BO allocations, and thus we don't need to do more than flush the CPU write buffers before executing a HW job.
That's only true when you allocate with the DMA API. When you allocate from a shmem mapping you don't get write-combined memory, so we do have to perform cache maintenance on the pages.
In all cases the GEM memory mappings are write-combined, including the case of get_pages(). The Host1x driver performs the maintenance by doing a wmb() before executing a job.
If you mean that imported buffers could be backed by cacheable pages, then userspace must use the DMA-BUF syncing UAPI or some other means to sync the CPU caches; there is no way around it. You should assume that your userspace is broken if it doesn't do that.
Please let me know if you're going to fix the problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2; it doesn't cover the case of imported buffers (which should be statically mapped, IIUC):
@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 	 * If we've manually mapped the buffer object through the IOMMU, make
 	 * sure to return the IOVA address of our mapping.
 	 */
-	if (phys && obj->mm) {
+	if (phys && (obj->mm || obj->vaddr)) {
 		*phys = obj->iova;
This doesn't work for the case where we use the DMA API for mapping. Or at least it isn't going to work in the general case.
Right, looks like I'll need to update my memory about the DMA API usage.
The reason is because obj->iova is only valid for whatever the device was that mapped or allocated the buffer, which in this case is the host1x device, which isn't even a real device, so it won't work. The only case where it does work is if we're not behind an IOMMU, so obj->iova will actually be the physical address.
But why do you need to dynamically map/unmap the statically allocated buffers on each job submission? Could you please explain the point? Perhaps it's a temporary workaround just to get a minimum of things working for the case of an implicit IOMMU?
It's primarily because we don't really know if a buffer has been mapped for a specific device. We always map at allocation time for the Tegra DRM parent device (which isn't a real device) but before it's used by any other host1x client, it has to be mapped for that device as well.
That's important in case any of these devices have different IOMMU domains.
This case should be covered by the solution I'm proposing below.
Today's mainline kernel has all clients in the same shared IOMMU domain. Please note that I'm focused on solving the primary regression problem; the potential future problems are secondary.
Actually, given that the device isn't a real device, the DMA handle returned from dma_alloc_wc() is actually a physical address and not valid for any device behind an IOMMU. So even in the case where we share an IOMMU domain among multiple devices, the mapping created by dma_alloc_wc() is useless for them.
What about picking any DRM sub-device and then statically mapping the DMA buffer for that device during BO allocation / import? All other DRM sub-devices can access that mapping because of the shared domain.
Because of the above and probably a bunch of other reasons, it's also a requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA API will start printing a bunch of errors if you violate those and they typically indicate that what you're doing may not work. That doesn't mean it can't work, but it usually only does so accidentally.
All buffers should be statically allocated and statically mapped, and when there is a need to sync an already mapped buffer, the dma_sync_* API should be used.
That's my understanding as well. However, it's slightly more complicated than that. Once you move away from the assumption that a mapping for a buffer is the same for all devices, you can no longer do that. For example, while the statically allocated buffer may be mapped for the Tegra DRM device (again, it's not a real device), it's not mapped for any of the clients yet.
Same as what I wrote above: just pick any DRM sub-device and map the buffer for that device during allocation / import.
So before a client can use it, its driver has to go and map the buffer for the device. The logical point to do that is during host1x_job_pin(). Once the client no longer has a use for the buffer it should also unmap the buffer again because it will otherwise occupy the IOVA space unnecessarily. The logical point to do that is during host1x_job_unpin().
The IOVA space problem really exists only for Tegra20.
I suppose the page table size is ~10-20 MB per 1 GB of allocations; this is not something to worry about.
host1x_job_unpin() is also the point at which the job releases its reference to the buffer, so the backing memory could go away at any point after that, which means that the IOVA mapping could point at invalid memory if we didn't unmap the buffer.
The backing memory can't go away before the buffer is unmapped; otherwise it's a refcounting bug.
I'm not aware of an easy way to optimize this while at the same time making sure that everything is still consistent. I suppose one way to do this would be to keep a cache of buffer objects and their mappings for each device and avoid mapping/unmapping them for every job. The problem with that is that we also don't want to hold on to buffer objects indefinitely because that will potentially cause a lot of memory and IOVA space to be used.
The case of explicit IOMMU is nicer because we can use DRM's MM scanner to maintain a limited window of IOMMU mappings. Take a look at https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi... for example.
An LRU cache should be easy to implement, and it should also be useful if we get around to supporting a memory shrinker (BO swap-out).
Like I said above, the syncing should be done by userspace for the buffers that are in control of userspace.
Yes, I agree. We already have an implementation of the .begin_cpu_access and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add something like that for native buffers. Alternatively, I suppose userspace could be required to flush/invalidate using the DMA-BUF, but that potentially has the drawback of having to export a DMA-BUF for every single buffer.

A better option may be to add a driver-specific IOCTL to do cache maintenance. I think other drivers already do that.
Perhaps, but it needs a real-world use-case first.
So what this basically ends up doing is avoid dma_map_*() all the time, which I guess is what you're trying to achieve. But it also gives you the wrong I/O virtual address in any case where an IOMMU is involved. Also, as discussed above, avoiding cache maintenance isn't correct.
Alright, then right now we need to bypass dma_map_*() in the case of a non-implicit IOMMU, in order to bring back the good old behaviour (at least temporarily, until there is a more comprehensive solution).
But it's not good old behaviour. You're still going to side-step the cache maintenance and violate the DMA API semantics.
The DMA API usage that you introduced in v5.5 is wrong too, because it makes the driver unusable :)
From a user's perspective it is exactly the good old behaviour. I tried to load KDE Plasma 5 using the Xorg Opentegra driver, which worked perfectly fine before, and now it is absolutely unusable.
There are a couple of things we could do in order to prevent this situation in the future:
1. Better automated testing. The grate-tests could be useful for this particular case, once they get support for offscreen testing.
2. Better review and real-world testing of the patches. Do you mind if I propose myself as a reviewer for the Tegra DRM / Host1x drivers?
I'm suggesting we take a step back and revert to the old behaviour for now, at least on T20/30. I don't know about ARM64, but ARM32 permits violating the DMA API semantics when an IOMMU is unavailable; of course that may change in the future, and by then I hope we'll manage to get the DMA API usage done right.
The solution below restores the old behaviour for the case where no IOMMU is available and for the case of an explicit IOMMU domain; it should work in the case of an implicit IOMMU as well.
What do you think about this variant:
--- >8 ---
diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 1237df157e05..555a6424e9fa 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 				     dma_addr_t *phys)
 {
 	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
+	struct tegra_drm *tegra = obj->gem.dev->dev_private;
 	struct sg_table *sgt;
 	int err;
 
-	/*
-	 * If we've manually mapped the buffer object through the IOMMU, make
-	 * sure to return the IOVA address of our mapping.
-	 */
-	if (phys && obj->mm) {
-		*phys = obj->iova;
-		return NULL;
+	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
+		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
+		if (obj->vaddr) {

If this is unacceptable on ARM64 (dma_handle != phys_addr), then it could be:

+		if (obj->vaddr && IS_ENABLED(CONFIG_ARM)) {

+			*phys = obj->iova;
+			return NULL;
+		}
+
+		/*
+		 * If we've manually mapped the buffer object through the
+		 * IOMMU, make sure to return the IOVA address of our mapping.
+		 */
+		if (obj->mm) {
+			*phys = obj->iova;
+			return NULL;
+		}
+	}
 
 	/*
diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
index 6d1872ddef17..91304b9034fa 100644
--- a/drivers/gpu/drm/tegra/plane.c
+++ b/drivers/gpu/drm/tegra/plane.c
@@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 	for (i = 0; i < state->base.fb->format->num_planes; i++) {
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
+		struct sg_table *sgt;
 
-		if (!dc->client.group) {
-			struct sg_table *sgt;
-
-			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
-			if (IS_ERR(sgt)) {
-				err = PTR_ERR(sgt);
-				goto unpin;
-			}
+		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
+		if (IS_ERR(sgt)) {
+			err = PTR_ERR(sgt);
+			goto unpin;
+		}
 
 		if (sgt) {
 			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (err == 0) {
@@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 			state->iova[i] = sg_dma_address(sgt->sgl);
 			state->sgt[i] = sgt;
-		} else {
-			state->iova[i] = bo->iova;
 		}
 	}
@@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
 		struct sg_table *sgt = state->sgt[i];
 
-		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
-		host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		if (sgt) {
+			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
+				     DMA_TO_DEVICE);
+			host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		}
 
 		state->iova[i] = DMA_MAPPING_ERROR;
 		state->sgt[i] = NULL;
diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 60b2fedd0061..538c0f0b40a4 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 	for (i = 0; i < job->num_relocs; i++) {
 		struct host1x_reloc *reloc = &job->relocs[i];
-		dma_addr_t phys_addr, *phys;
+		dma_addr_t phys_addr;
 		struct sg_table *sgt;
 
 		reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}
 
-		if (client->group)
-			phys = &phys_addr;
-		else
-			phys = NULL;
-
-		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
@@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 		job->num_unpins++;
 	}
 
+	/*
+	 * In a case of enabled firewall, all user gathers are copied into
+	 * the secured job->gather_copy_mapped.
+	 */
+	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
+		return 0;
+
 	for (i = 0; i < job->num_gathers; i++) {
 		struct host1x_job_gather *g = &job->gathers[i];
 		size_t gather_size = 0;
@@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}
 
-		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
 		}
 
-		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
+		if (host->domain) {
 			for_each_sg(sgt->sgl, sg, sgt->nents, j)
 				gather_size += sg->length;
 			gather_size = iova_align(&host->iova, gather_size);
@@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			job->unpins[job->num_unpins].size = gather_size;
 			phys_addr = iova_dma_addr(&host->iova, alloc);
-		} else {
+		} else if (sgt) {
 			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (!err) {
--- >8 ---
On Mon, Feb 03, 2020 at 12:00:49AM +0300, Dmitry Osipenko wrote:
30.01.2020 15:08, Thierry Reding wrote: ...
Why not set the DMA mask to 32 bits if the IOMMU is unavailable?
We already do that. If you look at host1x_iommu_init() in drivers/gpu/host1x/dev.c, you'll see that when no IOMMU support is available and the host1x doesn't support wide GATHER opcodes, then we limit the DMA Mask to 32 bits.
But that's not enough, see below.
I'm a bit puzzled by the actual need to support the case where Host1x is backed by an IOMMU and the clients are not. How could we ever end up with this scenario in the upstream kernel?
That's not what we're doing here. The fundamental problem is that we have a couple of generations where the hardware is mismatched in that clients support 34-bit addresses while host1x can only use 32-bit addresses in the GATHER opcode. The only way to get around this mismatch is by using an IOMMU.
Isn't it possible to limit the DMA mask to 32 bits for both the clients and Host1x?
Many things are possible, but I'm trying to find the best and most intuitive way of encoding these restrictions in the driver. The problem isn't so much that the hardware can only access 32 bits, because the memory interface is wider. It's more that the platform requires 32-bit addresses because of integration mismatches like the above.
diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 6a995db51d6d..4253dd53ea2e 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -190,12 +190,28 @@ static void host1x_subdev_unregister(struct host1x_device *device,
  */
 int host1x_device_init(struct host1x_device *device)
 {
+	struct host1x *host = dev_get_drvdata(device->dev.parent);
+	struct iommu_domain *domain = iommu_get_domain_for_dev(host->dev);
 	struct host1x_client *client;
+	u64 mask;
 	int err;
 
 	mutex_lock(&device->clients_lock);
 
 	list_for_each_entry(client, &device->clients, list) {
+		mask = dma_get_mask(client->dev);
+
+		if (!domain && !host->info->has_wide_gather &&
+		    mask > DMA_BIT_MASK(32)) {
+			err = dma_coerce_mask_and_coherent(client->dev,
+							   DMA_BIT_MASK(32));
+			if (err < 0) {
+				dev_err(&device->dev,
+					"failed to set DMA mask: %d\n", err);
+				goto teardown;
+			}
+		}
+
 		if (client->ops && client->ops->init) {
 			err = client->ops->init(client);
 			if (err < 0) {
So yes, something like this would probably work as well, but it's not an accurate description of what's going on, in my opinion.
However, with an IOMMU enabled for clients, we can run into the case where sparse pages would be allocated via shmem and end up beyond the 32-bit boundary. If the host1x is not attached to an IOMMU, there's no way for it to access these pages with standard GATHER opcodes.
This is what used to happen prior to this change when the host1x firewall was enabled.
I assume you mean the *disabled* firewall, because gathers are copied into Host1x's DMA buffer if the firewall is enabled.
Since we were not attaching it to an IOMMU in that case, we would end up with sparse buffers allocated from pages that the host1x couldn't address.
Could you please clarify what you mean by shmem? If you mean get_pages(), then one option would be to simply enforce the Host1x firewall so that gathers are always copied, and thus we could remove the TEGRA_HOST1X_FIREWALL Kconfig option.
The firewall is useful for other things than just copying the memory. That said, I agree that always copying gathers doesn't sound like a bad idea.
The other option could be *not* to use get_pages() whenever it can't work.
That's basically what we're doing. drm_gem_get_pages() is only used when IOMMU support is enabled. Otherwise the DMA API will be used and that will respect the DMA mask that was previously configured.
So another way you could think of this is that we set the logical host1x device's DMA mask to the intersection of the DMA masks for all the sub- devices. That ensures that all devices can access the memory allocated by the DMA API.
It should also be worthwhile to simply restrict the kernel configuration by disabling all functionality that requires an IOMMU when one is not available. Just error out with an informative message in that case, saying that the kernel/device-tree configuration is wrong.
What we currently do is to gracefully fall back to CMA allocations if the IOMMU functionality isn't there. That sounds like a better option to me than to fail and request people to rebuild the kernel.
Also note that we don't usually know at compile-time whether or not an IOMMU is enabled.
Please notice that the grate-drivers do not work with the firewall disabled anyway, because it's simply impractical/unusable from a userspace perspective to track the liveness of the gather BO.
Um... that's certainly a little strange. There's never going to be a reliable way for userspace to detect whether or not the firewall is enabled. Userspace also shouldn't care, because nothing in the UABI should in any way indicate the presence of the firewall.
What exactly makes it impractical/unusable to track liveness of a gather BO? The buffer can't go away until you destroy the GEM object, so it sounds rather simple to me. As for ownership, userspace needs to consider the gather buffer as owned by the kernel/hardware until the job to which it was submitted has finished running.
Also, take a look at the grate-kernel's WIP patch where Host1x is separated from Tegra DRM and Host1x maintains the gather buffer allocations by itself. In the case of the legacy UAPI, the gather commands are copied from the DRM GEM to Host1x's own BO, similarly to what we have now in upstream for the case of the enabled firewall, but in a more optimized manner (using a pre-allocated pool, etc.). I'd like to see the upstream driver doing the same if we don't end up dropping the legacy staging UAPI entirely.
Do you have any pointers?
What about the reverse scenario? You won't be able to patch cmdstream properly for >32bit addresses.
I don't think that scenario exists. I'm not aware of a Tegra device that has system memory outside of the CPU-addressable region.
The root of the problem is that Tegra DRM UAPI doesn't support 64bit addresses, so you can't use "wide" opcodes and can't patch cmdstream.
There's nothing in the UAPI that deals with addresses directly. We only pass around handles and these are resolved to buffer objects in the kernel where the address of the buffers can be 32-bit or 64-bit.
If a buffer address is 64-bit, then the client's address registers take two 32-bit hardware registers, and thus the cmdstream should reserve two words for the patched handles.
The Host1x driver doesn't support patching of 64-bit addresses, and there is no way to tell userspace to reserve two words per address using the current UAPI.
That's only partially true. Yes, we don't support patching of 64-bit addresses, but the reason is that we don't have to. On SoC generations where the memory interface is wider than 32 bits, the host1x clients use a relocation shift (typically of 8 bits) to accommodate the 32-bit width restriction of the registers.
We currently deal with that in the ABI and I'm honestly not sure that's a good solution. But it works perfectly fine with 64-bit BO addresses.
And we do in fact support wide opcodes and patch command streams just fine on 64-bit systems.
I mean, it's not like I've been doing this just for the fun of it. There are actual configurations where this is needed in order for it to work.
Perhaps it is better not to add any new things or quirks to Host1x / Tegra DRM for now. The drivers need a serious clean-up, otherwise the mess will only continue to grow. Don't you think so?
This isn't anything new or a quirk. These are bug fixes to ensure that the driver works in (now hopefully) all configurations. Previously it was a matter of getting the configuration just right in order for it to work. All the work I did here (starting with the wide opcode support and then the DMA API and IOMMU work) was to make sure it would safely work in any setup.
And I do consider these changes to also be cleanups and incremental improvements of what the state was before. Again, I don't consider a rewrite a serious cleanup.
I'm fully aware that the driver has been collecting dust for a while and it isn't perfect. But it's also not overly messy. It's perhaps a bit more complex than your average driver, but it's also some pretty complex hardware.
...
That doesn't sound good to me; it's not going to be good for GPU drivers. All cache maintenance should be under userspace's control: userspace should tell the kernel driver when it needs CPU access and when it has finished that access. DMA-BUF has a generic UAPI for these synchronizations, although a mature GPU driver may need more than that.
I agree. But that's not something that we can do at this point. We don't have a way of passing information such as this to the driver, so the driver has to assume that caches are dirty for all buffers, otherwise it will not be able to guarantee that random cache flushes won't corrupt the data that it's passing to the hardware.
So yes, when we do have a way of explicitly flushing the caches for buffers, then we can add a mechanism to pass that information to the kernel so that it can optimize. But until then we just can't be sure.
It worked fine for the last 7 years on T20/30 and continues to work if the offending patches are reverted, no problems here.
I do recall problems with cache maintenance during early development. Note that I've never seen these issues when using only the DMA API. But with explicit IOMMU usage there would typically be short horizontal lines that were the wrong color because CPU caches hadn't been properly flushed. We used to work around that by using dma_sync_sg_for_device(), but that causes the DMA API to report abuses in debug mode.
Again, back at the time I was working on Tegra20 and Tegra30, and I was definitely seeing those cache-related issues there as well.
And I prefer a kernel driver that gives me slow and reliable, rather than fast but unpredictable results.
Your changes introduced a regression on Tegra20/30 without solving any problem. That is unacceptable; it's not an improvement, please don't do it :)
It's perhaps a performance regression, but functionally it should still be working fine. And see above for the issues that this solves.
Now, it may be that there are some cases where cache maintenance at this level is not necessary, but keep in mind that we have a generic driver that needs to work properly on many different setups at the same time. A mechanism like the DMA API debug facilities is there for a reason. It points out when you take shortcuts that may only work under some circumstances but break terribly under others.
So I consider this an improvement in that it ensures that things always work correctly, regardless of any assumptions.
Apparently some problem exists only for later Tegra generations, but I still can't tell what it is from your description.
Please give a concrete real-world example of the trouble you're trying to solve.
See above. And, again, this is a problem that also exists on earlier generations. I /think/ I've never seen cache-related issues when the IOMMU is disabled, and that's probably because we have write-combine mappings in that case. What you're saying seems to confirm that.
Today the Tegra DRM driver supports only write-combined BO allocations, and thus we don't need to do more than flush the CPU write buffers before executing a HW job.
That's only true when you allocate with the DMA API. When you allocate from a shmem mapping you don't get write-combined memory, so we do have to perform cache maintenance on the pages.
In all cases the GEM memory mappings are write-combined, including the case of get_pages(). The Host1x driver performs the maintenance by doing a wmb() before executing a job.
No, that's just not true. You don't get write-combined mappings for drm_gem_get_pages(). The wmb() doesn't help in that case.
If you mean that imported buffers could be backed by cacheable pages, then userspace must use the DMA-BUF syncing UAPI or some other means to sync the CPU caches; there is no way around it. You should assume that your userspace is broken if it doesn't do that.
I don't think it's supposed to work that way. Again, in practice this may work fine most of the time. However, in the more general case it may not.
The DMA-BUF sync UABI is really all about userspace synchronizing CPU accesses with the exporter of the DMA-BUF. So, yes, if userspace makes modifications to the buffer, it needs to tell the exporter about it with the sync IOCTL.
But importers of the DMA-BUF aren't automatically on the safe side. So there could be cases where an importer needs to do additional cache maintenance in order to properly use a given buffer.
I'm not aware of any cases like that on Tegra, but that shouldn't be an excuse for taking shortcuts. We should still be using the APIs in the way they were intended. If they end up doing too much or even the wrong thing, then I think we need to address that in their implementation and not by abusing the API.
Please let me know if you're going to fix the problems or if you'd prefer me to create the patches.
Here is a draft of the fix for #2; it doesn't cover the case of imported buffers (which should be statically mapped, IIUC):
@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 	 * If we've manually mapped the buffer object through the IOMMU, make
 	 * sure to return the IOVA address of our mapping.
 	 */
-	if (phys && obj->mm) {
+	if (phys && (obj->mm || obj->vaddr)) {
 		*phys = obj->iova;
This doesn't work for the case where we use the DMA API for mapping. Or at least it isn't going to work in the general case.
Right, looks like I'll need to update my memory about the DMA API usage.
The reason is because obj->iova is only valid for whatever the device was that mapped or allocated the buffer, which in this case is the host1x device, which isn't even a real device, so it won't work. The only case where it does work is if we're not behind an IOMMU, so obj->iova will actually be the physical address.
But why do you need to dynamically map/unmap the statically allocated buffers on each job submission? Could you please explain the point? Perhaps it's a temporary workaround just to get a minimum of things working for the case of an implicit IOMMU?
It's primarily because we don't really know if a buffer has been mapped for a specific device. We always map at allocation time for the Tegra DRM parent device (which isn't a real device) but before it's used by any other host1x client, it has to be mapped for that device as well.
That's important in case any of these devices have different IOMMU domains.
This case should be covered by the solution I'm proposing below.
Today's mainline kernel has all clients in the same shared IOMMU domain. Please note that I'm focused on solving the primary regression problem; the potential future problems are secondary.
Again, that's only true up to and including Tegra210. For Tegra186 and Tegra194 that's no longer possible.
Actually, given that the device isn't a real device, the DMA handle returned from dma_alloc_wc() is actually a physical address and not valid for any device behind an IOMMU. So even in the case where we share an IOMMU domain among multiple devices, the mapping created by dma_alloc_wc() is useless for them.
What about picking any DRM sub-device and then statically mapping the DMA buffer for that device during BO allocation / import? All other DRM sub-devices can access that mapping because of the shared domain.
Again, this only works if all devices share a common IOMMU domain. Since that's not the case we can't do it.
Because of the above and probably a bunch of other reasons, it's also a requirement of the DMA API. If you enable CONFIG_DMA_API_DEBUG, the DMA API will start printing a bunch of errors if you violate those and they typically indicate that what you're doing may not work. That doesn't mean it can't work, but it usually only does so accidentally.
All buffers should be statically allocated and statically mapped, and when there is a need to sync an already mapped buffer, the dma_sync_* API should be used.
That's my understanding as well. However, it's slightly more complicated than that. Once you move away from the assumption that a mapping for a buffer is the same for all devices, you can no longer do that. For example, while the statically allocated buffer may be mapped for the Tegra DRM device (again, it's not a real device), it's not mapped for any of the clients yet.
Same as what I wrote above: just pick any DRM sub-device and map the buffer for that device during allocation / import.
That won't work in the general case where we have separate IOMMU domains, because you need to map the buffer once for each domain.
So before a client can use it, its driver has to go and map the buffer for the device. The logical point to do that is during host1x_job_pin(). Once the client no longer has a use for the buffer it should also unmap the buffer again because it will otherwise occupy the IOVA space unnecessarily. The logical point to do that is during host1x_job_unpin().
The IOVA space problem really exists only for Tegra20.
I suppose the page table size is ~10-20 MB per 1 GB of allocations; this is not something to worry about.
The IOVA space problem is going to be real for other devices once we start leaking IOVA memory addresses.
host1x_job_unpin() is also the point at which the job releases its reference to the buffer, so the backing memory could go away at any point after that, which means that the IOVA mapping could point at invalid memory if we didn't unmap the buffer.
Backing memory can't go away before the buffer is unmapped; otherwise that's a refcounting bug.
Right. However, my point was that if you don't unmap the buffer at host1x_job_unpin() time, when would you unmap it? It's really at unpin time that you know the buffer won't be used anymore. It might be used again in the future, but the driver doesn't know that. Similarly, the driver only knows at host1x_job_pin() time that a buffer will be used, and by what device. It's not possible to predict ahead of time when a buffer will be used, by which device and for how long.
I'm not aware of an easy way to optimize this while at the same time making sure that everything is still consistent. I suppose one way to do this would be to keep a cache of buffer objects and their mappings for each device and avoid mapping/unmapping them for every job. The problem with that is that we also don't want to hold on to buffer objects indefinitely because that will potentially cause a lot of memory and IOVA space to be used.
The case of explicit IOMMU is nicer because we can use DRM's MM scanner to maintain a limited window of IOMMU mappings. Take a look at https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi... for example.
An LRU cache should be easy to implement, and it should also be useful if we get around to supporting a memory shrinker (BO swap-out).
Like I said above, the syncing should be done by userspace for the buffers that are in control of userspace.
Yes, I agree. We already have an implementation of the .begin_cpu_access and .end_cpu_access callbacks for DMA-BUF, so it should be easy to add something like that for native buffers. Alternatively, I suppose userspace could be required to flush/invalidate using the DMA-BUF, but that potentially has the drawback of having to export a DMA-BUF for every single buffer. A better option may be to add a driver-specific IOCTL to do cache maintenance. I think other drivers already do that.
Perhaps, but it needs a real-world use-case first.
So what this basically ends up doing is avoiding dma_map_*() all the time, which I guess is what you're trying to achieve. But it also gives you the wrong I/O virtual address in any case where an IOMMU is involved. Also, as discussed above, avoiding cache maintenance isn't correct.
Alright, then right now we need to bypass dma_map_*() in the case of a non-implicit IOMMU, in order to bring back the good old behavior (at least temporarily, until there is a more comprehensive solution).
But it's not the good old behaviour. You're still going to side-step the cache maintenance and violate the DMA API semantics.
The DMA API usage that you introduced in v5.5 is wrong too, because it makes the driver unusable :)
No, it's not wrong. It does the right thing from an API point of view. If it causes a performance regression that's because previously we did too little and just got lucky that it still worked. So I think we need to find out why exactly we could get away with how things worked before and find a way to replicate that behaviour while still allowing the API to be correctly used.
So if for whatever reason cache maintenance isn't necessary in this case, can we make the DMA API not perform the cache maintenance? At this point, I'm not even sure that cache maintenance is the issue. We'll need to run some more tests to find out.
From a user's perspective it is exactly the good old behaviour. I tried to load KDE Plasma 5 using the Xorg Opentegra driver, which worked perfectly fine before, and now it is absolutely unusable.
There are a couple of things we could do in order to prevent this situation in the future:
- Better automated testing. The grate-tests could be useful for this particular case, once they get support for offscreen testing.
Definitely. Right now we don't really have a good way of testing at all. We also don't have a good way for benchmarking tests. All of that would be a massive improvement.
FWIW, I've been holding back on the VIC test cases for libdrm because I didn't want to port them to the old ABI, assuming we could fairly quickly settle on a new ABI and get that merged. However, looking at the current state of affairs, it may be worth porting them over to using the old ABI so that we can also run tests on Tegra124 and later where grate can't be supported anymore.
- Better review and real-world testing of the patches. Would you mind if I propose myself as a reviewer for the Tegra DRM / Host1x drivers?
You already review most of the patches anyway. And it would've been pretty hard to find out during review that these patches would cause a performance regression like this. To be honest, I'm still surprised that the impact is this big, since I'm not seeing any impact at all doing things like modeset -v on something like Tegra210, Tegra186 or Tegra194.
Granted, that's probably not the best test case since we're not really doing any cache maintenance (hopefully) after the first two page-flips. However, it does entail mapping/unmapping a 32 MiB framebuffer to/from the IOMMU per frame, which should account for quite a bit of work as well.
I guess that if you do a lot of updates from userspace and have to flush the cache a lot for every frame that could have some impact. But you were saying the difference was going from 100+ to 10- frames per second, which seems a bit much for *only* cache maintenance. Sounds to me like something else is going on here.
I'm suggesting to take a step back and revert to the old behaviour for now, at least on T20/T30. I don't know about ARM64, but ARM32 permits violating the DMA API semantics when an IOMMU is unavailable; of course that may change in the future, and by then I hope we'll manage to get the DMA API usage right.
Okay, let's see if we can temporarily restore the performance on older devices. But I'd also like to eventually figure out why it's even regressing on performance that badly.
Thierry
The solution below restores the old behavior for the cases of an unavailable IOMMU and an explicit IOMMU domain; it should work in the case of an implicit IOMMU as well.
What do you think about this variant:
--- >8 ---
diff --git a/drivers/gpu/drm/tegra/gem.c b/drivers/gpu/drm/tegra/gem.c
index 1237df157e05..555a6424e9fa 100644
--- a/drivers/gpu/drm/tegra/gem.c
+++ b/drivers/gpu/drm/tegra/gem.c
@@ -54,16 +54,25 @@ static struct sg_table *tegra_bo_pin(struct device *dev, struct host1x_bo *bo,
 				     dma_addr_t *phys)
 {
 	struct tegra_bo *obj = host1x_to_tegra_bo(bo);
+	struct tegra_drm *tegra = obj->gem.dev->dev_private;
 	struct sg_table *sgt;
 	int err;
 
-	/*
-	 * If we've manually mapped the buffer object through the IOMMU, make
-	 * sure to return the IOVA address of our mapping.
-	 */
-	if (phys && obj->mm) {
-		*phys = obj->iova;
-		return NULL;
+	if (phys && iommu_get_domain_for_dev(dev) == tegra->domain) {
+		/* if IOMMU isn't available, return IOVA=PHYS of the mapping */
+		if (obj->vaddr) {

If this is unacceptable on ARM64 (dma_handle != phys_addr), then it could be:

+		if (obj->vaddr && IS_ENABLED(CONFIG_ARM)) {

+			*phys = obj->iova;
+			return NULL;
+		}
+
+		/*
+		 * If we've manually mapped the buffer object through the
+		 * IOMMU, make sure to return the IOVA address of our mapping.
+		 */
+		if (obj->mm) {
+			*phys = obj->iova;
+			return NULL;
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/tegra/plane.c b/drivers/gpu/drm/tegra/plane.c
index 6d1872ddef17..91304b9034fa 100644
--- a/drivers/gpu/drm/tegra/plane.c
+++ b/drivers/gpu/drm/tegra/plane.c
@@ -97,16 +97,15 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 
 	for (i = 0; i < state->base.fb->format->num_planes; i++) {
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
+		struct sg_table *sgt;
 
-		if (!dc->client.group) {
-			struct sg_table *sgt;
-
-			sgt = host1x_bo_pin(dc->dev, &bo->base, NULL);
-			if (IS_ERR(sgt)) {
-				err = PTR_ERR(sgt);
-				goto unpin;
-			}
+		sgt = host1x_bo_pin(dc->dev, &bo->base, &state->iova[i]);
+		if (IS_ERR(sgt)) {
+			err = PTR_ERR(sgt);
+			goto unpin;
+		}
 
+		if (sgt) {
 			err = dma_map_sg(dc->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (err == 0) {
@@ -127,8 +126,6 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 
 			state->iova[i] = sg_dma_address(sgt->sgl);
 			state->sgt[i] = sgt;
-		} else {
-			state->iova[i] = bo->iova;
 		}
 	}
 
@@ -141,8 +138,11 @@ static int tegra_dc_pin(struct tegra_dc *dc, struct tegra_plane_state *state)
 		struct tegra_bo *bo = tegra_fb_get_plane(state->base.fb, i);
 		struct sg_table *sgt = state->sgt[i];
 
-		dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents, DMA_TO_DEVICE);
-		host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		if (sgt) {
+			dma_unmap_sg(dc->dev, sgt->sgl, sgt->nents,
+				     DMA_TO_DEVICE);
+			host1x_bo_unpin(dc->dev, &bo->base, sgt);
+		}
 
 		state->iova[i] = DMA_MAPPING_ERROR;
 		state->sgt[i] = NULL;
diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 60b2fedd0061..538c0f0b40a4 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 
 	for (i = 0; i < job->num_relocs; i++) {
 		struct host1x_reloc *reloc = &job->relocs[i];
-		dma_addr_t phys_addr, *phys;
+		dma_addr_t phys_addr;
 		struct sg_table *sgt;
 
 		reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}
 
-		if (client->group)
-			phys = &phys_addr;
-		else
-			phys = NULL;
-
-		sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+		sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
@@ -168,6 +163,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 		job->num_unpins++;
 	}
 
+	/*
+	 * In a case of enabled firewall, all user gathers are copied into
+	 * the secured job->gather_copy_mapped.
+	 */
+	if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))
+		return 0;
+
 	for (i = 0; i < job->num_gathers; i++) {
 		struct host1x_job_gather *g = &job->gathers[i];
 		size_t gather_size = 0;
@@ -184,13 +186,13 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			goto unpin;
 		}
 
-		sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+		sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
 		if (IS_ERR(sgt)) {
 			err = PTR_ERR(sgt);
 			goto unpin;
 		}
 
-		if (!IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) && host->domain) {
+		if (host->domain) {
 			for_each_sg(sgt->sgl, sg, sgt->nents, j)
 				gather_size += sg->length;
 			gather_size = iova_align(&host->iova, gather_size);
@@ -214,7 +216,7 @@ static unsigned int pin_job(struct host1x *host, struct host1x_job *job)
 			job->unpins[job->num_unpins].size = gather_size;
 			phys_addr = iova_dma_addr(&host->iova, alloc);
-		} else {
+		} else if (sgt) {
 			err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
 					 DMA_TO_DEVICE);
 			if (!err) {
--- >8 ---