[PATCH v3 00/16] drm/tegra: Fix IOVA space on Tegra186 and later

List overview All Threads
Download

newer

older

[PATCH -next] drm/nouveau: fix...

[PULL] drm-misc-next

Thierry Reding

1 Feb 2019 1 Feb '19

1:28 p.m.

From: Thierry Reding treding@nvidia.com

Tegra186 and later are different from earlier generations in that they use an ARM SMMU rather than the Tegra SMMU. The ARM SMMU driver behaves slightly differently in that the geometry for IOMMU domains is set only after a device was attached to it. This is to make sure that the SMMU instance that the domain belongs to is known, because each instance can have a different input address space (i.e. geometry).

Work around this by moving all IOVA allocations to a point where the geometry of the domain is properly initialized.

This second version of the series addresses all review comments and adds a number of patches that will actually allow host1x to work with an SMMU enabled on Tegra186. The patches also add programming required to address the full 40 bits of address space.

The third version of the series fixes the 40-bit addressing code by making sure that wide opcodes are always written atomically to the push buffer. Another pair of patches are introduced to fix a long-standing bug where the HOST1X_CHANNEL_DMAEND register wasn't properly programmed and one push buffer memory optimization that avoid wasting almost one full memory page per push buffer.

This supersedes the following patch:

https://patchwork.kernel.org/patch/10775579/

Thierry

Thierry Reding (16): gpu: host1x: Set up stream ID table gpu: host1x: Program the channel stream ID gpu: host1x: Introduce support for wide opcodes gpu: host1x: Support 40-bit addressing gpu: host1x: Use direct DMA with IOMMU API usage gpu: host1x: Restrict IOVA space to DMA mask gpu: host1x: Support 40-bit addressing on Tegra186 gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND gpu: host1x: Optimize CDMA push buffer memory usage drm/tegra: Store parent pointer in Tegra DRM clients drm/tegra: vic: Load firmware on demand drm/tegra: Setup shared IOMMU domain after initialization drm/tegra: Restrict IOVA space to DMA mask drm/tegra: vic: Do not clear driver data drm/tegra: vic: Support stream ID register programming arm64: tegra: Enable SMMU for VIC on Tegra186

arch/arm64/boot/dts/nvidia/tegra186.dtsi | 1 + drivers/gpu/drm/tegra/drm.c | 57 +++++---- drivers/gpu/drm/tegra/drm.h | 1 + drivers/gpu/drm/tegra/vic.c | 75 ++++++++--- drivers/gpu/drm/tegra/vic.h | 9 ++ drivers/gpu/host1x/cdma.c | 132 ++++++++++++++++++-- drivers/gpu/host1x/cdma.h | 2 + drivers/gpu/host1x/dev.c | 49 +++++++- drivers/gpu/host1x/dev.h | 8 ++ drivers/gpu/host1x/hw/cdma_hw.c | 32 ++++- drivers/gpu/host1x/hw/channel_hw.c | 42 ++++++- drivers/gpu/host1x/hw/host1x06_hardware.h | 6 + drivers/gpu/host1x/hw/host1x07_hardware.h | 6 + drivers/gpu/host1x/hw/hw_host1x06_channel.h | 11 ++ drivers/gpu/host1x/hw/hw_host1x07_channel.h | 11 ++ include/trace/events/host1x.h | 26 ++++ 16 files changed, 404 insertions(+), 64 deletions(-) create mode 100644 drivers/gpu/host1x/hw/hw_host1x06_channel.h create mode 100644 drivers/gpu/host1x/hw/hw_host1x07_channel.h

-- 2.19.1

Show replies by date

Thierry Reding

1 Feb 1 Feb

1:28 p.m.

New subject: [PATCH v3 01/16] gpu: host1x: Set up stream ID table

From: Thierry Reding treding@nvidia.com

In order to enable the MMIO path stream ID protection provided by the incarnation of host1x found in Tegra186 and later, the host1x must be provided with the list of stream ID register offsets for each of its clients. Some clients (such as VIC) have multiple stream ID registers that are assumed to be contiguous. The host1x is programmed with the base offset and a limit which provide the range of registers that the host1x needs to monitor for writes.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/dev.c | 38 ++++++++++++++++++++++++++++++++++++++ drivers/gpu/host1x/dev.h | 8 ++++++++ 2 files changed, 46 insertions(+)

diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c index 419d8929a98f..4c044ee54fe6 100644 --- a/drivers/gpu/host1x/dev.c +++ b/drivers/gpu/host1x/dev.c @@ -120,6 +120,15 @@ static const struct host1x_info host1x05_info = { .dma_mask = DMA_BIT_MASK(34), };

+static const struct host1x_sid_entry tegra186_sid_table[] = { + { + /* VIC */ + .base = 0x1af0, + .offset = 0x30, + .limit = 0x34 + }, +}; + static const struct host1x_info host1x06_info = { .nb_channels = 63, .nb_pts = 576, @@ -129,6 +138,17 @@ static const struct host1x_info host1x06_info = { .sync_offset = 0x0, .dma_mask = DMA_BIT_MASK(34), .has_hypervisor = true, + .num_sid_entries = ARRAY_SIZE(tegra186_sid_table), + .sid_table = tegra186_sid_table, +}; + +static const struct host1x_sid_entry tegra194_sid_table[] = { + { + /* VIC */ + .base = 0x1af0, + .offset = 0x30, + .limit = 0x34 + }, };

static const struct host1x_info host1x07_info = { @@ -140,6 +160,8 @@ static const struct host1x_info host1x07_info = { .sync_offset = 0x0, .dma_mask = DMA_BIT_MASK(40), .has_hypervisor = true, + .num_sid_entries = ARRAY_SIZE(tegra194_sid_table), + .sid_table = tegra194_sid_table, };

static const struct of_device_id host1x_of_match[] = { @@ -154,6 +176,19 @@ static const struct of_device_id host1x_of_match[] = { }; MODULE_DEVICE_TABLE(of, host1x_of_match);

+static void host1x_setup_sid_table(struct host1x *host) +{ + const struct host1x_info *info = host->info; + unsigned int i; + + for (i = 0; i < info->num_sid_entries; i++) { + const struct host1x_sid_entry *entry = &info->sid_table[i]; + + host1x_hypervisor_writel(host, entry->offset, entry->base); + host1x_hypervisor_writel(host, entry->limit, entry->base + 4); + } +} + static int host1x_probe(struct platform_device *pdev) { struct host1x *host; @@ -316,6 +351,9 @@ static int host1x_probe(struct platform_device *pdev)

host1x_debug_init(host);

+ if (host->info->has_hypervisor) + host1x_setup_sid_table(host); + err = host1x_register(host); if (err < 0) goto fail_deinit_intr; diff --git a/drivers/gpu/host1x/dev.h b/drivers/gpu/host1x/dev.h index 36f44ffebe73..05216a7e4830 100644 --- a/drivers/gpu/host1x/dev.h +++ b/drivers/gpu/host1x/dev.h @@ -94,6 +94,12 @@ struct host1x_intr_ops { int (*free_syncpt_irq)(struct host1x *host); };

+struct host1x_sid_entry { + unsigned int base; + unsigned int offset; + unsigned int limit; +}; + struct host1x_info { unsigned int nb_channels; /* host1x: number of channels supported */ unsigned int nb_pts; /* host1x: number of syncpoints supported */ @@ -103,6 +109,8 @@ struct host1x_info { unsigned int sync_offset; /* offset of syncpoint registers */ u64 dma_mask; /* mask of addressable memory */ bool has_hypervisor; /* has hypervisor registers */ + unsigned int num_sid_entries; + const struct host1x_sid_entry *sid_table; };

struct host1x {

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 02/16] gpu: host1x: Program the channel stream ID

From: Thierry Reding treding@nvidia.com

When processing command streams, make sure the host1x's stream ID is programmed for the channel so that addresses are properly translated through the SMMU.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/hw/channel_hw.c | 12 ++++++++++++ drivers/gpu/host1x/hw/host1x06_hardware.h | 1 + drivers/gpu/host1x/hw/host1x07_hardware.h | 1 + drivers/gpu/host1x/hw/hw_host1x06_channel.h | 11 +++++++++++ drivers/gpu/host1x/hw/hw_host1x07_channel.h | 11 +++++++++++ 5 files changed, 36 insertions(+) create mode 100644 drivers/gpu/host1x/hw/hw_host1x06_channel.h create mode 100644 drivers/gpu/host1x/hw/hw_host1x07_channel.h

diff --git a/drivers/gpu/host1x/hw/channel_hw.c b/drivers/gpu/host1x/hw/channel_hw.c index 95ea81172a83..384f6ac91afa 100644 --- a/drivers/gpu/host1x/hw/channel_hw.c +++ b/drivers/gpu/host1x/hw/channel_hw.c @@ -89,6 +89,16 @@ static inline void synchronize_syncpt_base(struct host1x_job *job) HOST1X_UCLASS_LOAD_SYNCPT_BASE_VALUE_F(value)); }

+static void host1x_channel_set_streamid(struct host1x_channel *channel) +{ +#if HOST1X_HW >= 6 + struct iommu_fwspec *spec = dev_iommu_fwspec_get(channel->dev->parent); + u32 sid = spec->ids[0] & 0xffff; + + host1x_ch_writel(channel, sid, HOST1X_CHANNEL_SMMU_STREAMID); +#endif +} + static int channel_submit(struct host1x_job *job) { struct host1x_channel *ch = job->channel; @@ -120,6 +130,8 @@ static int channel_submit(struct host1x_job *job) goto error; }

+ host1x_channel_set_streamid(ch); + /* begin a CDMA submit */ err = host1x_cdma_begin(&ch->cdma, job); if (err) { diff --git a/drivers/gpu/host1x/hw/host1x06_hardware.h b/drivers/gpu/host1x/hw/host1x06_hardware.h index 3039c92ea605..eab753b91f24 100644 --- a/drivers/gpu/host1x/hw/host1x06_hardware.h +++ b/drivers/gpu/host1x/hw/host1x06_hardware.h @@ -22,6 +22,7 @@ #include <linux/types.h> #include <linux/bitops.h>

+#include "hw_host1x06_channel.h" #include "hw_host1x06_uclass.h" #include "hw_host1x06_vm.h" #include "hw_host1x06_hypervisor.h" diff --git a/drivers/gpu/host1x/hw/host1x07_hardware.h b/drivers/gpu/host1x/hw/host1x07_hardware.h index 1353e7ab71dd..a79f57dc87bb 100644 --- a/drivers/gpu/host1x/hw/host1x07_hardware.h +++ b/drivers/gpu/host1x/hw/host1x07_hardware.h @@ -22,6 +22,7 @@ #include <linux/types.h> #include <linux/bitops.h>

+#include "hw_host1x07_channel.h" #include "hw_host1x07_uclass.h" #include "hw_host1x07_vm.h" #include "hw_host1x07_hypervisor.h" diff --git a/drivers/gpu/host1x/hw/hw_host1x06_channel.h b/drivers/gpu/host1x/hw/hw_host1x06_channel.h new file mode 100644 index 000000000000..18ae1c57bbea --- /dev/null +++ b/drivers/gpu/host1x/hw/hw_host1x06_channel.h @@ -0,0 +1,11 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 NVIDIA Corporation. + */ + +#ifndef HOST1X_HW_HOST1X06_CHANNEL_H +#define HOST1X_HW_HOST1X06_CHANNEL_H + +#define HOST1X_CHANNEL_SMMU_STREAMID 0x084 + +#endif diff --git a/drivers/gpu/host1x/hw/hw_host1x07_channel.h b/drivers/gpu/host1x/hw/hw_host1x07_channel.h new file mode 100644 index 000000000000..96fa72bbd7ab --- /dev/null +++ b/drivers/gpu/host1x/hw/hw_host1x07_channel.h @@ -0,0 +1,11 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2019 NVIDIA Corporation. + */ + +#ifndef HOST1X_HW_HOST1X07_CHANNEL_H +#define HOST1X_HW_HOST1X07_CHANNEL_H + +#define HOST1X_CHANNEL_SMMU_STREAMID 0x084 + +#endif

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 03/16] gpu: host1x: Introduce support for wide opcodes

From: Thierry Reding treding@nvidia.com

The CDMA push buffer can currently only handle opcodes that take a single word parameter. However, the host1x implementation on Tegra186 and later supports opcodes that require multiple words as parameters.

Unfortunately the way the push buffer is structured, these wide opcodes cannot simply be composed of two regular opcodes because that could result in the wide opcode being split across the end of the push buffer and the final RESTART opcode required to wrap the push buffer around would break the wide opcode.

One way to fix this would be to remove the concept of slots to simplify push buffer operations. However, that's not entirely trivial and should be done in a separate patch. For now, simply use a different function to push four-word opcodes into the push buffer. Technically only three words are pushed, with the fourth word used as padding to preserve the 2-word alignment required by the slots abstraction. The fourth word is always a NOP opcode.

Additional care must be taken when the end of the push buffer is reached. If a four-word opcode doesn't fit into the push buffer without being split by the boundary, NOP opcodes will be introduced and the new wide opcode placed at the beginning of the push buffer.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/cdma.c | 92 +++++++++++++++++++++++++++++++++++ drivers/gpu/host1x/cdma.h | 2 + include/trace/events/host1x.h | 26 ++++++++++ 3 files changed, 120 insertions(+)

diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c index 91df51e631b2..e2d106fa3c0b 100644 --- a/drivers/gpu/host1x/cdma.c +++ b/drivers/gpu/host1x/cdma.c @@ -217,6 +217,45 @@ unsigned int host1x_cdma_wait_locked(struct host1x_cdma *cdma, return 0; }

+/* + * Sleep (if necessary) until the push buffer has enough free space. + * + * Must be called with the cdma lock held. + */ +int host1x_cdma_wait_pushbuffer_space(struct host1x *host1x, + struct host1x_cdma *cdma, + unsigned int needed) +{ + while (true) { + struct push_buffer *pb = &cdma->push_buffer; + unsigned int space; + + space = host1x_pushbuffer_space(pb); + if (space >= needed) + break; + + trace_host1x_wait_cdma(dev_name(cdma_to_channel(cdma)->dev), + CDMA_EVENT_PUSH_BUFFER_SPACE); + + host1x_hw_cdma_flush(host1x, cdma); + + /* If somebody has managed to already start waiting, yield */ + if (cdma->event != CDMA_EVENT_NONE) { + mutex_unlock(&cdma->lock); + schedule(); + mutex_lock(&cdma->lock); + continue; + } + + cdma->event = CDMA_EVENT_PUSH_BUFFER_SPACE; + + mutex_unlock(&cdma->lock); + down(&cdma->sem); + mutex_lock(&cdma->lock); + } + + return 0; +} /* * Start timer that tracks the time spent by the job. * Must be called with the cdma lock held. @@ -509,6 +548,59 @@ void host1x_cdma_push(struct host1x_cdma *cdma, u32 op1, u32 op2) host1x_pushbuffer_push(pb, op1, op2); }

+/* + * Push four words into two consecutive push buffer slots. Note that extra + * care needs to be taken not to split the two slots across the end of the + * push buffer. Otherwise the RESTART opcode at the end of the push buffer + * that ensures processing will restart at the beginning will break up the + * four words. + * + * Blocks as necessary if the push buffer is full. + */ +void host1x_cdma_push_wide(struct host1x_cdma *cdma, u32 op1, u32 op2, + u32 op3, u32 op4) +{ + struct host1x_channel *channel = cdma_to_channel(cdma); + struct host1x *host1x = cdma_to_host1x(cdma); + struct push_buffer *pb = &cdma->push_buffer; + unsigned int needed = 2, extra = 0, i; + unsigned int space = cdma->slots_free; + + if (host1x_debug_trace_cmdbuf) + trace_host1x_cdma_push_wide(dev_name(channel->dev), op1, op2, + op3, op4); + + /* compute number of extra slots needed for padding */ + if (pb->pos + 16 > pb->size) { + extra = (pb->size - pb->pos) / 8; + needed += extra; + } + + host1x_cdma_wait_pushbuffer_space(host1x, cdma, needed); + space = host1x_pushbuffer_space(pb); + + cdma->slots_free = space - needed; + cdma->slots_used += needed; + + /* + * Note that we rely on the fact that this is only used to submit wide + * gather opcodes, which consist of 3 words, and they are padded with + * a NOP to avoid having to deal with fractional slots (a slot always + * represents 2 words). The fourth opcode passed to this function will + * therefore always be a NOP. + * + * This works around a slight ambiguity when it comes to opcodes. For + * all current host1x incarnations the NOP opcode uses the exact same + * encoding (0x20000000), so we could hard-code the value here, but a + * new incarnation may change it and break that assumption. + */ + for (i = 0; i < extra; i++) + host1x_pushbuffer_push(pb, op4, op4); + + host1x_pushbuffer_push(pb, op1, op2); + host1x_pushbuffer_push(pb, op3, op4); +} + /* * End a cdma submit * Kick off DMA, add job to the sync queue, and a number of slots to be freed diff --git a/drivers/gpu/host1x/cdma.h b/drivers/gpu/host1x/cdma.h index e97e17b82370..ebf6c6a06541 100644 --- a/drivers/gpu/host1x/cdma.h +++ b/drivers/gpu/host1x/cdma.h @@ -90,6 +90,8 @@ int host1x_cdma_init(struct host1x_cdma *cdma); int host1x_cdma_deinit(struct host1x_cdma *cdma); int host1x_cdma_begin(struct host1x_cdma *cdma, struct host1x_job *job); void host1x_cdma_push(struct host1x_cdma *cdma, u32 op1, u32 op2); +void host1x_cdma_push_wide(struct host1x_cdma *cdma, u32 op1, u32 op2, + u32 op3, u32 op4); void host1x_cdma_end(struct host1x_cdma *cdma, struct host1x_job *job); void host1x_cdma_update(struct host1x_cdma *cdma); void host1x_cdma_peek(struct host1x_cdma *cdma, u32 dmaget, int slot, diff --git a/include/trace/events/host1x.h b/include/trace/events/host1x.h index a37ef73092e5..3d340b6f1ea3 100644 --- a/include/trace/events/host1x.h +++ b/include/trace/events/host1x.h @@ -80,6 +80,32 @@ TRACE_EVENT(host1x_cdma_push, __entry->name, __entry->op1, __entry->op2) );

+TRACE_EVENT(host1x_cdma_push_wide, + TP_PROTO(const char *name, u32 op1, u32 op2, u32 op3, u32 op4), + + TP_ARGS(name, op1, op2, op3, op4), + + TP_STRUCT__entry( + __field(const char *, name) + __field(u32, op1) + __field(u32, op2) + __field(u32, op3) + __field(u32, op4) + ), + + TP_fast_assign( + __entry->name = name; + __entry->op1 = op1; + __entry->op2 = op2; + __entry->op3 = op3; + __entry->op4 = op4; + ), + + TP_printk("name=%s, op1=%08x, op2=%08x, op3=%08x op4=%08x", + __entry->name, __entry->op1, __entry->op2, __entry->op3, + __entry->op4) +); + TRACE_EVENT(host1x_cdma_push_gather, TP_PROTO(const char *name, struct host1x_bo *bo, u32 words, u32 offset, void *cmdbuf),

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 04/16] gpu: host1x: Support 40-bit addressing

From: Thierry Reding treding@nvidia.com

Tegra186 and later support 40 bits of address space. Additional registers need to be programmed to store the full 40 bits of push buffer addresses.

Since command stream gathers can also reside in buffers in a 40-bit address space, a new variant of the GATHER opcode is also introduced. It takes two parameters: the first parameter contains the lower 32 bits of the address and the second parameter contains bits 32 to 39.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/hw/cdma_hw.c | 32 ++++++++++++++++++----- drivers/gpu/host1x/hw/channel_hw.c | 30 ++++++++++++++++++--- drivers/gpu/host1x/hw/host1x06_hardware.h | 5 ++++ drivers/gpu/host1x/hw/host1x07_hardware.h | 5 ++++ 4 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/host1x/hw/cdma_hw.c b/drivers/gpu/host1x/hw/cdma_hw.c index ce320534cbed..485aef5761af 100644 --- a/drivers/gpu/host1x/hw/cdma_hw.c +++ b/drivers/gpu/host1x/hw/cdma_hw.c @@ -68,20 +68,31 @@ static void cdma_timeout_cpu_incr(struct host1x_cdma *cdma, u32 getptr, static void cdma_start(struct host1x_cdma *cdma) { struct host1x_channel *ch = cdma_to_channel(cdma); + u64 start, end;

if (cdma->running) return;

cdma->last_pos = cdma->push_buffer.pos; + start = cdma->push_buffer.dma; + end = start + cdma->push_buffer.size + 4;

host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP, HOST1X_CHANNEL_DMACTRL);

/* set base, put and end pointer */ - host1x_ch_writel(ch, cdma->push_buffer.dma, HOST1X_CHANNEL_DMASTART); + host1x_ch_writel(ch, lower_32_bits(start), HOST1X_CHANNEL_DMASTART); +#if HOST1X_HW >= 6 + host1x_ch_writel(ch, upper_32_bits(start), HOST1X_CHANNEL_DMASTART_HI); +#endif host1x_ch_writel(ch, cdma->push_buffer.pos, HOST1X_CHANNEL_DMAPUT); - host1x_ch_writel(ch, cdma->push_buffer.dma + cdma->push_buffer.size + 4, - HOST1X_CHANNEL_DMAEND); +#if HOST1X_HW >= 6 + host1x_ch_writel(ch, 0, HOST1X_CHANNEL_DMAPUT_HI); +#endif + host1x_ch_writel(ch, lower_32_bits(end), HOST1X_CHANNEL_DMAEND); +#if HOST1X_HW >= 6 + host1x_ch_writel(ch, upper_32_bits(end), HOST1X_CHANNEL_DMAEND_HI); +#endif

/* reset GET */ host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP | @@ -104,6 +115,7 @@ static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr) { struct host1x *host1x = cdma_to_host1x(cdma); struct host1x_channel *ch = cdma_to_channel(cdma); + u64 start, end;

if (cdma->running) return; @@ -113,10 +125,18 @@ static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr) host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP, HOST1X_CHANNEL_DMACTRL);

+ start = cdma->push_buffer.dma; + end = start + cdma->push_buffer.size + 4; + /* set base, end pointer (all of memory) */ - host1x_ch_writel(ch, cdma->push_buffer.dma, HOST1X_CHANNEL_DMASTART); - host1x_ch_writel(ch, cdma->push_buffer.dma + cdma->push_buffer.size, - HOST1X_CHANNEL_DMAEND); + host1x_ch_writel(ch, lower_32_bits(start), HOST1X_CHANNEL_DMASTART); +#if HOST1X_HW >= 6 + host1x_ch_writel(ch, upper_32_bits(start), HOST1X_CHANNEL_DMASTART_HI); +#endif + host1x_ch_writel(ch, lower_32_bits(end), HOST1X_CHANNEL_DMAEND); +#if HOST1X_HW >= 6 + host1x_ch_writel(ch, upper_32_bits(end), HOST1X_CHANNEL_DMAEND_HI); +#endif

/* set GET, by loading the value in PUT (then reset GET) */ host1x_ch_writel(ch, getptr, HOST1X_CHANNEL_DMAPUT); diff --git a/drivers/gpu/host1x/hw/channel_hw.c b/drivers/gpu/host1x/hw/channel_hw.c index 384f6ac91afa..73f11a418ee6 100644 --- a/drivers/gpu/host1x/hw/channel_hw.c +++ b/drivers/gpu/host1x/hw/channel_hw.c @@ -60,15 +60,37 @@ static void trace_write_gather(struct host1x_cdma *cdma, struct host1x_bo *bo, static void submit_gathers(struct host1x_job *job) { struct host1x_cdma *cdma = &job->channel->cdma; +#if HOST1X_HW < 6 + struct device *dev = job->channel->dev; +#endif unsigned int i;

for (i = 0; i < job->num_gathers; i++) { struct host1x_job_gather *g = &job->gathers[i]; - u32 op1 = host1x_opcode_gather(g->words); - u32 op2 = g->base + g->offset; + dma_addr_t addr = g->base + g->offset; + u32 op2, op3; + + op2 = lower_32_bits(addr); + op3 = upper_32_bits(addr); + + trace_write_gather(cdma, g->bo, g->offset, g->words); + + if (op3 != 0) { +#if HOST1X_HW >= 6 + u32 op1 = host1x_opcode_gather_wide(g->words); + u32 op4 = HOST1X_OPCODE_NOP;

- trace_write_gather(cdma, g->bo, g->offset, op1 & 0xffff); - host1x_cdma_push(cdma, op1, op2); + host1x_cdma_push_wide(cdma, op1, op2, op3, op4); +#else + dev_err(dev, "invalid gather for push buffer %pad\n", + &addr); + continue; +#endif + } else { + u32 op1 = host1x_opcode_gather(g->words); + + host1x_cdma_push(cdma, op1, op2); + } } }

diff --git a/drivers/gpu/host1x/hw/host1x06_hardware.h b/drivers/gpu/host1x/hw/host1x06_hardware.h index eab753b91f24..dd37b10c8d04 100644 --- a/drivers/gpu/host1x/hw/host1x06_hardware.h +++ b/drivers/gpu/host1x/hw/host1x06_hardware.h @@ -138,6 +138,11 @@ static inline u32 host1x_opcode_gather_incr(unsigned offset, unsigned count) return (6 << 28) | (offset << 16) | BIT(15) | BIT(14) | count; }

+static inline u32 host1x_opcode_gather_wide(unsigned count) +{ + return (12 << 28) | count; +} + #define HOST1X_OPCODE_NOP host1x_opcode_nonincr(0, 0)

#endif diff --git a/drivers/gpu/host1x/hw/host1x07_hardware.h b/drivers/gpu/host1x/hw/host1x07_hardware.h index a79f57dc87bb..9f6da4ee5443 100644 --- a/drivers/gpu/host1x/hw/host1x07_hardware.h +++ b/drivers/gpu/host1x/hw/host1x07_hardware.h @@ -138,6 +138,11 @@ static inline u32 host1x_opcode_gather_incr(unsigned offset, unsigned count) return (6 << 28) | (offset << 16) | BIT(15) | BIT(14) | count; }

+static inline u32 host1x_opcode_gather_wide(unsigned count) +{ + return (12 << 28) | count; +} + #define HOST1X_OPCODE_NOP host1x_opcode_nonincr(0, 0)

#endif

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 05/16] gpu: host1x: Use direct DMA with IOMMU API usage

From: Thierry Reding treding@nvidia.com

If we use the IOMMU API directly to map buffers into host1x' IOVA space, we must make sure that the DMA API doesn't already set up a mapping, or else translation will fail.

The direct DMA API allows us to allocate memory that will not be mapped through an IOMMU automatically.

Reviewed-by: Dmitry Osipenko digetx@gmail.com Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/cdma.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c index e2d106fa3c0b..a96c4dd1e449 100644 --- a/drivers/gpu/host1x/cdma.c +++ b/drivers/gpu/host1x/cdma.c @@ -19,6 +19,7 @@

#include <asm/cacheflush.h> #include <linux/device.h> +#include <linux/dma-direct.h> #include <linux/dma-mapping.h> #include <linux/host1x.h> #include <linux/interrupt.h> @@ -70,6 +71,7 @@ static void host1x_pushbuffer_destroy(struct push_buffer *pb) */ static int host1x_pushbuffer_init(struct push_buffer *pb) { + unsigned long attrs = DMA_ATTR_WRITE_COMBINE; struct host1x_cdma *cdma = pb_to_cdma(pb); struct host1x *host1x = cdma_to_host1x(cdma); struct iova *alloc; @@ -91,8 +93,8 @@ static int host1x_pushbuffer_init(struct push_buffer *pb)

size = iova_align(&host1x->iova, size);

- pb->mapped = dma_alloc_wc(host1x->dev, size, &pb->phys, - GFP_KERNEL); + pb->mapped = dma_direct_alloc(host1x->dev, size, &pb->phys, + GFP_KERNEL, attrs); if (!pb->mapped) return -ENOMEM;

@@ -127,7 +129,10 @@ static int host1x_pushbuffer_init(struct push_buffer *pb) iommu_free_iova: __free_iova(&host1x->iova, alloc); iommu_free_mem: - dma_free_wc(host1x->dev, size, pb->mapped, pb->phys); + if (host1x->domain) + dma_direct_free(host1x->dev, size, pb->mapped, pb->phys, attrs); + else + dma_free_wc(host1x->dev, size, pb->mapped, pb->phys);

return err; }

-- 2.19.1

Dmitry Osipenko

2:46 p.m.

New subject: [PATCH v3 05/16] gpu: host1x: Use direct DMA with IOMMU API usage

01.02.2019 16:28, Thierry Reding пишет:

...

From: Thierry Reding treding@nvidia.com

If we use the IOMMU API directly to map buffers into host1x' IOVA space, we must make sure that the DMA API doesn't already set up a mapping, or else translation will fail.

The direct DMA API allows us to allocate memory that will not be mapped through an IOMMU automatically.

Reviewed-by: Dmitry Osipenko digetx@gmail.com Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/host1x/cdma.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c index e2d106fa3c0b..a96c4dd1e449 100644 --- a/drivers/gpu/host1x/cdma.c +++ b/drivers/gpu/host1x/cdma.c @@ -19,6 +19,7 @@

#include <asm/cacheflush.h> #include <linux/device.h> +#include <linux/dma-direct.h> #include <linux/dma-mapping.h> #include <linux/host1x.h> #include <linux/interrupt.h> @@ -70,6 +71,7 @@ static void host1x_pushbuffer_destroy(struct push_buffer *pb) */ static int host1x_pushbuffer_init(struct push_buffer *pb) {

unsigned long attrs = DMA_ATTR_WRITE_COMBINE; struct host1x_cdma *cdma = pb_to_cdma(pb); struct host1x *host1x = cdma_to_host1x(cdma); struct iova *alloc;

@@ -91,8 +93,8 @@ static int host1x_pushbuffer_init(struct push_buffer *pb)
size = iova_align(&host1x->iova, size);
pb->mapped = dma_alloc_wc(host1x->dev, size, &pb->phys,
			  GFP_KERNEL);
pb->mapped = dma_direct_alloc(host1x->dev, size, &pb->phys,
			      GFP_KERNEL, attrs);
if (!pb->mapped) return -ENOMEM;
@@ -127,7 +129,10 @@ static int host1x_pushbuffer_init(struct push_buffer *pb) iommu_free_iova: __free_iova(&host1x->iova, alloc); iommu_free_mem:

dma_free_wc(host1x->dev, size, pb->mapped, pb->phys);
if (host1x->domain)
dma_direct_free(host1x->dev, size, pb->mapped, pb->phys, attrs);
else
dma_free_wc(host1x->dev, size, pb->mapped, pb->phys);
return err;
}

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 06/16] gpu: host1x: Restrict IOVA space to DMA mask

From: Thierry Reding treding@nvidia.com

On Tegra186 and later, the ARM SMMU provides an input address space that is 48 bits wide. However, memory clients can only address up to 40 bits. If the geometry is used as-is, allocations of IOVA space can end up in a region that is not addressable by the memory clients.

To fix this, restrict the IOVA space to the DMA mask of the host1x device.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/dev.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c index 4c044ee54fe6..544b67f2b3ff 100644 --- a/drivers/gpu/host1x/dev.c +++ b/drivers/gpu/host1x/dev.c @@ -283,6 +283,8 @@ static int host1x_probe(struct platform_device *pdev) host->group = iommu_group_get(&pdev->dev); if (host->group) { struct iommu_domain_geometry *geometry; + u64 mask = dma_get_mask(host->dev); + dma_addr_t start, end; unsigned long order;

err = iova_cache_get(); @@ -310,11 +312,12 @@ static int host1x_probe(struct platform_device *pdev) }

geometry = &host->domain->geometry; + start = geometry->aperture_start & mask; + end = geometry->aperture_end & mask;

order = __ffs(host->domain->pgsize_bitmap); - init_iova_domain(&host->iova, 1UL << order, - geometry->aperture_start >> order); - host->iova_end = geometry->aperture_end; + init_iova_domain(&host->iova, 1UL << order, start >> order); + host->iova_end = end; }

skip_iommu:

-- 2.19.1

Dmitry Osipenko

2:47 p.m.

New subject: [PATCH v3 06/16] gpu: host1x: Restrict IOVA space to DMA mask

01.02.2019 16:28, Thierry Reding пишет:

...

From: Thierry Reding treding@nvidia.com

On Tegra186 and later, the ARM SMMU provides an input address space that is 48 bits wide. However, memory clients can only address up to 40 bits. If the geometry is used as-is, allocations of IOVA space can end up in a region that is not addressable by the memory clients.

To fix this, restrict the IOVA space to the DMA mask of the host1x device.

Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/host1x/dev.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c index 4c044ee54fe6..544b67f2b3ff 100644 --- a/drivers/gpu/host1x/dev.c +++ b/drivers/gpu/host1x/dev.c @@ -283,6 +283,8 @@ static int host1x_probe(struct platform_device *pdev) host->group = iommu_group_get(&pdev->dev); if (host->group) { struct iommu_domain_geometry *geometry;
u64 mask = dma_get_mask(host->dev);
dma_addr_t start, end;
unsigned long order;

err = iova_cache_get();
@@ -310,11 +312,12 @@ static int host1x_probe(struct platform_device *pdev) }
geometry = &host->domain->geometry;
start = geometry->aperture_start & mask;
end = geometry->aperture_end & mask;
order = __ffs(host->domain->pgsize_bitmap);
init_iova_domain(&host->iova, 1UL << order,
		 geometry->aperture_start >> order);
host->iova_end = geometry->aperture_end;
init_iova_domain(&host->iova, 1UL << order, start >> order);
host->iova_end = end;
}
skip_iommu:

For older Tegra's:

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 07/16] gpu: host1x: Support 40-bit addressing on Tegra186

From: Thierry Reding treding@nvidia.com

The host1x and clients instantiated on Tegra186 support addressing 40 bits of memory.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c index 544b67f2b3ff..ee3c7b81a29d 100644 --- a/drivers/gpu/host1x/dev.c +++ b/drivers/gpu/host1x/dev.c @@ -136,7 +136,7 @@ static const struct host1x_info host1x06_info = { .nb_bases = 16, .init = host1x06_init, .sync_offset = 0x0, - .dma_mask = DMA_BIT_MASK(34), + .dma_mask = DMA_BIT_MASK(40), .has_hypervisor = true, .num_sid_entries = ARRAY_SIZE(tegra186_sid_table), .sid_table = tegra186_sid_table,

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

From: Thierry Reding treding@nvidia.com

The HOST1X_CHANNEL_DMAEND is an offset relative to the value written to the HOST1X_CHANNEL_DMASTART register, but it is currently treated as an absolute address. This can cause SMMU faults if the CDMA fetches past a pushbuffer's IOMMU mapping.

Properly setting the DMAEND prevents the CDMA from fetching beyond that address and avoid such issues. This is currently not observed because a whole (almost) page of essentially scratch space absorbs any excessive prefetching by CDMA. However, changing the number of slots in the push buffer can trigger these SMMU faults.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/hw/cdma_hw.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/host1x/hw/cdma_hw.c b/drivers/gpu/host1x/hw/cdma_hw.c index 485aef5761af..a24c090ac96f 100644 --- a/drivers/gpu/host1x/hw/cdma_hw.c +++ b/drivers/gpu/host1x/hw/cdma_hw.c @@ -75,7 +75,7 @@ static void cdma_start(struct host1x_cdma *cdma)

cdma->last_pos = cdma->push_buffer.pos; start = cdma->push_buffer.dma; - end = start + cdma->push_buffer.size + 4; + end = cdma->push_buffer.size + 4;

host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP, HOST1X_CHANNEL_DMACTRL); @@ -126,7 +126,7 @@ static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr) HOST1X_CHANNEL_DMACTRL);

start = cdma->push_buffer.dma; - end = start + cdma->push_buffer.size + 4; + end = cdma->push_buffer.size + 4;

/* set base, end pointer (all of memory) */ host1x_ch_writel(ch, lower_32_bits(start), HOST1X_CHANNEL_DMASTART);

-- 2.19.1

Dmitry Osipenko

1:37 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

01.02.2019 16:28, Thierry Reding пишет:

...

This seems fixes problem that was added by a previous patch in this series, "gpu: host1x: Support 40-bit addressing". What about to just squash the patches?

Thierry Reding

1:41 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

On Fri, Feb 01, 2019 at 04:37:59PM +0300, Dmitry Osipenko wrote:

...

This actually fixes a bug that's always been there. This just happens to touch the same lines as an earlier patch as a result of some refactoring that the earlier patch did.

Thierry

Dmitry Osipenko

1:48 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

01.02.2019 16:41, Thierry Reding пишет:

...

On Fri, Feb 01, 2019 at 04:37:59PM +0300, Dmitry Osipenko wrote:

...
01.02.2019 16:28, Thierry Reding пишет:

...
From: Thierry Reding treding@nvidia.com

The HOST1X_CHANNEL_DMAEND is an offset relative to the value written to the HOST1X_CHANNEL_DMASTART register, but it is currently treated as an absolute address. This can cause SMMU faults if the CDMA fetches past a pushbuffer's IOMMU mapping.

Properly setting the DMAEND prevents the CDMA from fetching beyond that address and avoid such issues. This is currently not observed because a whole (almost) page of essentially scratch space absorbs any excessive prefetching by CDMA. However, changing the number of slots in the push buffer can trigger these SMMU faults.

Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/host1x/hw/cdma_hw.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/host1x/hw/cdma_hw.c b/drivers/gpu/host1x/hw/cdma_hw.c index 485aef5761af..a24c090ac96f 100644 --- a/drivers/gpu/host1x/hw/cdma_hw.c +++ b/drivers/gpu/host1x/hw/cdma_hw.c @@ -75,7 +75,7 @@ static void cdma_start(struct host1x_cdma *cdma)

cdma->last_pos = cdma->push_buffer.pos; start = cdma->push_buffer.dma;

end = start + cdma->push_buffer.size + 4;

end = cdma->push_buffer.size + 4;

host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP, HOST1X_CHANNEL_DMACTRL);

@@ -126,7 +126,7 @@ static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr) HOST1X_CHANNEL_DMACTRL);

start = cdma->push_buffer.dma;

end = start + cdma->push_buffer.size + 4;

end = cdma->push_buffer.size + 4;

/* set base, end pointer (all of memory) */ host1x_ch_writel(ch, lower_32_bits(start), HOST1X_CHANNEL_DMASTART);

This seems fixes problem that was added by a previous patch in this series, "gpu: host1x: Support 40-bit addressing". What about to just squash the patches?

This actually fixes a bug that's always been there. This just happens to touch the same lines as an earlier patch as a result of some refactoring that the earlier patch did.

Oh, wow. Indeed! That's a bit unfortunate :) Though it's quite difficult to spot that bug by looking at the code, good that it got caught. Was this bug triggered by trying to move IOVA allocation to the end of the AS?

Thierry Reding

2:10 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

On Fri, Feb 01, 2019 at 04:48:56PM +0300, Dmitry Osipenko wrote:

...

01.02.2019 16:41, Thierry Reding пишет:

...
On Fri, Feb 01, 2019 at 04:37:59PM +0300, Dmitry Osipenko wrote:

...
01.02.2019 16:28, Thierry Reding пишет:

...
From: Thierry Reding treding@nvidia.com

The HOST1X_CHANNEL_DMAEND is an offset relative to the value written to the HOST1X_CHANNEL_DMASTART register, but it is currently treated as an absolute address. This can cause SMMU faults if the CDMA fetches past a pushbuffer's IOMMU mapping.

Properly setting the DMAEND prevents the CDMA from fetching beyond that address and avoid such issues. This is currently not observed because a whole (almost) page of essentially scratch space absorbs any excessive prefetching by CDMA. However, changing the number of slots in the push buffer can trigger these SMMU faults.

Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/host1x/hw/cdma_hw.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/host1x/hw/cdma_hw.c b/drivers/gpu/host1x/hw/cdma_hw.c index 485aef5761af..a24c090ac96f 100644 --- a/drivers/gpu/host1x/hw/cdma_hw.c +++ b/drivers/gpu/host1x/hw/cdma_hw.c @@ -75,7 +75,7 @@ static void cdma_start(struct host1x_cdma *cdma)

cdma->last_pos = cdma->push_buffer.pos; start = cdma->push_buffer.dma;

end = start + cdma->push_buffer.size + 4;

end = cdma->push_buffer.size + 4;

host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP, HOST1X_CHANNEL_DMACTRL);

@@ -126,7 +126,7 @@ static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr) HOST1X_CHANNEL_DMACTRL);

start = cdma->push_buffer.dma;

end = start + cdma->push_buffer.size + 4;

end = cdma->push_buffer.size + 4;

/* set base, end pointer (all of memory) */ host1x_ch_writel(ch, lower_32_bits(start), HOST1X_CHANNEL_DMASTART);

This seems fixes problem that was added by a previous patch in this series, "gpu: host1x: Support 40-bit addressing". What about to just squash the patches?

This actually fixes a bug that's always been there. This just happens to touch the same lines as an earlier patch as a result of some refactoring that the earlier patch did.

Oh, wow. Indeed! That's a bit unfortunate :) Though it's quite difficult to spot that bug by looking at the code, good that it got caught. Was this bug triggered by trying to move IOVA allocation to the end of the AS?

This was actually triggered because I noticed that we were using 512 slots in the push buffer, which means 4096 bytes, but then we needed that extra 4 bytes for the RESTART opcode, which means that we're currently allocating 8192 bytes of which only 4092 are being used.

So I thought: "Well, we should be able to live with 511 slots per push buffer and save a full memory page. This is an easy optimization!" Boy was I wrong... After making that change I started to see SMMU faults for the address immediately after the push buffer mapping. I think the same error happens regardless of where the push buffer is located. The reason for the SMMU faults seem to be that CDMA happily keeps on pre- fetching (a lot of) commands before it wraps around because of the RESTART opcode. DMAEND is designed as a mechanism to prevent CDMA from excessively fetching commands, but if you program a really large value into it, it'll add that really large value to the DMASTART as offset and the mechanism doesn't work anymore.

We don't currently see this because the 4092 bytes in the "scratch" page of the push buffer allocation are enough to absorb the prefetching that CDMA does.

We would also likely never see it happen in non-SMMU cases because the CDMA would just keep on prefetching whatever memory happened to be after the push buffer.

Thierry

Dmitry Osipenko

2:40 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

01.02.2019 17:10, Thierry Reding пишет:

...

On Fri, Feb 01, 2019 at 04:48:56PM +0300, Dmitry Osipenko wrote:

...
01.02.2019 16:41, Thierry Reding пишет:

...
On Fri, Feb 01, 2019 at 04:37:59PM +0300, Dmitry Osipenko wrote:

...
01.02.2019 16:28, Thierry Reding пишет:

...
From: Thierry Reding treding@nvidia.com

The HOST1X_CHANNEL_DMAEND is an offset relative to the value written to the HOST1X_CHANNEL_DMASTART register, but it is currently treated as an absolute address. This can cause SMMU faults if the CDMA fetches past a pushbuffer's IOMMU mapping.

Properly setting the DMAEND prevents the CDMA from fetching beyond that address and avoid such issues. This is currently not observed because a whole (almost) page of essentially scratch space absorbs any excessive prefetching by CDMA. However, changing the number of slots in the push buffer can trigger these SMMU faults.

Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/host1x/hw/cdma_hw.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/host1x/hw/cdma_hw.c b/drivers/gpu/host1x/hw/cdma_hw.c index 485aef5761af..a24c090ac96f 100644 --- a/drivers/gpu/host1x/hw/cdma_hw.c +++ b/drivers/gpu/host1x/hw/cdma_hw.c @@ -75,7 +75,7 @@ static void cdma_start(struct host1x_cdma *cdma)

cdma->last_pos = cdma->push_buffer.pos; start = cdma->push_buffer.dma;

end = start + cdma->push_buffer.size + 4;

end = cdma->push_buffer.size + 4;

host1x_ch_writel(ch, HOST1X_CHANNEL_DMACTRL_DMASTOP, HOST1X_CHANNEL_DMACTRL);

@@ -126,7 +126,7 @@ static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr) HOST1X_CHANNEL_DMACTRL);

start = cdma->push_buffer.dma;

end = start + cdma->push_buffer.size + 4;

end = cdma->push_buffer.size + 4;

/* set base, end pointer (all of memory) */ host1x_ch_writel(ch, lower_32_bits(start), HOST1X_CHANNEL_DMASTART);

This seems fixes problem that was added by a previous patch in this series, "gpu: host1x: Support 40-bit addressing". What about to just squash the patches?

This actually fixes a bug that's always been there. This just happens to touch the same lines as an earlier patch as a result of some refactoring that the earlier patch did.

Oh, wow. Indeed! That's a bit unfortunate :) Though it's quite difficult to spot that bug by looking at the code, good that it got caught. Was this bug triggered by trying to move IOVA allocation to the end of the AS?

This was actually triggered because I noticed that we were using 512 slots in the push buffer, which means 4096 bytes, but then we needed that extra 4 bytes for the RESTART opcode, which means that we're currently allocating 8192 bytes of which only 4092 are being used.

So I thought: "Well, we should be able to live with 511 slots per push buffer and save a full memory page. This is an easy optimization!" Boy was I wrong... After making that change I started to see SMMU faults for the address immediately after the push buffer mapping. I think the same error happens regardless of where the push buffer is located. The reason for the SMMU faults seem to be that CDMA happily keeps on pre- fetching (a lot of) commands before it wraps around because of the RESTART opcode. DMAEND is designed as a mechanism to prevent CDMA from excessively fetching commands, but if you program a really large value into it, it'll add that really large value to the DMASTART as offset and the mechanism doesn't work anymore.

We don't currently see this because the 4092 bytes in the "scratch" page of the push buffer allocation are enough to absorb the prefetching that CDMA does.

We would also likely never see it happen in non-SMMU cases because the CDMA would just keep on prefetching whatever memory happened to be after the push buffer.

Yeah, that's what usually happens when the code is getting improved. Good job! :)

Dmitry Osipenko

2:47 p.m.

New subject: [PATCH v3 08/16] gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND

01.02.2019 16:28, Thierry Reding пишет:

...

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 09/16] gpu: host1x: Optimize CDMA push buffer memory usage

From: Thierry Reding treding@nvidia.com

The host1x CDMA push buffer is terminated by a special opcode (RESTART) that tells the CDMA to wrap around to the beginning of the push buffer. To accomodate the RESTART opcode, an extra 4 bytes are allocated on top of the 512 * 8 = 4096 bytes needed for the 512 slots (1 slot = 2 words) that are used for other commands passed to CDMA. This requires that two memory pages are allocated, but most of the second page (4092 bytes) is never used.

Decrease the number of slots to 511 so that the RESTART opcode fits within the page. Adjust the push buffer wraparound code to take into account push buffer sizes that are not a power of two.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/host1x/cdma.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c index a96c4dd1e449..50c1370b56c7 100644 --- a/drivers/gpu/host1x/cdma.c +++ b/drivers/gpu/host1x/cdma.c @@ -42,7 +42,17 @@ * means that the push buffer is full, not empty. */

-#define HOST1X_PUSHBUFFER_SLOTS 512 +/* + * Typically the commands written into the push buffer are a pair of words. We + * use slots to represent each of these pairs and to simplify things. Note the + * strange number of slots allocated here. 512 slots will fit exactly within a + * single memory page. We also need one additional word at the end of the push + * buffer for the RESTART opcode that will instruct the CDMA to jump back to + * the beginning of the push buffer. With 512 slots, this means that we'll use + * 2 memory pages and waste 4092 bytes of the second page that will never be + * used. + */ +#define HOST1X_PUSHBUFFER_SLOTS 511

/* * Clean up push buffer resources @@ -148,7 +158,10 @@ static void host1x_pushbuffer_push(struct push_buffer *pb, u32 op1, u32 op2) WARN_ON(pb->pos == pb->fence); *(p++) = op1; *(p++) = op2; - pb->pos = (pb->pos + 8) & (pb->size - 1); + pb->pos += 8; + + if (pb->pos >= pb->size) + pb->pos -= pb->size; }

/* @@ -158,7 +171,10 @@ static void host1x_pushbuffer_push(struct push_buffer *pb, u32 op1, u32 op2) static void host1x_pushbuffer_pop(struct push_buffer *pb, unsigned int slots) { /* Advance the next write position */ - pb->fence = (pb->fence + slots * 8) & (pb->size - 1); + pb->fence += slots * 8; + + if (pb->fence >= pb->size) + pb->fence -= pb->size; }

/* @@ -166,7 +182,12 @@ static void host1x_pushbuffer_pop(struct push_buffer *pb, unsigned int slots) */ static u32 host1x_pushbuffer_space(struct push_buffer *pb) { - return ((pb->fence - pb->pos) & (pb->size - 1)) / 8; + unsigned int fence = pb->fence; + + if (pb->fence < pb->pos) + fence += pb->size; + + return (fence - pb->pos) / 8; }

-- 2.19.1

Dmitry Osipenko

2:48 p.m.

New subject: [PATCH v3 09/16] gpu: host1x: Optimize CDMA push buffer memory usage

01.02.2019 16:28, Thierry Reding пишет:

...

From: Thierry Reding treding@nvidia.com

The host1x CDMA push buffer is terminated by a special opcode (RESTART) that tells the CDMA to wrap around to the beginning of the push buffer. To accomodate the RESTART opcode, an extra 4 bytes are allocated on top of the 512 * 8 = 4096 bytes needed for the 512 slots (1 slot = 2 words) that are used for other commands passed to CDMA. This requires that two memory pages are allocated, but most of the second page (4092 bytes) is never used.

Decrease the number of slots to 511 so that the RESTART opcode fits within the page. Adjust the push buffer wraparound code to take into account push buffer sizes that are not a power of two.

Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/host1x/cdma.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c index a96c4dd1e449..50c1370b56c7 100644 --- a/drivers/gpu/host1x/cdma.c +++ b/drivers/gpu/host1x/cdma.c @@ -42,7 +42,17 @@

means that the push buffer is full, not empty.

*/

-#define HOST1X_PUSHBUFFER_SLOTS 512 +/*

Typically the commands written into the push buffer are a pair of words. We

use slots to represent each of these pairs and to simplify things. Note the

strange number of slots allocated here. 512 slots will fit exactly within a

single memory page. We also need one additional word at the end of the push

buffer for the RESTART opcode that will instruct the CDMA to jump back to

the beginning of the push buffer. With 512 slots, this means that we'll use

2 memory pages and waste 4092 bytes of the second page that will never be

used.

*/

+#define HOST1X_PUSHBUFFER_SLOTS 511

/*

Clean up push buffer resources

@@ -148,7 +158,10 @@ static void host1x_pushbuffer_push(struct push_buffer *pb, u32 op1, u32 op2) WARN_ON(pb->pos == pb->fence); *(p++) = op1; *(p++) = op2;

pb->pos = (pb->pos + 8) & (pb->size - 1);
pb->pos += 8;

if (pb->pos >= pb->size)
pb->pos -= pb->size;
}

/* @@ -158,7 +171,10 @@ static void host1x_pushbuffer_push(struct push_buffer *pb, u32 op1, u32 op2) static void host1x_pushbuffer_pop(struct push_buffer *pb, unsigned int slots) { /* Advance the next write position */

pb->fence = (pb->fence + slots * 8) & (pb->size - 1);
pb->fence += slots * 8;

if (pb->fence >= pb->size)
pb->fence -= pb->size;
}

/* @@ -166,7 +182,12 @@ static void host1x_pushbuffer_pop(struct push_buffer *pb, unsigned int slots) */ static u32 host1x_pushbuffer_space(struct push_buffer *pb) {

return ((pb->fence - pb->pos) & (pb->size - 1)) / 8;
unsigned int fence = pb->fence;

if (pb->fence < pb->pos)
fence += pb->size;
return (fence - pb->pos) / 8;
}

/*

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 10/16] drm/tegra: Store parent pointer in Tegra DRM clients

From: Thierry Reding treding@nvidia.com

Tegra DRM clients need access to their parent, so store a pointer to it upon registration. It's technically possible to get at this by going via the host1x client's parent and getting the driver data, but that's quite complicated and not very transparent. It's much more straightforward and natural to let the children know about their parent.

Signed-off-by: Thierry Reding treding@nvidia.com Reviewed-by: Dmitry Osipenko digetx@gmail.com --- drivers/gpu/drm/tegra/drm.c | 2 ++ drivers/gpu/drm/tegra/drm.h | 1 + 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index 4b70ce664c41..61dcbd218ffc 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -1041,6 +1041,7 @@ int tegra_drm_register_client(struct tegra_drm *tegra, { mutex_lock(&tegra->clients_lock); list_add_tail(&client->list, &tegra->clients); + client->drm = tegra; mutex_unlock(&tegra->clients_lock);

return 0; @@ -1051,6 +1052,7 @@ int tegra_drm_unregister_client(struct tegra_drm *tegra, { mutex_lock(&tegra->clients_lock); list_del_init(&client->list); + client->drm = NULL; mutex_unlock(&tegra->clients_lock);

return 0; diff --git a/drivers/gpu/drm/tegra/drm.h b/drivers/gpu/drm/tegra/drm.h index 7370f7a0fdb8..70154c253d45 100644 --- a/drivers/gpu/drm/tegra/drm.h +++ b/drivers/gpu/drm/tegra/drm.h @@ -88,6 +88,7 @@ int tegra_drm_submit(struct tegra_drm_context *context, struct tegra_drm_client { struct host1x_client base; struct list_head list; + struct tegra_drm *drm;

unsigned int version; const struct tegra_drm_client_ops *ops;

-- 2.19.1

Dmitry Osipenko

2:48 p.m.

New subject: [PATCH v3 10/16] drm/tegra: Store parent pointer in Tegra DRM clients

01.02.2019 16:28, Thierry Reding пишет:

...

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 11/16] drm/tegra: vic: Load firmware on demand

From: Thierry Reding treding@nvidia.com

Loading the firmware requires an allocation of IOVA space to make sure that the VIC's Falcon microcontroller can read the firmware if address translation via the SMMU is enabled.

However, the allocation currently happens at a time where the geometry of an IOMMU domain may not have been initialized yet. This happens for example on Tegra186 and later where an ARM SMMU is used. Domains which are created by the ARM SMMU driver postpone the geometry setup until a device is attached to the domain. This is because IOMMU domains aren't attached to a specific IOMMU instance at allocation time and hence the input address space, which defines the geometry, is not known yet.

Work around this by postponing the firmware load until it is needed at the time where a channel is opened to the VIC. At this time the shared IOMMU domain's geometry has been properly initialized.

As a byproduct this allows the Tegra DRM to be created in the absence of VIC firmware, since the VIC initialization no longer fails if the firmware can't be found.

Based on an earlier patch by Dmitry Osipenko digetx@gmail.com.

Signed-off-by: Thierry Reding treding@nvidia.com Reviewed-by: Dmitry Osipenko digetx@gmail.com --- drivers/gpu/drm/tegra/vic.c | 53 +++++++++++++++++++++++++------------ 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/tegra/vic.c b/drivers/gpu/drm/tegra/vic.c index d47983deb1cf..739b894661ab 100644 --- a/drivers/gpu/drm/tegra/vic.c +++ b/drivers/gpu/drm/tegra/vic.c @@ -181,13 +181,6 @@ static int vic_init(struct host1x_client *client) vic->domain = tegra->domain; }

- if (!vic->falcon.data) { - vic->falcon.data = tegra; - err = falcon_load_firmware(&vic->falcon); - if (err < 0) - goto detach; - } - vic->channel = host1x_channel_request(client->dev); if (!vic->channel) { err = -ENOMEM; @@ -246,6 +239,30 @@ static const struct host1x_client_ops vic_client_ops = { .exit = vic_exit, };

+static int vic_load_firmware(struct vic *vic) +{ + int err; + + if (vic->falcon.data) + return 0; + + vic->falcon.data = vic->client.drm; + + err = falcon_read_firmware(&vic->falcon, vic->config->firmware); + if (err < 0) + goto cleanup; + + err = falcon_load_firmware(&vic->falcon); + if (err < 0) + goto cleanup; + + return 0; + +cleanup: + vic->falcon.data = NULL; + return err; +} + static int vic_open_channel(struct tegra_drm_client *client, struct tegra_drm_context *context) { @@ -256,19 +273,25 @@ static int vic_open_channel(struct tegra_drm_client *client, if (err < 0) return err;

+ err = vic_load_firmware(vic); + if (err < 0) + goto rpm_put; + err = vic_boot(vic); - if (err < 0) { - pm_runtime_put(vic->dev); - return err; - } + if (err < 0) + goto rpm_put;

context->channel = host1x_channel_get(vic->channel); if (!context->channel) { - pm_runtime_put(vic->dev); - return -ENOMEM; + err = -ENOMEM; + goto rpm_put; }

return 0; + +rpm_put: + pm_runtime_put(vic->dev); + return err; }

static void vic_close_channel(struct tegra_drm_context *context) @@ -372,10 +395,6 @@ static int vic_probe(struct platform_device *pdev) if (err < 0) return err;

- err = falcon_read_firmware(&vic->falcon, vic->config->firmware); - if (err < 0) - goto exit_falcon; - platform_set_drvdata(pdev, vic);

INIT_LIST_HEAD(&vic->client.base.list);

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 12/16] drm/tegra: Setup shared IOMMU domain after initialization

From: Thierry Reding treding@nvidia.com

Move initialization of the shared IOMMU domain after the host1x device has been initialized. At this point all the Tegra DRM clients have been attached to the shared IOMMU domain.

This is important because Tegra186 and later use an ARM SMMU, for which the driver defers setting up the geometry for a domain until a device is attached to it. This is to ensure that the domain is properly set up for a specific ARM SMMU instance, which is unknown at allocation time.

Reviewed-by: Dmitry Osipenko digetx@gmail.com Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/drm/tegra/drm.c | 54 ++++++++++++++++++++----------------- 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index 61dcbd218ffc..271c7a5fc954 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -92,10 +92,6 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags) return -ENOMEM;

if (iommu_present(&platform_bus_type)) { - u64 carveout_start, carveout_end, gem_start, gem_end; - struct iommu_domain_geometry *geometry; - unsigned long order; - tegra->domain = iommu_domain_alloc(&platform_bus_type); if (!tegra->domain) { err = -ENOMEM; @@ -105,27 +101,6 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags) err = iova_cache_get(); if (err < 0) goto domain; - - geometry = &tegra->domain->geometry; - gem_start = geometry->aperture_start; - gem_end = geometry->aperture_end - CARVEOUT_SZ; - carveout_start = gem_end + 1; - carveout_end = geometry->aperture_end; - - order = __ffs(tegra->domain->pgsize_bitmap); - init_iova_domain(&tegra->carveout.domain, 1UL << order, - carveout_start >> order); - - tegra->carveout.shift = iova_shift(&tegra->carveout.domain); - tegra->carveout.limit = carveout_end >> tegra->carveout.shift; - - drm_mm_init(&tegra->mm, gem_start, gem_end - gem_start + 1); - mutex_init(&tegra->mm_lock); - - DRM_DEBUG("IOMMU apertures:\n"); - DRM_DEBUG(" GEM: %#llx-%#llx\n", gem_start, gem_end); - DRM_DEBUG(" Carveout: %#llx-%#llx\n", carveout_start, - carveout_end); }

mutex_init(&tegra->clients_lock); @@ -159,6 +134,35 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags) if (err < 0) goto fbdev;

+ if (tegra->domain) { + u64 carveout_start, carveout_end, gem_start, gem_end; + dma_addr_t start, end; + unsigned long order; + + start = tegra->domain->geometry.aperture_start; + end = tegra->domain->geometry.aperture_end; + + gem_start = start; + gem_end = end - CARVEOUT_SZ; + carveout_start = gem_end + 1; + carveout_end = end; + + order = __ffs(tegra->domain->pgsize_bitmap); + init_iova_domain(&tegra->carveout.domain, 1UL << order, + carveout_start >> order); + + tegra->carveout.shift = iova_shift(&tegra->carveout.domain); + tegra->carveout.limit = carveout_end >> tegra->carveout.shift; + + drm_mm_init(&tegra->mm, gem_start, gem_end - gem_start + 1); + mutex_init(&tegra->mm_lock); + + DRM_DEBUG("IOMMU apertures:\n"); + DRM_DEBUG(" GEM: %#llx-%#llx\n", gem_start, gem_end); + DRM_DEBUG(" Carveout: %#llx-%#llx\n", carveout_start, + carveout_end); + } + if (tegra->hub) { err = tegra_display_hub_prepare(tegra->hub); if (err < 0)

-- 2.19.1

Dmitry Osipenko

2:48 p.m.

New subject: [PATCH v3 12/16] drm/tegra: Setup shared IOMMU domain after initialization

01.02.2019 16:28, Thierry Reding пишет:

...

From: Thierry Reding treding@nvidia.com

Move initialization of the shared IOMMU domain after the host1x device has been initialized. At this point all the Tegra DRM clients have been attached to the shared IOMMU domain.

This is important because Tegra186 and later use an ARM SMMU, for which the driver defers setting up the geometry for a domain until a device is attached to it. This is to ensure that the domain is properly set up for a specific ARM SMMU instance, which is unknown at allocation time.

Reviewed-by: Dmitry Osipenko digetx@gmail.com Signed-off-by: Thierry Reding treding@nvidia.com

drivers/gpu/drm/tegra/drm.c | 54 ++++++++++++++++++++----------------- 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index 61dcbd218ffc..271c7a5fc954 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -92,10 +92,6 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags) return -ENOMEM;

if (iommu_present(&platform_bus_type)) {
u64 carveout_start, carveout_end, gem_start, gem_end;
struct iommu_domain_geometry *geometry;
unsigned long order;
tegra->domain = iommu_domain_alloc(&platform_bus_type); if (!tegra->domain) { err = -ENOMEM;
@@ -105,27 +101,6 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags) err = iova_cache_get(); if (err < 0) goto domain;
geometry = &tegra->domain->geometry;
gem_start = geometry->aperture_start;
gem_end = geometry->aperture_end - CARVEOUT_SZ;
carveout_start = gem_end + 1;
carveout_end = geometry->aperture_end;
order = __ffs(tegra->domain->pgsize_bitmap);
init_iova_domain(&tegra->carveout.domain, 1UL << order,
		 carveout_start >> order);
tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
drm_mm_init(&tegra->mm, gem_start, gem_end - gem_start + 1);
mutex_init(&tegra->mm_lock);
DRM_DEBUG("IOMMU apertures:\n");
DRM_DEBUG("  GEM: %#llx-%#llx\n", gem_start, gem_end);
DRM_DEBUG("  Carveout: %#llx-%#llx\n", carveout_start,
	  carveout_end);
}

mutex_init(&tegra->clients_lock);
@@ -159,6 +134,35 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags) if (err < 0) goto fbdev;
if (tegra->domain) {
u64 carveout_start, carveout_end, gem_start, gem_end;
dma_addr_t start, end;
unsigned long order;
start = tegra->domain->geometry.aperture_start;
end = tegra->domain->geometry.aperture_end;
gem_start = start;
gem_end = end - CARVEOUT_SZ;
carveout_start = gem_end + 1;
carveout_end = end;
order = __ffs(tegra->domain->pgsize_bitmap);
init_iova_domain(&tegra->carveout.domain, 1UL << order,
		 carveout_start >> order);
tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
drm_mm_init(&tegra->mm, gem_start, gem_end - gem_start + 1);
mutex_init(&tegra->mm_lock);
DRM_DEBUG("IOMMU apertures:\n");
DRM_DEBUG("  GEM: %#llx-%#llx\n", gem_start, gem_end);
DRM_DEBUG("  Carveout: %#llx-%#llx\n", carveout_start,
	  carveout_end);
}

if (tegra->hub) { err = tegra_display_hub_prepare(tegra->hub); if (err < 0)

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 13/16] drm/tegra: Restrict IOVA space to DMA mask

From: Thierry Reding treding@nvidia.com

To fix this, restrict the IOVA space to the DMA mask of the host1x device. Note that, technically, the IOVA space needs to be restricted to the intersection of the DMA masks for all clients that are attached to the IOMMU domain. In practice using the DMA mask of the host1x device is sufficient because all host1x clients share the same DMA mask.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/drm/tegra/drm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index 271c7a5fc954..0c5f1e6a0446 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -136,11 +136,12 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)

if (tegra->domain) { u64 carveout_start, carveout_end, gem_start, gem_end; + u64 dma_mask = dma_get_mask(&device->dev); dma_addr_t start, end; unsigned long order;

- start = tegra->domain->geometry.aperture_start; - end = tegra->domain->geometry.aperture_end; + start = tegra->domain->geometry.aperture_start & dma_mask; + end = tegra->domain->geometry.aperture_end & dma_mask;

gem_start = start; gem_end = end - CARVEOUT_SZ;

-- 2.19.1

Dmitry Osipenko

2:49 p.m.

New subject: [PATCH v3 13/16] drm/tegra: Restrict IOVA space to DMA mask

01.02.2019 16:28, Thierry Reding пишет:

...

For older Tegra's:

Reviewed-by: Dmitry Osipenko digetx@gmail.com Tested-by: Dmitry Osipenko digetx@gmail.com

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 14/16] drm/tegra: vic: Do not clear driver data

From: Thierry Reding treding@nvidia.com

Upon driver failure, the driver core will take care of clearing the driver data, so there's no need to do so explicitly in the driver.

Reviewed-by: Dmitry Osipenko digetx@gmail.com Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/drm/tegra/vic.c | 1 - 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/tegra/vic.c b/drivers/gpu/drm/tegra/vic.c index 739b894661ab..55a8cc162e9d 100644 --- a/drivers/gpu/drm/tegra/vic.c +++ b/drivers/gpu/drm/tegra/vic.c @@ -412,7 +412,6 @@ static int vic_probe(struct platform_device *pdev) err = host1x_client_register(&vic->client.base); if (err < 0) { dev_err(dev, "failed to register host1x client: %d\n", err); - platform_set_drvdata(pdev, NULL); goto exit_falcon; }

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 15/16] drm/tegra: vic: Support stream ID register programming

From: Thierry Reding treding@nvidia.com

The version of VIC found in Tegra186 and later incorporates improvements with regards to context isolation. As part of those improvements, stream ID registers were added that allow to specify separate stream IDs for the Falcon microcontroller and the VIC memory interface.

While it is possible to also set the stream ID dynamically at runtime to allow userspace contexts to be completely separated, this commit doesn't implement that yet. Instead, the static VIC stream ID is programmed when the Falcon is booted. This ensures that memory accesses by the Falcon or the VIC are properly translated via the SMMU.

Signed-off-by: Thierry Reding treding@nvidia.com --- drivers/gpu/drm/tegra/vic.c | 21 +++++++++++++++++++++ drivers/gpu/drm/tegra/vic.h | 9 +++++++++ 2 files changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/tegra/vic.c b/drivers/gpu/drm/tegra/vic.c index 55a8cc162e9d..aef8c16bfbee 100644 --- a/drivers/gpu/drm/tegra/vic.c +++ b/drivers/gpu/drm/tegra/vic.c @@ -26,6 +26,7 @@ struct vic_config { const char *firmware; unsigned int version; + bool supports_sid; };

struct vic { @@ -105,6 +106,22 @@ static int vic_boot(struct vic *vic) if (vic->booted) return 0;

+ if (vic->config->supports_sid) { + struct iommu_fwspec *spec = dev_iommu_fwspec_get(vic->dev); + u32 value; + + value = TRANSCFG_ATT(1, TRANSCFG_SID_FALCON) | + TRANSCFG_ATT(0, TRANSCFG_SID_HW); + vic_writel(vic, value, VIC_TFBIF_TRANSCFG); + + if (spec->num_ids > 0) { + value = spec->ids[0] & 0xffff; + + vic_writel(vic, value, VIC_THI_STREAMID0); + vic_writel(vic, value, VIC_THI_STREAMID1); + } + } + /* setup clockgating registers */ vic_writel(vic, CG_IDLE_CG_DLY_CNT(4) | CG_IDLE_CG_EN | @@ -314,6 +331,7 @@ static const struct tegra_drm_client_ops vic_ops = { static const struct vic_config vic_t124_config = { .firmware = NVIDIA_TEGRA_124_VIC_FIRMWARE, .version = 0x40, + .supports_sid = false, };

#define NVIDIA_TEGRA_210_VIC_FIRMWARE "nvidia/tegra210/vic04_ucode.bin" @@ -321,6 +339,7 @@ static const struct vic_config vic_t124_config = { static const struct vic_config vic_t210_config = { .firmware = NVIDIA_TEGRA_210_VIC_FIRMWARE, .version = 0x21, + .supports_sid = false, };

#define NVIDIA_TEGRA_186_VIC_FIRMWARE "nvidia/tegra186/vic04_ucode.bin" @@ -328,6 +347,7 @@ static const struct vic_config vic_t210_config = { static const struct vic_config vic_t186_config = { .firmware = NVIDIA_TEGRA_186_VIC_FIRMWARE, .version = 0x18, + .supports_sid = true, };

#define NVIDIA_TEGRA_194_VIC_FIRMWARE "nvidia/tegra194/vic.bin" @@ -335,6 +355,7 @@ static const struct vic_config vic_t186_config = { static const struct vic_config vic_t194_config = { .firmware = NVIDIA_TEGRA_194_VIC_FIRMWARE, .version = 0x19, + .supports_sid = true, };

static const struct of_device_id vic_match[] = { diff --git a/drivers/gpu/drm/tegra/vic.h b/drivers/gpu/drm/tegra/vic.h index 21844817a7e1..017584340dd6 100644 --- a/drivers/gpu/drm/tegra/vic.h +++ b/drivers/gpu/drm/tegra/vic.h @@ -17,11 +17,20 @@

/* VIC registers */

+#define VIC_THI_STREAMID0 0x00000030 +#define VIC_THI_STREAMID1 0x00000034 + #define NV_PVIC_MISC_PRI_VIC_CG 0x000016d0 #define CG_IDLE_CG_DLY_CNT(val) ((val & 0x3f) << 0) #define CG_IDLE_CG_EN (1 << 6) #define CG_WAKEUP_DLY_CNT(val) ((val & 0xf) << 16)

+#define VIC_TFBIF_TRANSCFG 0x00002044 +#define TRANSCFG_ATT(i, v) (((v) & 0x3) << (i * 4)) +#define TRANSCFG_SID_HW 0 +#define TRANSCFG_SID_PHY 1 +#define TRANSCFG_SID_FALCON 2 + /* Firmware offsets */

#define VIC_UCODE_FCE_HEADER_OFFSET (6*4)

-- 2.19.1

Thierry Reding

1:28 p.m.

New subject: [PATCH v3 16/16] arm64: tegra: Enable SMMU for VIC on Tegra186

From: Thierry Reding treding@nvidia.com

Enable address translation for VIC via the SMMU on Tegra186.

Signed-off-by: Thierry Reding treding@nvidia.com --- arch/arm64/boot/dts/nvidia/tegra186.dtsi | 1 + 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/boot/dts/nvidia/tegra186.dtsi b/arch/arm64/boot/dts/nvidia/tegra186.dtsi index 66ea7e7c79f5..67b9f8ed3c9b 100644 --- a/arch/arm64/boot/dts/nvidia/tegra186.dtsi +++ b/arch/arm64/boot/dts/nvidia/tegra186.dtsi @@ -799,6 +799,7 @@ reset-names = "vic";

power-domains = <&bpmp TEGRA186_POWER_DOMAIN_VIC>; + iommus = <&smmu TEGRA186_SID_VIC>; };

dsib: dsi@15400000 {

-- 2.19.1

Dmitry Osipenko

2:46 p.m.

01.02.2019 16:28, Thierry Reding пишет:

...

From: Thierry Reding treding@nvidia.com

Tegra186 and later are different from earlier generations in that they use an ARM SMMU rather than the Tegra SMMU. The ARM SMMU driver behaves slightly differently in that the geometry for IOMMU domains is set only after a device was attached to it. This is to make sure that the SMMU instance that the domain belongs to is known, because each instance can have a different input address space (i.e. geometry).

Work around this by moving all IOVA allocations to a point where the geometry of the domain is properly initialized.

This second version of the series addresses all review comments and adds a number of patches that will actually allow host1x to work with an SMMU enabled on Tegra186. The patches also add programming required to address the full 40 bits of address space.

The third version of the series fixes the 40-bit addressing code by making sure that wide opcodes are always written atomically to the push buffer. Another pair of patches are introduced to fix a long-standing bug where the HOST1X_CHANNEL_DMAEND register wasn't properly programmed and one push buffer memory optimization that avoid wasting almost one full memory page per push buffer.

This supersedes the following patch:

https://patchwork.kernel.org/patch/10775579/

Thierry

Thierry Reding (16): gpu: host1x: Set up stream ID table gpu: host1x: Program the channel stream ID gpu: host1x: Introduce support for wide opcodes gpu: host1x: Support 40-bit addressing gpu: host1x: Use direct DMA with IOMMU API usage gpu: host1x: Restrict IOVA space to DMA mask gpu: host1x: Support 40-bit addressing on Tegra186 gpu: host1x: Use correct semantics for HOST1X_CHANNEL_DMAEND gpu: host1x: Optimize CDMA push buffer memory usage drm/tegra: Store parent pointer in Tegra DRM clients drm/tegra: vic: Load firmware on demand drm/tegra: Setup shared IOMMU domain after initialization drm/tegra: Restrict IOVA space to DMA mask drm/tegra: vic: Do not clear driver data drm/tegra: vic: Support stream ID register programming arm64: tegra: Enable SMMU for VIC on Tegra186

arch/arm64/boot/dts/nvidia/tegra186.dtsi | 1 + drivers/gpu/drm/tegra/drm.c | 57 +++++---- drivers/gpu/drm/tegra/drm.h | 1 + drivers/gpu/drm/tegra/vic.c | 75 ++++++++--- drivers/gpu/drm/tegra/vic.h | 9 ++ drivers/gpu/host1x/cdma.c | 132 ++++++++++++++++++-- drivers/gpu/host1x/cdma.h | 2 + drivers/gpu/host1x/dev.c | 49 +++++++- drivers/gpu/host1x/dev.h | 8 ++ drivers/gpu/host1x/hw/cdma_hw.c | 32 ++++- drivers/gpu/host1x/hw/channel_hw.c | 42 ++++++- drivers/gpu/host1x/hw/host1x06_hardware.h | 6 + drivers/gpu/host1x/hw/host1x07_hardware.h | 6 + drivers/gpu/host1x/hw/hw_host1x06_channel.h | 11 ++ drivers/gpu/host1x/hw/hw_host1x07_channel.h | 11 ++ include/trace/events/host1x.h | 26 ++++ 16 files changed, 404 insertions(+), 64 deletions(-) create mode 100644 drivers/gpu/host1x/hw/hw_host1x06_channel.h create mode 100644 drivers/gpu/host1x/hw/hw_host1x07_channel.h

I gave a test to this series on T20 and T30. Opentegra works, grate tests work, glxgears are spinning.. everything working fine.

2281

Age (days ago)

2281

Last active (days ago)

dri-devel@lists.freedesktop.org

29 comments

2 participants

tags (0)

participants (2)

Dmitry Osipenko
Thierry Reding