[RFC 0/4] drm/msm: GPU crash state

List overview All Threads
Download

newer

older

[Bug 104216] Firefox quirks (black...

[PATCH v2] drm: clean up comments...

Jordan Crouse

5 Jan 2018 5 Jan '18

6 p.m.

This is a request for comment on code to store and dump a GPU state a hang with inspiration from the very good i915 GPU error state and the binary GPU snapshot in the downstream kernel.

The goal is to store and provide enough information to debug software and hardware issues on the Adreno hardware in a semi human-readable format that can also be parsed by scripts.

The goal for this request for comment is to get some consensus about the format and work through some of the technical issues.

The first patch is a bit of a standalone - it finally consolidates the 'show' and 'dump' code (where show goes to a seq_file and 'dump' goes to the kernel log) through judicious use of drm_printer.

The next patch captures the ringbuffer and register information the same way that 'show' does - most of the code is generic but eventually will get more platform specific as we add new chunks of data (especially for a5xx and beyond).

The third patch converts the existing show function to use the state instead of reading the data directly. This gets a little big ugly because due to the first patch, if we don't capture a state here then nothing gets printed out.

The final patch prints the crash state through its own debugfs node. Only one crash state is captured at a time and it persists until the state is cleared by a write to the debugfs node.

After we mostly agree on the behavior and design of the code the next step will be to pull over more state for 5xx including ringbuffer contents, active commands and buffers, shader caches and other debug information.

Jordan Crouse (4): drm/msm: gpu: Use drm_printer to consolidate the show/dump code drm/msm: gpu: Capture the state of the GPU drm/msm: gpu: Convert the GPU show functions to use the GPU state drm/msm: gpu: Capture the GPU state on a GPU hang

drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 43 +++++----- drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 39 ++++----- drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 44 +++++----- drivers/gpu/drm/msm/adreno/adreno_gpu.c | 138 +++++++++++++++++++------------- drivers/gpu/drm/msm/adreno/adreno_gpu.h | 10 ++- drivers/gpu/drm/msm/msm_debugfs.c | 78 ++++++++++++++++-- drivers/gpu/drm/msm/msm_gpu.h | 64 ++++++++++++++- 7 files changed, 296 insertions(+), 120 deletions(-)

-- 1.9.1

Show replies by date

Jordan Crouse

5 Jan 5 Jan

6 p.m.

New subject: [PATCH 1/4] drm/msm: gpu: Use drm_printer to consolidate the show/dump code

Use drm_printer to abstract the show/dump debug code so we can use the same code for both the kernel log and seq_file paths. Maintaining the existing dump format mandates that we break the single show function into two but thats a small price to pay for removing a bunch of code and consolidating the path.

Signed-off-by: Jordan Crouse jcrouse@codeaurora.org --- drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 18 +++------ drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 15 ++----- drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 18 +++------ drivers/gpu/drm/msm/adreno/adreno_gpu.c | 70 +++++++++------------------------ drivers/gpu/drm/msm/adreno/adreno_gpu.h | 2 + 5 files changed, 34 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c index 4baef27..29bb15b 100644 --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c @@ -39,7 +39,6 @@

extern bool hang_debug;

-static void a3xx_dump(struct msm_gpu *gpu); static bool a3xx_idle(struct msm_gpu *gpu);

static bool a3xx_me_init(struct msm_gpu *gpu) @@ -300,18 +299,18 @@ static int a3xx_hw_init(struct msm_gpu *gpu)

static void a3xx_recover(struct msm_gpu *gpu) { + struct drm_printer p = drm_info_printer(gpu->dev->dev); int i;

- adreno_dump_info(gpu); + adreno_show_info(gpu, &p);

- for (i = 0; i < 8; i++) { - printk("CP_SCRATCH_REG%d: %u\n", i, + for (i = 0; i < 8; i++) + drm_printf(&p, "CP_SCRATCH_REG%d: %u\n", i, gpu_read(gpu, REG_AXXX_CP_SCRATCH_REG0 + i)); - }

/* dump registers before resetting gpu, if enabled: */ if (hang_debug) - a3xx_dump(gpu); + adreno_show_regs(gpu, &p);

gpu_write(gpu, REG_A3XX_RBBM_SW_RESET_CMD, 1); gpu_read(gpu, REG_A3XX_RBBM_SW_RESET_CMD); @@ -419,13 +418,6 @@ static void a3xx_show(struct msm_gpu *gpu, struct seq_file *m) } #endif

-/* would be nice to not have to duplicate the _show() stuff with printk(): */ -static void a3xx_dump(struct msm_gpu *gpu) -{ - printk("status: %08x\n", - gpu_read(gpu, REG_A3XX_RBBM_STATUS)); - adreno_dump(gpu); -} /* Register offset defines for A3XX */ static const unsigned int a3xx_register_offsets[REG_ADRENO_REGISTER_MAX] = { REG_ADRENO_DEFINE(REG_ADRENO_CP_RB_BASE, REG_AXXX_CP_RB_BASE), diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c index 8199a4b..d9742b1 100644 --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c @@ -30,7 +30,6 @@ A4XX_INT0_UCHE_OOB_ACCESS)

extern bool hang_debug; -static void a4xx_dump(struct msm_gpu *gpu); static bool a4xx_idle(struct msm_gpu *gpu);

/* @@ -298,18 +297,19 @@ static int a4xx_hw_init(struct msm_gpu *gpu)

static void a4xx_recover(struct msm_gpu *gpu) { + struct drm_printer p = drm_info_printer(gpu->dev->dev); int i;

- adreno_dump_info(gpu); + adreno_show_info(gpu, &p);

for (i = 0; i < 8; i++) { - printk("CP_SCRATCH_REG%d: %u\n", i, + drm_printf(&p, "CP_SCRATCH_REG%d: %u\n", i, gpu_read(gpu, REG_AXXX_CP_SCRATCH_REG0 + i)); }

/* dump registers before resetting gpu, if enabled: */ if (hang_debug) - a4xx_dump(gpu); + adreno_show_regs(gpu, &p);

gpu_write(gpu, REG_A4XX_RBBM_SW_RESET_CMD, 1); gpu_read(gpu, REG_A4XX_RBBM_SW_RESET_CMD); @@ -475,13 +475,6 @@ static void a4xx_show(struct msm_gpu *gpu, struct seq_file *m) REG_ADRENO_DEFINE(REG_ADRENO_CP_RB_CNTL, REG_A4XX_CP_RB_CNTL), };

-static void a4xx_dump(struct msm_gpu *gpu) -{ - printk("status: %08x\n", - gpu_read(gpu, REG_A4XX_RBBM_STATUS)); - adreno_dump(gpu); -} - static int a4xx_pm_resume(struct msm_gpu *gpu) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); int ret; diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c index a1f4eee..deec597 100644 --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c @@ -22,7 +22,6 @@ #include "a5xx_gpu.h"

extern bool hang_debug; -static void a5xx_dump(struct msm_gpu *gpu);

#define GPU_PAS_ID 13

@@ -755,17 +754,17 @@ static int a5xx_hw_init(struct msm_gpu *gpu)

static void a5xx_recover(struct msm_gpu *gpu) { + struct drm_printer p = drm_info_printer(gpu->dev->dev); int i;

- adreno_dump_info(gpu); + adreno_show_info(gpu, &p);

- for (i = 0; i < 8; i++) { - printk("CP_SCRATCH_REG%d: %u\n", i, + for (i = 0; i < 8; i++) + drm_printf(&p, "CP_SCRATCH_REG%d: %u\n", i, gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(i))); - }

if (hang_debug) - a5xx_dump(gpu); + adreno_show_regs(gpu, &p);

gpu_write(gpu, REG_A5XX_RBBM_SW_RESET_CMD, 1); gpu_read(gpu, REG_A5XX_RBBM_SW_RESET_CMD); @@ -1073,13 +1072,6 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu) 0xB9A0, 0xB9BF, ~0 };

-static void a5xx_dump(struct msm_gpu *gpu) -{ - dev_info(gpu->dev->dev, "status: %08x\n", - gpu_read(gpu, REG_A5XX_RBBM_STATUS)); - adreno_dump(gpu); -} - static int a5xx_pm_resume(struct msm_gpu *gpu) { int ret; diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 2f0610f..60b3c99 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -362,13 +362,12 @@ bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring) return false; }

-#ifdef CONFIG_DEBUG_FS -void adreno_show(struct msm_gpu *gpu, struct seq_file *m) +void adreno_show_info(struct msm_gpu *gpu, struct drm_printer *p) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); int i;

- seq_printf(m, "revision: %d (%d.%d.%d.%d)\n", + drm_printf(p, "revision: %d (%d.%d.%d.%d)\n", adreno_gpu->info->revn, adreno_gpu->rev.core, adreno_gpu->rev.major, adreno_gpu->rev.minor, adreno_gpu->rev.patchid); @@ -376,65 +375,22 @@ void adreno_show(struct msm_gpu *gpu, struct seq_file *m) for (i = 0; i < gpu->nr_rings; i++) { struct msm_ringbuffer *ring = gpu->rb[i];

- seq_printf(m, "rb %d: fence: %d/%d\n", i, + drm_printf(p, "rb %d: fence: %d/%d\n", i, ring->memptrs->fence, ring->seqno);

- seq_printf(m, " rptr: %d\n", + drm_printf(p, " rptr: %d\n", get_rptr(adreno_gpu, ring)); - seq_printf(m, "rb wptr: %d\n", get_wptr(ring)); - } - - /* dump these out in a form that can be parsed by demsm: */ - seq_printf(m, "IO:region %s 00000000 00020000\n", gpu->name); - for (i = 0; adreno_gpu->registers[i] != ~0; i += 2) { - uint32_t start = adreno_gpu->registers[i]; - uint32_t end = adreno_gpu->registers[i+1]; - uint32_t addr; - - for (addr = start; addr <= end; addr++) { - uint32_t val = gpu_read(gpu, addr); - seq_printf(m, "IO:R %08x %08x\n", addr<<2, val); - } + drm_printf(p, "rb wptr: %d\n", get_wptr(ring)); } } -#endif

-/* Dump common gpu status and scratch registers on any hang, to make - * the hangcheck logs more useful. The scratch registers seem always - * safe to read when GPU has hung (unlike some other regs, depending - * on how the GPU hung), and they are useful to match up to cmdstream - * dumps when debugging hangs: - */ -void adreno_dump_info(struct msm_gpu *gpu) -{ - struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); - int i; - - printk("revision: %d (%d.%d.%d.%d)\n", - adreno_gpu->info->revn, adreno_gpu->rev.core, - adreno_gpu->rev.major, adreno_gpu->rev.minor, - adreno_gpu->rev.patchid); - - for (i = 0; i < gpu->nr_rings; i++) { - struct msm_ringbuffer *ring = gpu->rb[i]; - - printk("rb %d: fence: %d/%d\n", i, - ring->memptrs->fence, - ring->seqno); - - printk("rptr: %d\n", get_rptr(adreno_gpu, ring)); - printk("rb wptr: %d\n", get_wptr(ring)); - } -} - -/* would be nice to not have to duplicate the _show() stuff with printk(): */ -void adreno_dump(struct msm_gpu *gpu) +void adreno_show_regs(struct msm_gpu *gpu, struct drm_printer *p) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); int i;

/* dump these out in a form that can be parsed by demsm: */ - printk("IO:region %s 00000000 00020000\n", gpu->name); + drm_printf(p, "IO:region %s 00000000 00020000\n", gpu->name); for (i = 0; adreno_gpu->registers[i] != ~0; i += 2) { uint32_t start = adreno_gpu->registers[i]; uint32_t end = adreno_gpu->registers[i+1]; @@ -442,11 +398,21 @@ void adreno_dump(struct msm_gpu *gpu)

for (addr = start; addr <= end; addr++) { uint32_t val = gpu_read(gpu, addr); - printk("IO:R %08x %08x\n", addr<<2, val); + drm_printf(p, "IO:R %08x %08x\n", addr<<2, val); } } }

+#ifdef CONFIG_DEBUG_FS +void adreno_show(struct msm_gpu *gpu, struct seq_file *m) +{ + struct drm_printer p = drm_seq_file_printer(m); + + adreno_show_info(gpu, &p); + adreno_show_regs(gpu, &p); +} +#endif + static uint32_t ring_freewords(struct msm_ringbuffer *ring) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(ring->gpu); diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h index 28e3de6..3443071 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h @@ -223,6 +223,8 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev, int nr_rings); void adreno_gpu_cleanup(struct adreno_gpu *gpu);

+void adreno_show_info(struct msm_gpu *gpu, struct drm_printer *p); +void adreno_show_regs(struct msm_gpu *gpu, struct drm_printer *p);

/* ringbuffer helpers (the parts that are adreno specific) */

-- 1.9.1

Jordan Crouse

6 p.m.

New subject: [PATCH 2/4] drm/msm: gpu: Capture the state of the GPU

Add the infrastructure to capture the state current state of the GPU and store it in memory. This is useful for storing the state of a hung GPU so it can be dumped later.

For now grab the same basic ringbuffer information and registers that are provided by the debugfs 'gpu' node but obviously this can be extended to capture a much larger set of GPU information.

Signed-off-by: Jordan Crouse jcrouse@codeaurora.org --- drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 14 +++++++++ drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 14 +++++++++ drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 22 ++++++++++++++ drivers/gpu/drm/msm/adreno/adreno_gpu.c | 54 +++++++++++++++++++++++++++++++++ drivers/gpu/drm/msm/adreno/adreno_gpu.h | 3 ++ drivers/gpu/drm/msm/msm_gpu.h | 19 ++++++++++++ 6 files changed, 126 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c index 29bb15b..5427748 100644 --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c @@ -418,6 +418,18 @@ static void a3xx_show(struct msm_gpu *gpu, struct seq_file *m) } #endif

+static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu) +{ + struct msm_gpu_state *state = adreno_gpu_state_get(gpu); + + if (IS_ERR(state)) + return state; + + state->rbbm_status = gpu_read(gpu, REG_A3XX_RBBM_STATUS); + + return state; +} + /* Register offset defines for A3XX */ static const unsigned int a3xx_register_offsets[REG_ADRENO_REGISTER_MAX] = { REG_ADRENO_DEFINE(REG_ADRENO_CP_RB_BASE, REG_AXXX_CP_RB_BASE), @@ -444,6 +456,8 @@ static void a3xx_show(struct msm_gpu *gpu, struct seq_file *m) #ifdef CONFIG_DEBUG_FS .show = a3xx_show, #endif + .gpu_state_get = a3xx_gpu_state_get, + .gpu_state_put = adreno_gpu_state_put, }, };

diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c index d9742b1..4c66c54 100644 --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c @@ -464,6 +464,18 @@ static void a4xx_show(struct msm_gpu *gpu, struct seq_file *m) } #endif

+static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu) +{ + struct msm_gpu_state *state = adreno_gpu_state_get(gpu); + + if (IS_ERR(state)) + return state; + + state->rbbm_status = gpu_read(gpu, REG_A4XX_RBBM_STATUS); + + return state; +} + /* Register offset defines for A4XX, in order of enum adreno_regs */ static const unsigned int a4xx_register_offsets[REG_ADRENO_REGISTER_MAX] = { REG_ADRENO_DEFINE(REG_ADRENO_CP_RB_BASE, REG_A4XX_CP_RB_BASE), @@ -533,6 +545,8 @@ static int a4xx_get_timestamp(struct msm_gpu *gpu, uint64_t *value) #ifdef CONFIG_DEBUG_FS .show = a4xx_show, #endif + .gpu_state_get = a4xx_gpu_state_get, + .gpu_state_put = adreno_gpu_state_put, }, .get_timestamp = a4xx_get_timestamp, }; diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c index deec597..05046e4 100644 --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c @@ -1133,6 +1133,26 @@ static int a5xx_get_timestamp(struct msm_gpu *gpu, uint64_t *value) return 0; }

+static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu) +{ + struct msm_gpu_state *state; + + /* + * Temporarily disable hardware clock gating before going into + * adreno_show to avoid issues while reading the registers + */ + a5xx_set_hwcg(gpu, false); + + state = adreno_gpu_state_get(gpu); + + if (!IS_ERR(state)) + state->rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS); + + a5xx_set_hwcg(gpu, true); + + return state; +} + #ifdef CONFIG_DEBUG_FS static void a5xx_show(struct msm_gpu *gpu, struct seq_file *m) { @@ -1172,6 +1192,8 @@ static struct msm_ringbuffer *a5xx_active_ring(struct msm_gpu *gpu) #ifdef CONFIG_DEBUG_FS .show = a5xx_show, #endif + .gpu_state_get = a5xx_gpu_state_get, + .gpu_state_put = adreno_gpu_state_put, }, .get_timestamp = a5xx_get_timestamp, }; diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 60b3c99..64da461 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -362,6 +362,60 @@ bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring) return false; }

+struct msm_gpu_state *adreno_gpu_state_get(struct msm_gpu *gpu) +{ + struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); + struct msm_gpu_state *state; + int i, count = 0; + + state = kzalloc(sizeof(*state), GFP_KERNEL); + if (!state) + return ERR_PTR(-ENOMEM); + + do_gettimeofday(&state->time); + + for (i = 0; i < gpu->nr_rings; i++) { + state->ring[i].fence = gpu->rb[i]->memptrs->fence; + state->ring[i].seqno = gpu->rb[i]->seqno; + state->ring[i].rptr = get_rptr(adreno_gpu, gpu->rb[i]); + state->ring[i].wptr = get_wptr(gpu->rb[i]); + } + + /* Count the number of registers */ + for (i = 0; adreno_gpu->registers[i] != ~0; i += 2) + count += adreno_gpu->registers[i + 1] - + adreno_gpu->registers[i] + 1; + + state->registers = kcalloc(count * 2, sizeof(u32), GFP_KERNEL); + if (state->registers) { + int pos = 0; + + for (i = 0; adreno_gpu->registers[i] != ~0; i += 2) { + uint32_t start = adreno_gpu->registers[i]; + uint32_t end = adreno_gpu->registers[i+1]; + uint32_t addr; + + for (addr = start; addr <= end; addr++) { + state->registers[pos++] = addr; + state->registers[pos++] = gpu_read(gpu, addr); + } + } + + state->nr_registers = count; + } + + return state; +} + +void adreno_gpu_state_put(struct msm_gpu_state *state) +{ + if (IS_ERR_OR_NULL(state)) + return; + + kfree(state->registers); + kfree(state); +} + void adreno_show_info(struct msm_gpu *gpu, struct drm_printer *p) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h index 3443071..b7c95ee 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h @@ -226,6 +226,9 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev, void adreno_show_info(struct msm_gpu *gpu, struct drm_printer *p); void adreno_show_regs(struct msm_gpu *gpu, struct drm_printer *p);

+struct msm_gpu_state *adreno_gpu_state_get(struct msm_gpu *gpu); +void adreno_gpu_state_put(struct msm_gpu_state *state); + /* ringbuffer helpers (the parts that are adreno specific) */

static inline void diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h index e113d64..d781489 100644 --- a/drivers/gpu/drm/msm/msm_gpu.h +++ b/drivers/gpu/drm/msm/msm_gpu.h @@ -27,6 +27,7 @@

struct msm_gem_submit; struct msm_gpu_perfcntr; +struct msm_gpu_state;

struct msm_gpu_config { const char *ioname; @@ -66,6 +67,8 @@ struct msm_gpu_funcs { /* show GPU status in debugfs: */ void (*show)(struct msm_gpu *gpu, struct seq_file *m); #endif + struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu); + void (*gpu_state_put)(struct msm_gpu_state *state); };

struct msm_gpu { @@ -171,6 +174,22 @@ struct msm_gpu_submitqueue { struct kref ref; };

+struct msm_gpu_state { + struct timeval time; + + struct { + u32 fence; + u32 seqno; + u32 rptr; + u32 wptr; + } ring[MSM_GPU_MAX_RINGS]; + + int nr_registers; + u32 *registers; + + u32 rbbm_status; +}; + static inline void gpu_write(struct msm_gpu *gpu, u32 reg, u32 data) { msm_writel(data, gpu->mmio + (reg << 2));

-- 1.9.1

Jordan Crouse

6 p.m.

New subject: [PATCH 3/4] drm/msm: gpu: Convert the GPU show functions to use the GPU state

Convert the existing GPU show functions to use the GPU state to dump the information rather than reading it directly from the hardware. This will require an additional step to capture the state before dumping it for the existing nodes but it will greatly facilitate reusing the same code for dumping a previously captured state from a GPU hang.

Signed-off-by: Jordan Crouse jcrouse@codeaurora.org --- drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 20 ++++++--------- drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 21 ++++++---------- drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 27 ++++++-------------- drivers/gpu/drm/msm/adreno/adreno_gpu.c | 44 ++++++++++++++++----------------- drivers/gpu/drm/msm/adreno/adreno_gpu.h | 9 ++++--- drivers/gpu/drm/msm/msm_debugfs.c | 21 +++++++++++----- drivers/gpu/drm/msm/msm_gpu.h | 3 ++- 7 files changed, 69 insertions(+), 76 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c index 5427748..8a7d56ec 100644 --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c @@ -300,9 +300,12 @@ static int a3xx_hw_init(struct msm_gpu *gpu) static void a3xx_recover(struct msm_gpu *gpu) { struct drm_printer p = drm_info_printer(gpu->dev->dev); + struct msm_gpu_state *state; int i;

- adreno_show_info(gpu, &p); + state = gpu->funcs->gpu_state_get(gpu); + + adreno_show_info(gpu, state, &p);

for (i = 0; i < 8; i++) drm_printf(&p, "CP_SCRATCH_REG%d: %u\n", i, @@ -310,12 +313,14 @@ static void a3xx_recover(struct msm_gpu *gpu)

/* dump registers before resetting gpu, if enabled: */ if (hang_debug) - adreno_show_regs(gpu, &p); + adreno_show_regs(gpu, state, &p);

gpu_write(gpu, REG_A3XX_RBBM_SW_RESET_CMD, 1); gpu_read(gpu, REG_A3XX_RBBM_SW_RESET_CMD); gpu_write(gpu, REG_A3XX_RBBM_SW_RESET_CMD, 0); adreno_recover(gpu); + + gpu->funcs->gpu_state_put(state); }

static void a3xx_destroy(struct msm_gpu *gpu) @@ -409,15 +414,6 @@ static irqreturn_t a3xx_irq(struct msm_gpu *gpu) ~0 /* sentinel */ };

-#ifdef CONFIG_DEBUG_FS -static void a3xx_show(struct msm_gpu *gpu, struct seq_file *m) -{ - seq_printf(m, "status: %08x\n", - gpu_read(gpu, REG_A3XX_RBBM_STATUS)); - adreno_show(gpu, m); -} -#endif - static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu) { struct msm_gpu_state *state = adreno_gpu_state_get(gpu); @@ -454,7 +450,7 @@ static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu) .irq = a3xx_irq, .destroy = a3xx_destroy, #ifdef CONFIG_DEBUG_FS - .show = a3xx_show, + .show = adreno_show, #endif .gpu_state_get = a3xx_gpu_state_get, .gpu_state_put = adreno_gpu_state_put, diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c index 4c66c54..e64c7fc 100644 --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c @@ -298,9 +298,12 @@ static int a4xx_hw_init(struct msm_gpu *gpu) static void a4xx_recover(struct msm_gpu *gpu) { struct drm_printer p = drm_info_printer(gpu->dev->dev); + struct msm_gpu_state *state; int i;

- adreno_show_info(gpu, &p); + state = gpu->funcs->gpu_state_get(gpu); + + adreno_show_info(gpu, state, &p);

for (i = 0; i < 8; i++) { drm_printf(&p, "CP_SCRATCH_REG%d: %u\n", i, @@ -309,12 +312,14 @@ static void a4xx_recover(struct msm_gpu *gpu)

/* dump registers before resetting gpu, if enabled: */ if (hang_debug) - adreno_show_regs(gpu, &p); + adreno_show_regs(gpu, state, &p);

gpu_write(gpu, REG_A4XX_RBBM_SW_RESET_CMD, 1); gpu_read(gpu, REG_A4XX_RBBM_SW_RESET_CMD); gpu_write(gpu, REG_A4XX_RBBM_SW_RESET_CMD, 0); adreno_recover(gpu); + + gpu->funcs->gpu_state_put(state); }

static void a4xx_destroy(struct msm_gpu *gpu) @@ -454,16 +459,6 @@ static irqreturn_t a4xx_irq(struct msm_gpu *gpu) ~0 /* sentinel */ };

-#ifdef CONFIG_DEBUG_FS -static void a4xx_show(struct msm_gpu *gpu, struct seq_file *m) -{ - seq_printf(m, "status: %08x\n", - gpu_read(gpu, REG_A4XX_RBBM_STATUS)); - adreno_show(gpu, m); - -} -#endif - static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu) { struct msm_gpu_state *state = adreno_gpu_state_get(gpu); @@ -543,7 +538,7 @@ static int a4xx_get_timestamp(struct msm_gpu *gpu, uint64_t *value) .irq = a4xx_irq, .destroy = a4xx_destroy, #ifdef CONFIG_DEBUG_FS - .show = a4xx_show, + .show = adreno_show, #endif .gpu_state_get = a4xx_gpu_state_get, .gpu_state_put = adreno_gpu_state_put, diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c index 05046e4..6747b7b 100644 --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c @@ -755,21 +755,26 @@ static int a5xx_hw_init(struct msm_gpu *gpu) static void a5xx_recover(struct msm_gpu *gpu) { struct drm_printer p = drm_info_printer(gpu->dev->dev); + struct msm_gpu_state *state; int i;

- adreno_show_info(gpu, &p); + state = gpu->funcs->gpu_state_get(gpu); + + adreno_show_info(gpu, state, &p);

for (i = 0; i < 8; i++) drm_printf(&p, "CP_SCRATCH_REG%d: %u\n", i, gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(i)));

if (hang_debug) - adreno_show_regs(gpu, &p); + adreno_show_regs(gpu, state, &p);

gpu_write(gpu, REG_A5XX_RBBM_SW_RESET_CMD, 1); gpu_read(gpu, REG_A5XX_RBBM_SW_RESET_CMD); gpu_write(gpu, REG_A5XX_RBBM_SW_RESET_CMD, 0); adreno_recover(gpu); + + gpu->funcs->gpu_state_put(state); }

static void a5xx_destroy(struct msm_gpu *gpu) @@ -1153,22 +1158,6 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu) return state; }

-#ifdef CONFIG_DEBUG_FS -static void a5xx_show(struct msm_gpu *gpu, struct seq_file *m) -{ - seq_printf(m, "status: %08x\n", - gpu_read(gpu, REG_A5XX_RBBM_STATUS)); - - /* - * Temporarily disable hardware clock gating before going into - * adreno_show to avoid issues while reading the registers - */ - a5xx_set_hwcg(gpu, false); - adreno_show(gpu, m); - a5xx_set_hwcg(gpu, true); -} -#endif - static struct msm_ringbuffer *a5xx_active_ring(struct msm_gpu *gpu) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); @@ -1190,7 +1179,7 @@ static struct msm_ringbuffer *a5xx_active_ring(struct msm_gpu *gpu) .irq = a5xx_irq, .destroy = a5xx_destroy, #ifdef CONFIG_DEBUG_FS - .show = a5xx_show, + .show = adreno_show, #endif .gpu_state_get = a5xx_gpu_state_get, .gpu_state_put = adreno_gpu_state_put, diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 64da461..ba1b912 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -416,54 +416,54 @@ void adreno_gpu_state_put(struct msm_gpu_state *state) kfree(state); }

-void adreno_show_info(struct msm_gpu *gpu, struct drm_printer *p) +void adreno_show_info(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct drm_printer *p) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); int i;

+ if (IS_ERR_OR_NULL(state)) + return; + + drm_printf(p, "status: %08x\n", state->rbbm_status); drm_printf(p, "revision: %d (%d.%d.%d.%d)\n", adreno_gpu->info->revn, adreno_gpu->rev.core, adreno_gpu->rev.major, adreno_gpu->rev.minor, adreno_gpu->rev.patchid);

for (i = 0; i < gpu->nr_rings; i++) { - struct msm_ringbuffer *ring = gpu->rb[i]; - drm_printf(p, "rb %d: fence: %d/%d\n", i, - ring->memptrs->fence, ring->seqno); + state->ring[i].fence, state->ring[i].seqno);

- drm_printf(p, " rptr: %d\n", - get_rptr(adreno_gpu, ring)); - drm_printf(p, "rb wptr: %d\n", get_wptr(ring)); + drm_printf(p, " rptr: %d\n", state->ring[i].rptr); + drm_printf(p, "rb wptr: %d\n", state->ring[i].wptr); } }

-void adreno_show_regs(struct msm_gpu *gpu, struct drm_printer *p) +void adreno_show_regs(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct drm_printer *p) { - struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); int i;

+ if (IS_ERR_OR_NULL(state)) + return; + /* dump these out in a form that can be parsed by demsm: */ drm_printf(p, "IO:region %s 00000000 00020000\n", gpu->name); - for (i = 0; adreno_gpu->registers[i] != ~0; i += 2) { - uint32_t start = adreno_gpu->registers[i]; - uint32_t end = adreno_gpu->registers[i+1]; - uint32_t addr; - - for (addr = start; addr <= end; addr++) { - uint32_t val = gpu_read(gpu, addr); - drm_printf(p, "IO:R %08x %08x\n", addr<<2, val); - } - } + for (i = 0; i < state->nr_registers; i++) + drm_printf(p, "IO:R %08x %08x\n", + state->registers[i * 2] << 2, + state->registers[(i * 2) + 1]); }

#ifdef CONFIG_DEBUG_FS -void adreno_show(struct msm_gpu *gpu, struct seq_file *m) +void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct seq_file *m) { struct drm_printer p = drm_seq_file_printer(m);

- adreno_show_info(gpu, &p); - adreno_show_regs(gpu, &p); + adreno_show_info(gpu, state, &p); + adreno_show_regs(gpu, state, &p); } #endif

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h index b7c95ee..4542b6b 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h @@ -211,7 +211,8 @@ void adreno_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit, void adreno_flush(struct msm_gpu *gpu, struct msm_ringbuffer *ring); bool adreno_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring); #ifdef CONFIG_DEBUG_FS -void adreno_show(struct msm_gpu *gpu, struct seq_file *m); +void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct seq_file *m); #endif void adreno_dump_info(struct msm_gpu *gpu); void adreno_dump(struct msm_gpu *gpu); @@ -223,8 +224,10 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev, int nr_rings); void adreno_gpu_cleanup(struct adreno_gpu *gpu);

-void adreno_show_info(struct msm_gpu *gpu, struct drm_printer *p); -void adreno_show_regs(struct msm_gpu *gpu, struct drm_printer *p); +void adreno_show_info(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct drm_printer *p); +void adreno_show_regs(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct drm_printer *p);

struct msm_gpu_state *adreno_gpu_state_get(struct msm_gpu *gpu); void adreno_gpu_state_put(struct msm_gpu_state *state); diff --git a/drivers/gpu/drm/msm/msm_debugfs.c b/drivers/gpu/drm/msm/msm_debugfs.c index 1855182..89ee74b 100644 --- a/drivers/gpu/drm/msm/msm_debugfs.c +++ b/drivers/gpu/drm/msm/msm_debugfs.c @@ -25,13 +25,22 @@ static int msm_gpu_show(struct drm_device *dev, struct seq_file *m) { struct msm_drm_private *priv = dev->dev_private; struct msm_gpu *gpu = priv->gpu; + struct msm_gpu_state *state;

- if (gpu) { - seq_printf(m, "%s Status:\n", gpu->name); - pm_runtime_get_sync(&gpu->pdev->dev); - gpu->funcs->show(gpu, m); - pm_runtime_put_sync(&gpu->pdev->dev); - } + if (!gpu) + return 0; + + pm_runtime_get_sync(&gpu->pdev->dev); + state = gpu->funcs->gpu_state_get(gpu); + pm_runtime_put_sync(&gpu->pdev->dev); + + if (IS_ERR(state)) + return PTR_ERR(state); + + seq_printf(m, "%s Status:\n", gpu->name); + gpu->funcs->show(gpu, state, m); + + gpu->funcs->gpu_state_put(state);

return 0; } diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h index d781489..cff52ca 100644 --- a/drivers/gpu/drm/msm/msm_gpu.h +++ b/drivers/gpu/drm/msm/msm_gpu.h @@ -65,7 +65,8 @@ struct msm_gpu_funcs { void (*destroy)(struct msm_gpu *gpu); #ifdef CONFIG_DEBUG_FS /* show GPU status in debugfs: */ - void (*show)(struct msm_gpu *gpu, struct seq_file *m); + void (*show)(struct msm_gpu *gpu, struct msm_gpu_state *state, + struct seq_file *m); #endif struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu); void (*gpu_state_put)(struct msm_gpu_state *state);

-- 1.9.1

Jordan Crouse

6 p.m.

New subject: [PATCH 4/4] drm/msm: gpu: Capture the GPU state on a GPU hang

Capture the GPU state on a GPU hang and store it for later playback using the 'crash' node in the debugfs directory. Only one crash state is stored at a time on the assumption that the first hang is usually the most interesting. The existing crash state can be cleared by writing to the debugfs node and then a new one will be captured on the next hang.

Signed-off-by: Jordan Crouse jcrouse@codeaurora.org --- drivers/gpu/drm/msm/adreno/a3xx_gpu.c | 1 + drivers/gpu/drm/msm/adreno/a4xx_gpu.c | 1 + drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 1 + drivers/gpu/drm/msm/adreno/adreno_gpu.c | 16 +++++++-- drivers/gpu/drm/msm/adreno/adreno_gpu.h | 2 +- drivers/gpu/drm/msm/msm_debugfs.c | 57 +++++++++++++++++++++++++++++++++ drivers/gpu/drm/msm/msm_gpu.h | 44 ++++++++++++++++++++++++- 7 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c index 8a7d56ec..be65b4e 100644 --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c @@ -320,6 +320,7 @@ static void a3xx_recover(struct msm_gpu *gpu) gpu_write(gpu, REG_A3XX_RBBM_SW_RESET_CMD, 0); adreno_recover(gpu);

+ msm_gpu_crashstate_set(gpu, state); gpu->funcs->gpu_state_put(state); }

diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c index e64c7fc..943e13f 100644 --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c @@ -319,6 +319,7 @@ static void a4xx_recover(struct msm_gpu *gpu) gpu_write(gpu, REG_A4XX_RBBM_SW_RESET_CMD, 0); adreno_recover(gpu);

+ msm_gpu_crashstate_set(gpu, state); gpu->funcs->gpu_state_put(state); }

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c index 6747b7b..1e32c2e 100644 --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c @@ -774,6 +774,7 @@ static void a5xx_recover(struct msm_gpu *gpu) gpu_write(gpu, REG_A5XX_RBBM_SW_RESET_CMD, 0); adreno_recover(gpu);

+ msm_gpu_crashstate_set(gpu, state); gpu->funcs->gpu_state_put(state); }

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index ba1b912..e1785c2 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -372,6 +372,8 @@ struct msm_gpu_state *adreno_gpu_state_get(struct msm_gpu *gpu) if (!state) return ERR_PTR(-ENOMEM);

+ kref_init(&state->ref); + do_gettimeofday(&state->time);

for (i = 0; i < gpu->nr_rings; i++) { @@ -407,15 +409,23 @@ struct msm_gpu_state *adreno_gpu_state_get(struct msm_gpu *gpu) return state; }

-void adreno_gpu_state_put(struct msm_gpu_state *state) +static void adreno_gpu_state_destroy(struct kref *kref) { - if (IS_ERR_OR_NULL(state)) - return; + struct msm_gpu_state *state = container_of(kref, + struct msm_gpu_state, ref);

kfree(state->registers); kfree(state); }

+int adreno_gpu_state_put(struct msm_gpu_state *state) +{ + if (IS_ERR_OR_NULL(state)) + return 1; + + return kref_put(&state->ref, adreno_gpu_state_destroy); +} + void adreno_show_info(struct msm_gpu *gpu, struct msm_gpu_state *state, struct drm_printer *p) { diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h index 4542b6b..e304a3e 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h @@ -230,7 +230,7 @@ void adreno_show_regs(struct msm_gpu *gpu, struct msm_gpu_state *state, struct drm_printer *p);

struct msm_gpu_state *adreno_gpu_state_get(struct msm_gpu *gpu); -void adreno_gpu_state_put(struct msm_gpu_state *state); +int adreno_gpu_state_put(struct msm_gpu_state *state);

/* ringbuffer helpers (the parts that are adreno specific) */

diff --git a/drivers/gpu/drm/msm/msm_debugfs.c b/drivers/gpu/drm/msm/msm_debugfs.c index 89ee74b..1bde88d 100644 --- a/drivers/gpu/drm/msm/msm_debugfs.c +++ b/drivers/gpu/drm/msm/msm_debugfs.c @@ -16,11 +16,65 @@ */

#ifdef CONFIG_DEBUG_FS + +#include <generated/utsrelease.h> +#include <linux/debugfs.h> #include "msm_drv.h" #include "msm_gpu.h" #include "msm_kms.h" #include "msm_debugfs.h"

+static int msm_gpu_crash_show(struct seq_file *m, void *data) +{ + struct msm_gpu *gpu = m->private; + struct msm_gpu_state *state; + + state = msm_gpu_crashstate_get(gpu); + if (!state) + return 0; + + seq_printf(m, "%s Crash Status:\n", gpu->name); + seq_puts(m, "Kernel: " UTS_RELEASE "\n"); + seq_printf(m, "Time: %ld s %ld us\n", + state->time.tv_sec, state->time.tv_usec); + + gpu->funcs->show(gpu, state, m); + + msm_gpu_crashstate_put(gpu); + + return 0; +} + +static ssize_t msm_gpu_crash_write(struct file *file, const char __user *buf, + size_t count, loff_t *pos) +{ + struct msm_gpu *gpu = ((struct seq_file *)file->private_data)->private; + + dev_err(gpu->dev->dev, "Releasing the GPU crash state\n"); + msm_gpu_crashstate_put(gpu); + + return count; +} + +static int msm_gpu_crash_open(struct inode *inode, struct file *file) +{ + struct msm_drm_private *priv = inode->i_private; + + if (!priv->gpu) + return -ENODEV; + + return single_open(file, msm_gpu_crash_show, priv->gpu); +} + +static const struct file_operations msm_gpu_crash_fops = { + .owner = THIS_MODULE, + .open = msm_gpu_crash_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, + .write = msm_gpu_crash_write, +}; + static int msm_gpu_show(struct drm_device *dev, struct seq_file *m) { struct msm_drm_private *priv = dev->dev_private; @@ -170,6 +224,9 @@ int msm_debugfs_init(struct drm_minor *minor) return ret; }

+ debugfs_create_file("crash", 0644, minor->debugfs_root, + priv, &msm_gpu_crash_fops); + if (priv->kms->funcs->debugfs_init) ret = priv->kms->funcs->debugfs_init(priv->kms, minor);

diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h index cff52ca..7ce2cba 100644 --- a/drivers/gpu/drm/msm/msm_gpu.h +++ b/drivers/gpu/drm/msm/msm_gpu.h @@ -69,7 +69,7 @@ struct msm_gpu_funcs { struct seq_file *m); #endif struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu); - void (*gpu_state_put)(struct msm_gpu_state *state); + int (*gpu_state_put)(struct msm_gpu_state *state); };

struct msm_gpu { @@ -129,6 +129,8 @@ struct msm_gpu { struct work_struct recover_work;

struct drm_gem_object *memptrs_bo; + + struct msm_gpu_state *crashstate; };

/* It turns out that all targets use the same ringbuffer size */ @@ -176,6 +178,7 @@ struct msm_gpu_submitqueue { };

struct msm_gpu_state { + struct kref ref; struct timeval time;

struct { @@ -270,4 +273,43 @@ static inline void msm_submitqueue_put(struct msm_gpu_submitqueue *queue) kref_put(&queue->ref, msm_submitqueue_destroy); }

+static inline void msm_gpu_crashstate_set(struct msm_gpu *gpu, + struct msm_gpu_state *state) +{ + /* FIXME: make sure the mutex is set? */ + + if (!IS_ERR_OR_NULL(state) && !gpu->crashstate) { + kref_get(&state->ref); + gpu->crashstate = state; + } +} + +static inline struct msm_gpu_state *msm_gpu_crashstate_get(struct msm_gpu *gpu) +{ + struct msm_gpu_state *state = NULL; + + mutex_lock(&gpu->dev->struct_mutex); + + if (gpu->crashstate) { + kref_get(&gpu->crashstate->ref); + state = gpu->crashstate; + } + + mutex_unlock(&gpu->dev->struct_mutex); + + return state; +} + +static inline void msm_gpu_crashstate_put(struct msm_gpu *gpu) +{ + mutex_lock(&gpu->dev->struct_mutex); + + if (gpu->crashstate) { + if (gpu->funcs->gpu_state_put(gpu->crashstate)) + gpu->crashstate = NULL; + } + + mutex_unlock(&gpu->dev->struct_mutex); +} + #endif /* __MSM_GPU_H__ */

-- 1.9.1

Chris Wilson

6:32 p.m.

Quoting Jordan Crouse (2018-01-05 18:00:17)

...

This is a request for comment on code to store and dump a GPU state a hang with inspiration from the very good i915 GPU error state and the binary GPU snapshot in the downstream kernel.

The goal is to store and provide enough information to debug software and hardware issues on the Adreno hardware in a semi human-readable format that can also be parsed by scripts.

The goal for this request for comment is to get some consensus about the format and work through some of the technical issues.

My biggest regret for i915/error is that we didn't adopt a sensible file format and organically grew it from dmesg-style logging. This is quite a hindrance when it comes to trying to improve the capture whilst maintaining compatibility with the existing tools. Switching to json/yaml at this point won't be too difficult to spot the change in format, just a large chunk of technical debt to pay off. So I would recommend you pick a an adaptable, human readable, file format for ease of tool development.

The second important feature for capturing error state is to include as much user information as possible. You want to be able to identify which library generated the hang in a post-mortem dump from a user in 6-12 months time, and just as importantly, why the library did what it did. I like the idea of userspace being able to attach buffers that are included in the error state (supplied as auxiliary information to the guilty command stream) to provide a flight-data-recorder from the user's pov. So design your interface with a view to extending to include blobs.

It would be interesting to have a common file format... While interpreting the data is going to highly specific to a gpu/driver, the data itself will be similar between drivers. If we had a common file format, we could extend something like mesa/intel/aubinator_error_decode and throw in a bunch of xml descriptors for the different gpus. Just a thought... -Chris

Jordan Crouse

10:11 p.m.

New subject: [Freedreno] [RFC 0/4] drm/msm: GPU crash state

On Fri, Jan 05, 2018 at 06:32:22PM +0000, Chris Wilson wrote:

...

Quoting Jordan Crouse (2018-01-05 18:00:17)

...
This is a request for comment on code to store and dump a GPU state a hang with inspiration from the very good i915 GPU error state and the binary GPU snapshot in the downstream kernel.

The goal is to store and provide enough information to debug software and hardware issues on the Adreno hardware in a semi human-readable format that can also be parsed by scripts.

The goal for this request for comment is to get some consensus about the format and work through some of the technical issues.

My biggest regret for i915/error is that we didn't adopt a sensible file format and organically grew it from dmesg-style logging. This is quite a hindrance when it comes to trying to improve the capture whilst maintaining compatibility with the existing tools. Switching to json/yaml at this point won't be too difficult to spot the change in format, just a large chunk of technical debt to pay off. So I would recommend you pick a an adaptable, human readable, file format for ease of tool development.

This is a really great suggestion. The downstream qcom kernel uses a strictly binary format which is also problematic for other reasons. I like the idea of having something standard and extensible while remaining human readable without tools.

...

The second important feature for capturing error state is to include as much user information as possible. You want to be able to identify which library generated the hang in a post-mortem dump from a user in 6-12 months time, and just as importantly, why the library did what it did. I like the idea of userspace being able to attach buffers that are included in the error state (supplied as auxiliary information to the guilty command stream) to provide a flight-data-recorder from the user's pov. So design your interface with a view to extending to include blobs.

I love the ascii85 and compression stuff that i915 does and that would fit in well a nice file format as well.

...

It would be interesting to have a common file format... While interpreting the data is going to highly specific to a gpu/driver, the data itself will be similar between drivers. If we had a common file format, we could extend something like mesa/intel/aubinator_error_decode and throw in a bunch of xml descriptors for the different gpus. Just a thought...

I'm definitely open to this. There is never anything wrong with improved debugging for everybody.

Thanks, Jordan

-- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

Rob Clark

10:51 p.m.

New subject: [Freedreno] [RFC 0/4] drm/msm: GPU crash state

On Fri, Jan 5, 2018 at 5:11 PM, Jordan Crouse jcrouse@codeaurora.org wrote:

...

On Fri, Jan 05, 2018 at 06:32:22PM +0000, Chris Wilson wrote:

...
Quoting Jordan Crouse (2018-01-05 18:00:17)

...
This is a request for comment on code to store and dump a GPU state a hang with inspiration from the very good i915 GPU error state and the binary GPU snapshot in the downstream kernel.

The goal is to store and provide enough information to debug software and hardware issues on the Adreno hardware in a semi human-readable format that can also be parsed by scripts.

The goal for this request for comment is to get some consensus about the format and work through some of the technical issues.

My biggest regret for i915/error is that we didn't adopt a sensible file format and organically grew it from dmesg-style logging. This is quite a hindrance when it comes to trying to improve the capture whilst maintaining compatibility with the existing tools. Switching to json/yaml at this point won't be too difficult to spot the change in format, just a large chunk of technical debt to pay off. So I would recommend you pick a an adaptable, human readable, file format for ease of tool development.

This is a really great suggestion. The downstream qcom kernel uses a strictly binary format which is also problematic for other reasons. I like the idea of having something standard and extensible while remaining human readable without tools.

...
The second important feature for capturing error state is to include as much user information as possible. You want to be able to identify which library generated the hang in a post-mortem dump from a user in 6-12 months time, and just as importantly, why the library did what it did. I like the idea of userspace being able to attach buffers that are included in the error state (supplied as auxiliary information to the guilty command stream) to provide a flight-data-recorder from the user's pov. So design your interface with a view to extending to include blobs.

I love the ascii85 and compression stuff that i915 does and that would fit in well a nice file format as well.

I guess I should dig out from under a pile of snow and unread email and read your patches.. but if you don't already, including at least the full cmdline of the process that triggered the hang would be hugely useful. This was a big improvement when I added it to hangcheck dbg msgs and hangrd (ie. otherwise 90% of piglit crashes just showed up as "shader_runner" or something equally generic)..

maybe it is possible to capture more of /proc/$pid/* ? I guess that is a good area where code could be shared across drivers (ie. generic sections for /proc/$pid/cmdline and /proc/$pid/maps and whatever else, plus driver specific sections for various register/cmdstream/debug-state dumps)

Perhaps buffer-snapshots could be shared too (at least as far as file format).. I guess usefulness to other drivers depends on how much cmdstream has pointers to other useful stuff (like more cmdstream or shaders). This is probably the sorta thing that should be tunable since sometimes capturing every bo ref'd by the submit could be a bit much.

I'd definitely be open to extending my cmdstream parsing tools to a new format.. the "rd" format was pretty organically grown (although the type/length/value binary format gave enough extensibility that I didn't need to throw it away yet).. but I could definitely see the value in something more human readable yet still parsable to feed into cmdstream/register parsing.

BR, -R

...

...
It would be interesting to have a common file format... While interpreting the data is going to highly specific to a gpu/driver, the data itself will be similar between drivers. If we had a common file format, we could extend something like mesa/intel/aubinator_error_decode and throw in a bunch of xml descriptors for the different gpus. Just a thought...

I'm definitely open to this. There is never anything wrong with improved debugging for everybody.

Thanks, Jordan

-- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project -- To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

2664

Age (days ago)

2664

Last active (days ago)

dri-devel@lists.freedesktop.org

7 comments

3 participants

tags (0)

participants (3)

Chris Wilson
Jordan Crouse
Rob Clark