Hello,
Sorry for the noise, I forgot to Cc dri-devel on my v2 :-/.
This is a complete rework of the perf counter stuff I submitted a few weeks back. This version is pretty trivial compared to the first implementation and it's not meant to be the final solution, but rather something that allow mesa developers to debug perf-related issues while we settle on something more generic to allow GPU drivers to expose their global perf counters.
I'd like to make it clear that this debugfs interface is unstable and should be deprecated as soon as we have the generic solution in place, so please don't consider it as part of the stable ABI.
Regards,
Boris
Boris Brezillon (2): drm/panfrost: Move gpu_{write,read}() macros to panfrost_regs.h drm/panfrost: Expose perf counters through debugfs
drivers/gpu/drm/panfrost/Makefile | 3 +- drivers/gpu/drm/panfrost/panfrost_device.c | 9 + drivers/gpu/drm/panfrost/panfrost_device.h | 3 + drivers/gpu/drm/panfrost/panfrost_drv.c | 7 + drivers/gpu/drm/panfrost/panfrost_gpu.c | 10 +- drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 339 ++++++++++++++++++++ drivers/gpu/drm/panfrost/panfrost_perfcnt.h | 15 + drivers/gpu/drm/panfrost/panfrost_regs.h | 22 ++ 8 files changed, 404 insertions(+), 4 deletions(-) create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h
So they can be used from other files.
Signed-off-by: Boris Brezillon boris.brezillon@collabora.com --- Changes in v2: - None --- drivers/gpu/drm/panfrost/panfrost_gpu.c | 3 --- drivers/gpu/drm/panfrost/panfrost_regs.h | 3 +++ 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c index 58ef25573cda..6e68a100291c 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -17,9 +17,6 @@ #include "panfrost_gpu.h" #include "panfrost_regs.h"
-#define gpu_write(dev, reg, data) writel(data, dev->iomem + reg) -#define gpu_read(dev, reg) readl(dev->iomem + reg) - static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data) { struct panfrost_device *pfdev = data; diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h index 578c5fc2188b..42d08860fd76 100644 --- a/drivers/gpu/drm/panfrost/panfrost_regs.h +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h @@ -295,4 +295,7 @@ #define AS_FAULTSTATUS_ACCESS_TYPE_READ (0x2 << 8) #define AS_FAULTSTATUS_ACCESS_TYPE_WRITE (0x3 << 8)
+#define gpu_write(dev, reg, data) writel(data, dev->iomem + reg) +#define gpu_read(dev, reg) readl(dev->iomem + reg) + #endif
Add a way to dump perf counters through debugfs. The implementation is kept simple and has a number of limitations:
* it's not designed for multi-user usage as the counter values are reset after each dump and there's no per-user context * only accessible to root users * no counters naming/position abstraction. Things are dumped in a raw format that has to be parsed by the user who has to know where the relevant values are and what they mean
This implementation is intended to be used by mesa developers to help debug perf-related issues while we work on a more generic approach that would allow all GPU drivers to expose their counters in a consistent way. As a result, this debugfs interface is considered unstable and might be deprecated in the future.
Signed-off-by: Boris Brezillon boris.brezillon@collabora.com --- Changes in v2: * Expose counters through debugfs and keep things simple for now (we'll work on a generic solution in parallel) --- drivers/gpu/drm/panfrost/Makefile | 3 +- drivers/gpu/drm/panfrost/panfrost_device.c | 9 + drivers/gpu/drm/panfrost/panfrost_device.h | 3 + drivers/gpu/drm/panfrost/panfrost_drv.c | 7 + drivers/gpu/drm/panfrost/panfrost_gpu.c | 7 + drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 339 ++++++++++++++++++++ drivers/gpu/drm/panfrost/panfrost_perfcnt.h | 15 + drivers/gpu/drm/panfrost/panfrost_regs.h | 19 ++ 8 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h
diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile index 6de72d13c58f..ecf0864cb515 100644 --- a/drivers/gpu/drm/panfrost/Makefile +++ b/drivers/gpu/drm/panfrost/Makefile @@ -7,6 +7,7 @@ panfrost-y := \ panfrost_gem.o \ panfrost_gpu.o \ panfrost_job.o \ - panfrost_mmu.o + panfrost_mmu.o \ + panfrost_perfcnt.o
obj-$(CONFIG_DRM_PANFROST) += panfrost.o diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c index 3b2bced1b015..b2395fc68a1f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.c +++ b/drivers/gpu/drm/panfrost/panfrost_device.c @@ -14,6 +14,7 @@ #include "panfrost_gpu.h" #include "panfrost_job.h" #include "panfrost_mmu.h" +#include "panfrost_perfcnt.h"
static int panfrost_reset_init(struct panfrost_device *pfdev) { @@ -149,7 +150,13 @@ int panfrost_device_init(struct panfrost_device *pfdev) pm_runtime_mark_last_busy(pfdev->dev); pm_runtime_put_autosuspend(pfdev->dev);
+ err = panfrost_perfcnt_init(pfdev); + if (err) + goto err_out5; + return 0; +err_out5: + panfrost_job_fini(pfdev); err_out4: panfrost_mmu_fini(pfdev); err_out3: @@ -165,6 +172,7 @@ int panfrost_device_init(struct panfrost_device *pfdev)
void panfrost_device_fini(struct panfrost_device *pfdev) { + panfrost_perfcnt_fini(pfdev); panfrost_job_fini(pfdev); panfrost_mmu_fini(pfdev); panfrost_gpu_fini(pfdev); @@ -237,6 +245,7 @@ int panfrost_device_resume(struct device *dev) panfrost_mmu_enable(pfdev, 0); panfrost_job_enable_interrupts(pfdev); panfrost_devfreq_resume(pfdev); + panfrost_perfcnt_resume(pfdev);
return 0; } diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h index 56f452dfb490..628d704b76ba 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.h +++ b/drivers/gpu/drm/panfrost/panfrost_device.h @@ -14,6 +14,7 @@ struct panfrost_device; struct panfrost_mmu; struct panfrost_job_slot; struct panfrost_job; +struct panfrost_perfcnt;
#define NUM_JOB_SLOTS 3
@@ -77,6 +78,8 @@ struct panfrost_device { struct panfrost_job *jobs[NUM_JOB_SLOTS]; struct list_head scheduled_jobs;
+ struct panfrost_perfcnt *perfcnt; + struct mutex sched_lock; struct mutex reset_lock;
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 94b0819ad50b..202397537eea 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -19,6 +19,7 @@ #include "panfrost_mmu.h" #include "panfrost_job.h" #include "panfrost_gpu.h" +#include "panfrost_perfcnt.h"
static int panfrost_ioctl_get_param(struct drm_device *ddev, void *data, struct drm_file *file) { @@ -340,6 +341,11 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
DEFINE_DRM_GEM_SHMEM_FOPS(panfrost_drm_driver_fops);
+static int panfrost_debugfs_init(struct drm_minor *minor) +{ + return panfrost_perfcnt_debugfs_init(minor); +} + static struct drm_driver panfrost_drm_driver = { .driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_PRIME | DRIVER_SYNCOBJ, @@ -359,6 +365,7 @@ static struct drm_driver panfrost_drm_driver = { .prime_fd_to_handle = drm_gem_prime_fd_to_handle, .gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table, .gem_prime_mmap = drm_gem_prime_mmap, + .debugfs_init = panfrost_debugfs_init, };
static int panfrost_probe(struct platform_device *pdev) diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c index 6e68a100291c..20ab333fc925 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -15,6 +15,7 @@ #include "panfrost_features.h" #include "panfrost_issues.h" #include "panfrost_gpu.h" +#include "panfrost_perfcnt.h" #include "panfrost_regs.h"
static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data) @@ -40,6 +41,12 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data) gpu_write(pfdev, GPU_INT_MASK, 0); }
+ if (state & GPU_IRQ_PERFCNT_SAMPLE_COMPLETED) + panfrost_perfcnt_sample_done(pfdev); + + if (state & GPU_IRQ_CLEAN_CACHES_COMPLETED) + panfrost_perfcnt_clean_cache_done(pfdev); + gpu_write(pfdev, GPU_INT_CLEAR, state);
return IRQ_HANDLED; diff --git a/drivers/gpu/drm/panfrost/panfrost_perfcnt.c b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c new file mode 100644 index 000000000000..8093783a5a26 --- /dev/null +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c @@ -0,0 +1,339 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright 2019 Collabora Ltd */ + +#include <drm/drm_file.h> +#include <drm/drm_gem_shmem_helper.h> +#include <drm/panfrost_drm.h> +#include <linux/completion.h> +#include <linux/debugfs.h> +#include <linux/iopoll.h> +#include <linux/pm_runtime.h> +#include <linux/slab.h> +#include <linux/uaccess.h> + +#include "panfrost_device.h" +#include "panfrost_features.h" +#include "panfrost_gem.h" +#include "panfrost_issues.h" +#include "panfrost_job.h" +#include "panfrost_mmu.h" +#include "panfrost_regs.h" + +#define COUNTERS_PER_BLOCK 64 +#define BYTES_PER_COUNTER 4 +#define BLOCKS_PER_COREGROUP 8 +#define V4_SHADERS_PER_COREGROUP 4 + +struct panfrost_perfcnt { + struct panfrost_gem_object *bo; + size_t bosize; + void *buf; + bool enabled; + struct mutex lock; + struct completion dump_comp; +}; + +void panfrost_perfcnt_clean_cache_done(struct panfrost_device *pfdev) +{ + complete(&pfdev->perfcnt->dump_comp); +} + +void panfrost_perfcnt_sample_done(struct panfrost_device *pfdev) +{ + gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_CACHES); +} + +static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{ + u32 cfg; + + /* + * Always use address space 0 for now. + * FIXME: this needs to be updated when we start using different + * address space. + */ + cfg = GPU_PERFCNT_CFG_AS(0); + if (panfrost_model_cmp(pfdev, 0x1000) >= 0) + cfg |= GPU_PERFCNT_CFG_SETSEL(1); + + gpu_write(pfdev, GPU_PERFCNT_CFG, + cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF)); + + if (!pfdev->perfcnt->enabled) + return; + + gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff); + gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff); + gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff); + + /* + * Due to PRLAM-8186 we need to disable the Tiler before we enable HW + * counters. + */ + if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186)) + gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0); + else + gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff); + + gpu_write(pfdev, GPU_PERFCNT_CFG, + cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL)); + + if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186)) + gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff); +} + +static int panfrost_perfcnt_dump(struct panfrost_device *pfdev) +{ + u64 gpuva; + int ret; + + reinit_completion(&pfdev->perfcnt->dump_comp); + gpuva = pfdev->perfcnt->bo->node.start << PAGE_SHIFT; + gpu_write(pfdev, GPU_PERFCNT_BASE_LO, gpuva); + gpu_write(pfdev, GPU_PERFCNT_BASE_HI, gpuva >> 32); + gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_SAMPLE); + ret = wait_for_completion_interruptible_timeout(&pfdev->perfcnt->dump_comp, + msecs_to_jiffies(1000)); + if (!ret) + ret = -ETIMEDOUT; + else if (ret > 0) + ret = 0; + + return ret; +} + +void panfrost_perfcnt_resume(struct panfrost_device *pfdev) +{ + if (pfdev->perfcnt) + panfrost_perfcnt_setup(pfdev); +} + +static ssize_t panfrost_perfcnt_enable_read(struct file *file, + char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct panfrost_device *pfdev = file->private_data; + ssize_t ret; + + mutex_lock(&pfdev->perfcnt->lock); + ret = simple_read_from_buffer(user_buf, count, ppos, + pfdev->perfcnt->enabled ? "Y\n" : "N\n", + 2); + mutex_unlock(&pfdev->perfcnt->lock); + + return ret; +} + +static ssize_t panfrost_perfcnt_enable_write(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct panfrost_device *pfdev = file->private_data; + bool enable; + int ret; + + ret = kstrtobool_from_user(user_buf, count, &enable); + if (ret) + return ret; + + ret = pm_runtime_get_sync(pfdev->dev); + if (ret < 0) + return ret; + + mutex_lock(&pfdev->perfcnt->lock); + if (enable != pfdev->perfcnt->enabled) { + pfdev->perfcnt->enabled = enable; + panfrost_perfcnt_setup(pfdev); + } + mutex_unlock(&pfdev->perfcnt->lock); + + pm_runtime_mark_last_busy(pfdev->dev); + pm_runtime_put_autosuspend(pfdev->dev); + + return count; +} + +static const struct file_operations panfrost_perfcnt_enable_fops = { + .read = panfrost_perfcnt_enable_read, + .write = panfrost_perfcnt_enable_write, + .open = simple_open, + .llseek = default_llseek, +}; + +static ssize_t panfrost_perfcnt_dump_read(struct file *file, + char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct panfrost_device *pfdev = file->private_data; + ssize_t ret; + + ret = pm_runtime_get_sync(pfdev->dev); + if (ret < 0) + return ret; + + mutex_lock(&pfdev->perfcnt->lock); + if (!pfdev->perfcnt->enabled) { + ret = -EINVAL; + goto out; + } + + ret = panfrost_perfcnt_dump(pfdev); + if (ret) + goto out; + + ret = simple_read_from_buffer(user_buf, count, ppos, + pfdev->perfcnt->buf, + pfdev->perfcnt->bosize); + +out: + mutex_unlock(&pfdev->perfcnt->lock); + pm_runtime_mark_last_busy(pfdev->dev); + pm_runtime_put_autosuspend(pfdev->dev); + + return ret; +} + +static const struct file_operations panfrost_perfcnt_dump_fops = { + .read = panfrost_perfcnt_dump_read, + .open = simple_open, + .llseek = default_llseek, +}; + +int panfrost_perfcnt_debugfs_init(struct drm_minor *minor) +{ + struct panfrost_device *pfdev = to_panfrost_device(minor->dev); + struct dentry *file, *dir; + + dir = debugfs_create_dir("perfcnt", minor->debugfs_root); + if (IS_ERR(dir)) + return PTR_ERR(dir); + + file = debugfs_create_file("dump", 0400, dir, pfdev, + &panfrost_perfcnt_dump_fops); + if (IS_ERR(file)) + return PTR_ERR(file); + + file = debugfs_create_file("enable", 0600, dir, pfdev, + &panfrost_perfcnt_enable_fops); + if (IS_ERR(file)) + return PTR_ERR(file); + + return 0; +} + +int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{ + struct panfrost_perfcnt *perfcnt; + struct drm_gem_shmem_object *bo; + size_t size; + u32 status; + int ret; + + if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) { + unsigned int ncoregroups; + + ncoregroups = hweight64(pfdev->features.l2_present); + size = ncoregroups * BLOCKS_PER_COREGROUP * + COUNTERS_PER_BLOCK * BYTES_PER_COUNTER; + } else { + unsigned int nl2c, ncores; + + /* + * TODO: define a macro to extract the number of l2 caches from + * mem_features. + */ + nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1; + + /* + * The ARM driver is grouping cores per core group and then + * only using the number of cores in group 0 to calculate the + * size. Not sure why this is done like that, but I guess + * shader_present will only show cores in the first group + * anyway. + */ + ncores = hweight64(pfdev->features.shader_present); + + /* + * There's always one JM and one Tiler block, hence the '+ 2' + * here. + */ + size = (nl2c + ncores + 2) * + COUNTERS_PER_BLOCK * BYTES_PER_COUNTER; + } + + perfcnt = devm_kzalloc(pfdev->dev, sizeof(*perfcnt), GFP_KERNEL); + if (!perfcnt) + return -ENOMEM; + + bo = drm_gem_shmem_create(pfdev->ddev, size); + if (IS_ERR(bo)) + return PTR_ERR(bo); + + perfcnt->bo = to_panfrost_bo(&bo->base); + perfcnt->bosize = size; + + /* + * We always use the same buffer, so let's map it once and keep it + * mapped until the driver is unloaded. This might be a problem if + * we start using different AS and the perfcnt BO is not mapped at + * the same GPU virtual address. + */ + ret = panfrost_mmu_map(perfcnt->bo); + if (ret) + goto err_put_bo; + + /* Disable everything. */ + gpu_write(pfdev, GPU_PERFCNT_CFG, + GPU_PERFCNT_CFG_AS(0) | + GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF) | + (panfrost_model_cmp(pfdev, 0x1000) >= 0 ? + GPU_PERFCNT_CFG_SETSEL(1) : 0)); + gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0); + gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0); + gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0); + gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0); + + perfcnt->buf = drm_gem_vmap(&bo->base); + if (IS_ERR(perfcnt->buf)) { + ret = PTR_ERR(perfcnt->buf); + goto err_put_bo; + } + + init_completion(&perfcnt->dump_comp); + mutex_init(&perfcnt->lock); + pfdev->perfcnt = perfcnt; + + /* + * Invalidate the cache and clear the counters to start from a fresh + * state. + */ + gpu_write(pfdev, GPU_INT_MASK, 0); + gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_CLEAN_CACHES_COMPLETED); + + gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_CLEAR); + gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_INV_CACHES); + ret = readl_relaxed_poll_timeout(pfdev->iomem + GPU_INT_RAWSTAT, + status, + status & + GPU_IRQ_CLEAN_CACHES_COMPLETED, + 100, 10000); + if (ret) + goto err_gem_vunmap; + + gpu_write(pfdev, GPU_INT_MASK, GPU_IRQ_MASK_ALL); + + return 0; + +err_gem_vunmap: + drm_gem_vunmap(&pfdev->perfcnt->bo->base.base, pfdev->perfcnt->buf); + +err_put_bo: + drm_gem_object_put_unlocked(&bo->base); + return ret; +} + +void panfrost_perfcnt_fini(struct panfrost_device *pfdev) +{ + drm_gem_vunmap(&pfdev->perfcnt->bo->base.base, pfdev->perfcnt->buf); + drm_gem_object_put_unlocked(&pfdev->perfcnt->bo->base.base); +} diff --git a/drivers/gpu/drm/panfrost/panfrost_perfcnt.h b/drivers/gpu/drm/panfrost/panfrost_perfcnt.h new file mode 100644 index 000000000000..b21745398178 --- /dev/null +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright 2019 Collabora Ltd */ +#ifndef __PANFROST_PERFCNT_H__ +#define __PANFROST_PERFCNT_H__ + +#include "panfrost_device.h" + +void panfrost_perfcnt_sample_done(struct panfrost_device *pfdev); +void panfrost_perfcnt_clean_cache_done(struct panfrost_device *pfdev); +int panfrost_perfcnt_init(struct panfrost_device *pfdev); +void panfrost_perfcnt_fini(struct panfrost_device *pfdev); +void panfrost_perfcnt_resume(struct panfrost_device *pfdev); +int panfrost_perfcnt_debugfs_init(struct drm_minor *minor); + +#endif diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h index 42d08860fd76..ea38ac60581c 100644 --- a/drivers/gpu/drm/panfrost/panfrost_regs.h +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h @@ -44,12 +44,31 @@ GPU_IRQ_MULTIPLE_FAULT) #define GPU_CMD 0x30 #define GPU_CMD_SOFT_RESET 0x01 +#define GPU_CMD_PERFCNT_CLEAR 0x03 +#define GPU_CMD_PERFCNT_SAMPLE 0x04 +#define GPU_CMD_CLEAN_CACHES 0x07 +#define GPU_CMD_CLEAN_INV_CACHES 0x08 #define GPU_STATUS 0x34 +#define GPU_STATUS_PRFCNT_ACTIVE BIT(2) #define GPU_LATEST_FLUSH_ID 0x38 #define GPU_FAULT_STATUS 0x3C #define GPU_FAULT_ADDRESS_LO 0x40 #define GPU_FAULT_ADDRESS_HI 0x44
+#define GPU_PERFCNT_BASE_LO 0x60 +#define GPU_PERFCNT_BASE_HI 0x64 +#define GPU_PERFCNT_CFG 0x68 +#define GPU_PERFCNT_CFG_MODE(x) (x) +#define GPU_PERFCNT_CFG_MODE_OFF 0 +#define GPU_PERFCNT_CFG_MODE_MANUAL 1 +#define GPU_PERFCNT_CFG_MODE_TILE 2 +#define GPU_PERFCNT_CFG_AS(x) ((x) << 4) +#define GPU_PERFCNT_CFG_SETSEL(x) ((x) << 8) +#define GPU_PRFCNT_JM_EN 0x6c +#define GPU_PRFCNT_SHADER_EN 0x70 +#define GPU_PRFCNT_TILER_EN 0x74 +#define GPU_PRFCNT_MMU_L2_EN 0x7c + #define GPU_THREAD_MAX_THREADS 0x0A0 /* (RO) Maximum number of threads per core */ #define GPU_THREAD_MAX_WORKGROUP_SIZE 0x0A4 /* (RO) Maximum workgroup size */ #define GPU_THREAD_MAX_BARRIER_SIZE 0x0A8 /* (RO) Maximum threads waiting at a barrier */
On Tue, 14 May 2019 at 11:48, Boris Brezillon boris.brezillon@collabora.com wrote:
Add a way to dump perf counters through debugfs. The implementation is kept simple and has a number of limitations:
- it's not designed for multi-user usage as the counter values are reset after each dump and there's no per-user context
- only accessible to root users
- no counters naming/position abstraction. Things are dumped in a raw format that has to be parsed by the user who has to know where the relevant values are and what they mean
This implementation is intended to be used by mesa developers to help debug perf-related issues while we work on a more generic approach that would allow all GPU drivers to expose their counters in a consistent way. As a result, this debugfs interface is considered unstable and might be deprecated in the future.
An idea:
Add module_param_unsafe() module parameter and expose the debugfs files only when set. Seemingly the i915 team have been using it to a similar extend to highlight the feature is unstable.
Note: setting the _unsafe param taints the kernel, which AFAICT is the part which makes is extra useful.
I could be wrong of course :-)
HTH -Emil
On Tue, May 14, 2019 at 5:48 AM Boris Brezillon boris.brezillon@collabora.com wrote:
Add a way to dump perf counters through debugfs. The implementation is kept simple and has a number of limitations:
- it's not designed for multi-user usage as the counter values are reset after each dump and there's no per-user context
- only accessible to root users
- no counters naming/position abstraction. Things are dumped in a raw format that has to be parsed by the user who has to know where the relevant values are and what they mean
This implementation is intended to be used by mesa developers to help debug perf-related issues while we work on a more generic approach that would allow all GPU drivers to expose their counters in a consistent way. As a result, this debugfs interface is considered unstable and might be deprecated in the future.
Signed-off-by: Boris Brezillon boris.brezillon@collabora.com
Changes in v2:
- Expose counters through debugfs and keep things simple for now (we'll work on a generic solution in parallel)
drivers/gpu/drm/panfrost/Makefile | 3 +- drivers/gpu/drm/panfrost/panfrost_device.c | 9 + drivers/gpu/drm/panfrost/panfrost_device.h | 3 + drivers/gpu/drm/panfrost/panfrost_drv.c | 7 + drivers/gpu/drm/panfrost/panfrost_gpu.c | 7 + drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 339 ++++++++++++++++++++ drivers/gpu/drm/panfrost/panfrost_perfcnt.h | 15 + drivers/gpu/drm/panfrost/panfrost_regs.h | 19 ++ 8 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h
diff --git a/drivers/gpu/drm/panfrost/Makefile b/drivers/gpu/drm/panfrost/Makefile index 6de72d13c58f..ecf0864cb515 100644 --- a/drivers/gpu/drm/panfrost/Makefile +++ b/drivers/gpu/drm/panfrost/Makefile @@ -7,6 +7,7 @@ panfrost-y := \ panfrost_gem.o \ panfrost_gpu.o \ panfrost_job.o \
panfrost_mmu.o
panfrost_mmu.o \
panfrost_perfcnt.o
obj-$(CONFIG_DRM_PANFROST) += panfrost.o diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c index 3b2bced1b015..b2395fc68a1f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.c +++ b/drivers/gpu/drm/panfrost/panfrost_device.c @@ -14,6 +14,7 @@ #include "panfrost_gpu.h" #include "panfrost_job.h" #include "panfrost_mmu.h" +#include "panfrost_perfcnt.h"
static int panfrost_reset_init(struct panfrost_device *pfdev) { @@ -149,7 +150,13 @@ int panfrost_device_init(struct panfrost_device *pfdev) pm_runtime_mark_last_busy(pfdev->dev); pm_runtime_put_autosuspend(pfdev->dev);
err = panfrost_perfcnt_init(pfdev);
if (err)
goto err_out5;
return 0;
+err_out5:
panfrost_job_fini(pfdev);
err_out4: panfrost_mmu_fini(pfdev); err_out3: @@ -165,6 +172,7 @@ int panfrost_device_init(struct panfrost_device *pfdev)
void panfrost_device_fini(struct panfrost_device *pfdev) {
panfrost_perfcnt_fini(pfdev); panfrost_job_fini(pfdev); panfrost_mmu_fini(pfdev); panfrost_gpu_fini(pfdev);
@@ -237,6 +245,7 @@ int panfrost_device_resume(struct device *dev) panfrost_mmu_enable(pfdev, 0); panfrost_job_enable_interrupts(pfdev); panfrost_devfreq_resume(pfdev);
panfrost_perfcnt_resume(pfdev); return 0;
} diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h index 56f452dfb490..628d704b76ba 100644 --- a/drivers/gpu/drm/panfrost/panfrost_device.h +++ b/drivers/gpu/drm/panfrost/panfrost_device.h @@ -14,6 +14,7 @@ struct panfrost_device; struct panfrost_mmu; struct panfrost_job_slot; struct panfrost_job; +struct panfrost_perfcnt;
#define NUM_JOB_SLOTS 3
@@ -77,6 +78,8 @@ struct panfrost_device { struct panfrost_job *jobs[NUM_JOB_SLOTS]; struct list_head scheduled_jobs;
struct panfrost_perfcnt *perfcnt;
struct mutex sched_lock; struct mutex reset_lock;
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 94b0819ad50b..202397537eea 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -19,6 +19,7 @@ #include "panfrost_mmu.h" #include "panfrost_job.h" #include "panfrost_gpu.h" +#include "panfrost_perfcnt.h"
static int panfrost_ioctl_get_param(struct drm_device *ddev, void *data, struct drm_file *file) { @@ -340,6 +341,11 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
DEFINE_DRM_GEM_SHMEM_FOPS(panfrost_drm_driver_fops);
+static int panfrost_debugfs_init(struct drm_minor *minor) +{
return panfrost_perfcnt_debugfs_init(minor);
+}
static struct drm_driver panfrost_drm_driver = { .driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_PRIME | DRIVER_SYNCOBJ, @@ -359,6 +365,7 @@ static struct drm_driver panfrost_drm_driver = { .prime_fd_to_handle = drm_gem_prime_fd_to_handle, .gem_prime_import_sg_table = panfrost_gem_prime_import_sg_table, .gem_prime_mmap = drm_gem_prime_mmap,
.debugfs_init = panfrost_debugfs_init,
};
static int panfrost_probe(struct platform_device *pdev) diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c index 6e68a100291c..20ab333fc925 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -15,6 +15,7 @@ #include "panfrost_features.h" #include "panfrost_issues.h" #include "panfrost_gpu.h" +#include "panfrost_perfcnt.h" #include "panfrost_regs.h"
static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data) @@ -40,6 +41,12 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data) gpu_write(pfdev, GPU_INT_MASK, 0); }
if (state & GPU_IRQ_PERFCNT_SAMPLE_COMPLETED)
panfrost_perfcnt_sample_done(pfdev);
if (state & GPU_IRQ_CLEAN_CACHES_COMPLETED)
panfrost_perfcnt_clean_cache_done(pfdev);
gpu_write(pfdev, GPU_INT_CLEAR, state); return IRQ_HANDLED;
diff --git a/drivers/gpu/drm/panfrost/panfrost_perfcnt.c b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c new file mode 100644 index 000000000000..8093783a5a26 --- /dev/null +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c @@ -0,0 +1,339 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright 2019 Collabora Ltd */
+#include <drm/drm_file.h> +#include <drm/drm_gem_shmem_helper.h> +#include <drm/panfrost_drm.h> +#include <linux/completion.h> +#include <linux/debugfs.h> +#include <linux/iopoll.h> +#include <linux/pm_runtime.h> +#include <linux/slab.h> +#include <linux/uaccess.h>
+#include "panfrost_device.h" +#include "panfrost_features.h" +#include "panfrost_gem.h" +#include "panfrost_issues.h" +#include "panfrost_job.h" +#include "panfrost_mmu.h" +#include "panfrost_regs.h"
+#define COUNTERS_PER_BLOCK 64 +#define BYTES_PER_COUNTER 4 +#define BLOCKS_PER_COREGROUP 8 +#define V4_SHADERS_PER_COREGROUP 4
+struct panfrost_perfcnt {
struct panfrost_gem_object *bo;
size_t bosize;
void *buf;
bool enabled;
struct mutex lock;
struct completion dump_comp;
+};
+void panfrost_perfcnt_clean_cache_done(struct panfrost_device *pfdev) +{
complete(&pfdev->perfcnt->dump_comp);
+}
+void panfrost_perfcnt_sample_done(struct panfrost_device *pfdev) +{
gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_CACHES);
+}
+static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{
u32 cfg;
/*
* Always use address space 0 for now.
* FIXME: this needs to be updated when we start using different
* address space.
*/
cfg = GPU_PERFCNT_CFG_AS(0);
if (panfrost_model_cmp(pfdev, 0x1000) >= 0)
You've got a couple of these. Perhaps we should add either a model_is_bifrost() helper or an is_bifrost variable to use.
cfg |= GPU_PERFCNT_CFG_SETSEL(1);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
if (!pfdev->perfcnt->enabled)
return;
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
/*
* Due to PRLAM-8186 we need to disable the Tiler before we enable HW
* counters.
Seems like you could just always apply the work-around? It doesn't look like it would be much different.
*/
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
else
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
+}
+static int panfrost_perfcnt_dump(struct panfrost_device *pfdev) +{
u64 gpuva;
int ret;
reinit_completion(&pfdev->perfcnt->dump_comp);
gpuva = pfdev->perfcnt->bo->node.start << PAGE_SHIFT;
gpu_write(pfdev, GPU_PERFCNT_BASE_LO, gpuva);
gpu_write(pfdev, GPU_PERFCNT_BASE_HI, gpuva >> 32);
gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_SAMPLE);
ret = wait_for_completion_interruptible_timeout(&pfdev->perfcnt->dump_comp,
msecs_to_jiffies(1000));
if (!ret)
ret = -ETIMEDOUT;
else if (ret > 0)
ret = 0;
return ret;
+}
+void panfrost_perfcnt_resume(struct panfrost_device *pfdev)
No suspend handling needed?
+{
if (pfdev->perfcnt)
panfrost_perfcnt_setup(pfdev);
+}
+static ssize_t panfrost_perfcnt_enable_read(struct file *file,
char __user *user_buf,
size_t count, loff_t *ppos)
+{
struct panfrost_device *pfdev = file->private_data;
ssize_t ret;
mutex_lock(&pfdev->perfcnt->lock);
ret = simple_read_from_buffer(user_buf, count, ppos,
pfdev->perfcnt->enabled ? "Y\n" : "N\n",
2);
mutex_unlock(&pfdev->perfcnt->lock);
return ret;
+}
+static ssize_t panfrost_perfcnt_enable_write(struct file *file,
const char __user *user_buf,
size_t count, loff_t *ppos)
+{
struct panfrost_device *pfdev = file->private_data;
bool enable;
int ret;
ret = kstrtobool_from_user(user_buf, count, &enable);
if (ret)
return ret;
ret = pm_runtime_get_sync(pfdev->dev);
if (ret < 0)
return ret;
mutex_lock(&pfdev->perfcnt->lock);
if (enable != pfdev->perfcnt->enabled) {
pfdev->perfcnt->enabled = enable;
panfrost_perfcnt_setup(pfdev);
}
mutex_unlock(&pfdev->perfcnt->lock);
pm_runtime_mark_last_busy(pfdev->dev);
pm_runtime_put_autosuspend(pfdev->dev);
return count;
+}
+static const struct file_operations panfrost_perfcnt_enable_fops = {
.read = panfrost_perfcnt_enable_read,
.write = panfrost_perfcnt_enable_write,
.open = simple_open,
.llseek = default_llseek,
+};
+static ssize_t panfrost_perfcnt_dump_read(struct file *file,
char __user *user_buf,
size_t count, loff_t *ppos)
+{
struct panfrost_device *pfdev = file->private_data;
ssize_t ret;
ret = pm_runtime_get_sync(pfdev->dev);
if (ret < 0)
return ret;
mutex_lock(&pfdev->perfcnt->lock);
if (!pfdev->perfcnt->enabled) {
ret = -EINVAL;
goto out;
}
ret = panfrost_perfcnt_dump(pfdev);
if (ret)
goto out;
ret = simple_read_from_buffer(user_buf, count, ppos,
pfdev->perfcnt->buf,
pfdev->perfcnt->bosize);
+out:
mutex_unlock(&pfdev->perfcnt->lock);
pm_runtime_mark_last_busy(pfdev->dev);
pm_runtime_put_autosuspend(pfdev->dev);
return ret;
+}
+static const struct file_operations panfrost_perfcnt_dump_fops = {
.read = panfrost_perfcnt_dump_read,
.open = simple_open,
.llseek = default_llseek,
+};
+int panfrost_perfcnt_debugfs_init(struct drm_minor *minor) +{
struct panfrost_device *pfdev = to_panfrost_device(minor->dev);
struct dentry *file, *dir;
dir = debugfs_create_dir("perfcnt", minor->debugfs_root);
if (IS_ERR(dir))
return PTR_ERR(dir);
file = debugfs_create_file("dump", 0400, dir, pfdev,
&panfrost_perfcnt_dump_fops);
if (IS_ERR(file))
return PTR_ERR(file);
file = debugfs_create_file("enable", 0600, dir, pfdev,
&panfrost_perfcnt_enable_fops);
if (IS_ERR(file))
return PTR_ERR(file);
return 0;
+}
+int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{
struct panfrost_perfcnt *perfcnt;
struct drm_gem_shmem_object *bo;
size_t size;
u32 status;
int ret;
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
unsigned int ncoregroups;
ncoregroups = hweight64(pfdev->features.l2_present);
size = ncoregroups * BLOCKS_PER_COREGROUP *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
} else {
unsigned int nl2c, ncores;
/*
* TODO: define a macro to extract the number of l2 caches from
* mem_features.
*/
nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
/*
* The ARM driver is grouping cores per core group and then
* only using the number of cores in group 0 to calculate the
* size. Not sure why this is done like that, but I guess
* shader_present will only show cores in the first group
* anyway.
*/
ncores = hweight64(pfdev->features.shader_present);
/*
* There's always one JM and one Tiler block, hence the '+ 2'
* here.
*/
size = (nl2c + ncores + 2) *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
}
perfcnt = devm_kzalloc(pfdev->dev, sizeof(*perfcnt), GFP_KERNEL);
if (!perfcnt)
return -ENOMEM;
bo = drm_gem_shmem_create(pfdev->ddev, size);
if (IS_ERR(bo))
return PTR_ERR(bo);
perfcnt->bo = to_panfrost_bo(&bo->base);
perfcnt->bosize = size;
/*
* We always use the same buffer, so let's map it once and keep it
* mapped until the driver is unloaded. This might be a problem if
* we start using different AS and the perfcnt BO is not mapped at
* the same GPU virtual address.
Per FD address space is next up on my list, so we'll need to sort this out soon. As this interface is not tied to any FD, that would mean it will need its own address space (I assume it doesn't need to be shared?) and would reduce the number available to other clients (though presumably they will use some sort of LRU swapping).
Maybe Emil's suggestion on a module param would be sufficient to go ahead and make this an ioctl interface instead.
*/
ret = panfrost_mmu_map(perfcnt->bo);
if (ret)
goto err_put_bo;
/* Disable everything. */
gpu_write(pfdev, GPU_PERFCNT_CFG,
GPU_PERFCNT_CFG_AS(0) |
GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF) |
(panfrost_model_cmp(pfdev, 0x1000) >= 0 ?
GPU_PERFCNT_CFG_SETSEL(1) : 0));
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0);
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
perfcnt->buf = drm_gem_vmap(&bo->base);
if (IS_ERR(perfcnt->buf)) {
ret = PTR_ERR(perfcnt->buf);
goto err_put_bo;
}
init_completion(&perfcnt->dump_comp);
mutex_init(&perfcnt->lock);
pfdev->perfcnt = perfcnt;
/*
* Invalidate the cache and clear the counters to start from a fresh
* state.
*/
gpu_write(pfdev, GPU_INT_MASK, 0);
gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_CLEAN_CACHES_COMPLETED);
gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_CLEAR);
gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_INV_CACHES);
ret = readl_relaxed_poll_timeout(pfdev->iomem + GPU_INT_RAWSTAT,
status,
status &
GPU_IRQ_CLEAN_CACHES_COMPLETED,
100, 10000);
if (ret)
goto err_gem_vunmap;
gpu_write(pfdev, GPU_INT_MASK, GPU_IRQ_MASK_ALL);
We already setup GPU_INT_* registers elsewhere. We shouldn't have it in 2 places. This was why the register definitions were local to the .c files...
return 0;
+err_gem_vunmap:
drm_gem_vunmap(&pfdev->perfcnt->bo->base.base, pfdev->perfcnt->buf);
+err_put_bo:
drm_gem_object_put_unlocked(&bo->base);
return ret;
+}
+void panfrost_perfcnt_fini(struct panfrost_device *pfdev) +{
Need to do some h/w clean-up?
drm_gem_vunmap(&pfdev->perfcnt->bo->base.base, pfdev->perfcnt->buf);
drm_gem_object_put_unlocked(&pfdev->perfcnt->bo->base.base);
+} diff --git a/drivers/gpu/drm/panfrost/panfrost_perfcnt.h b/drivers/gpu/drm/panfrost/panfrost_perfcnt.h new file mode 100644 index 000000000000..b21745398178 --- /dev/null +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright 2019 Collabora Ltd */ +#ifndef __PANFROST_PERFCNT_H__ +#define __PANFROST_PERFCNT_H__
+#include "panfrost_device.h"
+void panfrost_perfcnt_sample_done(struct panfrost_device *pfdev); +void panfrost_perfcnt_clean_cache_done(struct panfrost_device *pfdev); +int panfrost_perfcnt_init(struct panfrost_device *pfdev); +void panfrost_perfcnt_fini(struct panfrost_device *pfdev); +void panfrost_perfcnt_resume(struct panfrost_device *pfdev); +int panfrost_perfcnt_debugfs_init(struct drm_minor *minor);
+#endif diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h index 42d08860fd76..ea38ac60581c 100644 --- a/drivers/gpu/drm/panfrost/panfrost_regs.h +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h @@ -44,12 +44,31 @@ GPU_IRQ_MULTIPLE_FAULT) #define GPU_CMD 0x30 #define GPU_CMD_SOFT_RESET 0x01 +#define GPU_CMD_PERFCNT_CLEAR 0x03 +#define GPU_CMD_PERFCNT_SAMPLE 0x04 +#define GPU_CMD_CLEAN_CACHES 0x07 +#define GPU_CMD_CLEAN_INV_CACHES 0x08 #define GPU_STATUS 0x34 +#define GPU_STATUS_PRFCNT_ACTIVE BIT(2) #define GPU_LATEST_FLUSH_ID 0x38 #define GPU_FAULT_STATUS 0x3C #define GPU_FAULT_ADDRESS_LO 0x40 #define GPU_FAULT_ADDRESS_HI 0x44
+#define GPU_PERFCNT_BASE_LO 0x60 +#define GPU_PERFCNT_BASE_HI 0x64 +#define GPU_PERFCNT_CFG 0x68 +#define GPU_PERFCNT_CFG_MODE(x) (x) +#define GPU_PERFCNT_CFG_MODE_OFF 0 +#define GPU_PERFCNT_CFG_MODE_MANUAL 1 +#define GPU_PERFCNT_CFG_MODE_TILE 2 +#define GPU_PERFCNT_CFG_AS(x) ((x) << 4) +#define GPU_PERFCNT_CFG_SETSEL(x) ((x) << 8) +#define GPU_PRFCNT_JM_EN 0x6c +#define GPU_PRFCNT_SHADER_EN 0x70 +#define GPU_PRFCNT_TILER_EN 0x74 +#define GPU_PRFCNT_MMU_L2_EN 0x7c
#define GPU_THREAD_MAX_THREADS 0x0A0 /* (RO) Maximum number of threads per core */ #define GPU_THREAD_MAX_WORKGROUP_SIZE 0x0A4 /* (RO) Maximum workgroup size */
#define GPU_THREAD_MAX_BARRIER_SIZE 0x0A8 /* (RO) Maximum threads waiting at a barrier */
2.20.1
Hi Rob,
On Tue, 14 May 2019 08:31:29 -0500 Rob Herring robh+dt@kernel.org wrote:
+static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{
u32 cfg;
/*
* Always use address space 0 for now.
* FIXME: this needs to be updated when we start using different
* address space.
*/
cfg = GPU_PERFCNT_CFG_AS(0);
if (panfrost_model_cmp(pfdev, 0x1000) >= 0)
You've got a couple of these. Perhaps we should add either a model_is_bifrost() helper or an is_bifrost variable to use.
Oops, Alyssa already mentioned that in her review and I forgot to address it in my v2. I'll add such an helper in v3.
cfg |= GPU_PERFCNT_CFG_SETSEL(1);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
if (!pfdev->perfcnt->enabled)
return;
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
/*
* Due to PRLAM-8186 we need to disable the Tiler before we enable HW
* counters.
Seems like you could just always apply the work-around? It doesn't look like it would be much different.
Yes, thought about doing that too, but I decided to keep it as done in mali_kbase just to be safe.
*/
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
else
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
+}
+static int panfrost_perfcnt_dump(struct panfrost_device *pfdev) +{
u64 gpuva;
int ret;
reinit_completion(&pfdev->perfcnt->dump_comp);
gpuva = pfdev->perfcnt->bo->node.start << PAGE_SHIFT;
gpu_write(pfdev, GPU_PERFCNT_BASE_LO, gpuva);
gpu_write(pfdev, GPU_PERFCNT_BASE_HI, gpuva >> 32);
gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_SAMPLE);
ret = wait_for_completion_interruptible_timeout(&pfdev->perfcnt->dump_comp,
msecs_to_jiffies(1000));
if (!ret)
ret = -ETIMEDOUT;
else if (ret > 0)
ret = 0;
return ret;
+}
+void panfrost_perfcnt_resume(struct panfrost_device *pfdev)
No suspend handling needed?
Nope, unless we care about dumping the counters before suspending the GPU, but given that this dump will be overridden by the next one, I don't see the point. Would make sense to do it if the "trigger dump" and "read dumped values" were 2 separate operations (behavior I can easily implement BTW).
+{
if (pfdev->perfcnt)
panfrost_perfcnt_setup(pfdev);
+}
+int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{
struct panfrost_perfcnt *perfcnt;
struct drm_gem_shmem_object *bo;
size_t size;
u32 status;
int ret;
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
unsigned int ncoregroups;
ncoregroups = hweight64(pfdev->features.l2_present);
size = ncoregroups * BLOCKS_PER_COREGROUP *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
} else {
unsigned int nl2c, ncores;
/*
* TODO: define a macro to extract the number of l2 caches from
* mem_features.
*/
nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
/*
* The ARM driver is grouping cores per core group and then
* only using the number of cores in group 0 to calculate the
* size. Not sure why this is done like that, but I guess
* shader_present will only show cores in the first group
* anyway.
*/
ncores = hweight64(pfdev->features.shader_present);
/*
* There's always one JM and one Tiler block, hence the '+ 2'
* here.
*/
size = (nl2c + ncores + 2) *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
}
perfcnt = devm_kzalloc(pfdev->dev, sizeof(*perfcnt), GFP_KERNEL);
if (!perfcnt)
return -ENOMEM;
bo = drm_gem_shmem_create(pfdev->ddev, size);
if (IS_ERR(bo))
return PTR_ERR(bo);
perfcnt->bo = to_panfrost_bo(&bo->base);
perfcnt->bosize = size;
/*
* We always use the same buffer, so let's map it once and keep it
* mapped until the driver is unloaded. This might be a problem if
* we start using different AS and the perfcnt BO is not mapped at
* the same GPU virtual address.
Per FD address space is next up on my list, so we'll need to sort this out soon. As this interface is not tied to any FD, that would mean it will need its own address space (I assume it doesn't need to be shared?) and would reduce the number available to other clients (though presumably they will use some sort of LRU swapping).
Maybe Emil's suggestion on a module param would be sufficient to go ahead and make this an ioctl interface instead.
Okay. I wanted to stay away from ioctl()s as this feature is supposed to be replaced by something else, but if you think that's okay to deprecate the ioctl()s I'm fine with this option.
*/
ret = panfrost_mmu_map(perfcnt->bo);
if (ret)
goto err_put_bo;
/* Disable everything. */
gpu_write(pfdev, GPU_PERFCNT_CFG,
GPU_PERFCNT_CFG_AS(0) |
GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF) |
(panfrost_model_cmp(pfdev, 0x1000) >= 0 ?
GPU_PERFCNT_CFG_SETSEL(1) : 0));
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0);
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
perfcnt->buf = drm_gem_vmap(&bo->base);
if (IS_ERR(perfcnt->buf)) {
ret = PTR_ERR(perfcnt->buf);
goto err_put_bo;
}
init_completion(&perfcnt->dump_comp);
mutex_init(&perfcnt->lock);
pfdev->perfcnt = perfcnt;
/*
* Invalidate the cache and clear the counters to start from a fresh
* state.
*/
gpu_write(pfdev, GPU_INT_MASK, 0);
gpu_write(pfdev, GPU_INT_CLEAR, GPU_IRQ_CLEAN_CACHES_COMPLETED);
gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_CLEAR);
gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_INV_CACHES);
ret = readl_relaxed_poll_timeout(pfdev->iomem + GPU_INT_RAWSTAT,
status,
status &
GPU_IRQ_CLEAN_CACHES_COMPLETED,
100, 10000);
if (ret)
goto err_gem_vunmap;
gpu_write(pfdev, GPU_INT_MASK, GPU_IRQ_MASK_ALL);
We already setup GPU_INT_* registers elsewhere. We shouldn't have it in 2 places. This was why the register definitions were local to the .c files...
Was needed to prevent the irq handler from calling the sample/clean_cache_done() functions. Now that those functions just call complete() on the ->dump_comp field it should be safe to get rid of that trick and use wait_for_completion_timeout() instead of readl_poll().
return 0;
+err_gem_vunmap:
drm_gem_vunmap(&pfdev->perfcnt->bo->base.base, pfdev->perfcnt->buf);
+err_put_bo:
drm_gem_object_put_unlocked(&bo->base);
return ret;
+}
+void panfrost_perfcnt_fini(struct panfrost_device *pfdev) +{
Need to do some h/w clean-up?
I'll add code to disable the perf monitor.
drm_gem_vunmap(&pfdev->perfcnt->bo->base.base, pfdev->perfcnt->buf);
drm_gem_object_put_unlocked(&pfdev->perfcnt->bo->base.base);
+}
Thanks for the review.
Regards,
Boris
On 14/05/2019 14:31, Rob Herring wrote:
On Tue, May 14, 2019 at 5:48 AM Boris Brezillon boris.brezillon@collabora.com wrote:
Add a way to dump perf counters through debugfs. The implementation is kept simple and has a number of limitations:
- it's not designed for multi-user usage as the counter values are reset after each dump and there's no per-user context
- only accessible to root users
- no counters naming/position abstraction. Things are dumped in a raw format that has to be parsed by the user who has to know where the relevant values are and what they mean
This implementation is intended to be used by mesa developers to help debug perf-related issues while we work on a more generic approach that would allow all GPU drivers to expose their counters in a consistent way. As a result, this debugfs interface is considered unstable and might be deprecated in the future.
Signed-off-by: Boris Brezillon boris.brezillon@collabora.com
Changes in v2:
- Expose counters through debugfs and keep things simple for now (we'll work on a generic solution in parallel)
drivers/gpu/drm/panfrost/Makefile | 3 +- drivers/gpu/drm/panfrost/panfrost_device.c | 9 + drivers/gpu/drm/panfrost/panfrost_device.h | 3 + drivers/gpu/drm/panfrost/panfrost_drv.c | 7 + drivers/gpu/drm/panfrost/panfrost_gpu.c | 7 + drivers/gpu/drm/panfrost/panfrost_perfcnt.c | 339 ++++++++++++++++++++ drivers/gpu/drm/panfrost/panfrost_perfcnt.h | 15 + drivers/gpu/drm/panfrost/panfrost_regs.h | 19 ++ 8 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.c create mode 100644 drivers/gpu/drm/panfrost/panfrost_perfcnt.h
[...]
new file mode 100644 index 000000000000..8093783a5a26 --- /dev/null +++ b/drivers/gpu/drm/panfrost/panfrost_perfcnt.c @@ -0,0 +1,339 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright 2019 Collabora Ltd */
+#include <drm/drm_file.h> +#include <drm/drm_gem_shmem_helper.h> +#include <drm/panfrost_drm.h> +#include <linux/completion.h> +#include <linux/debugfs.h> +#include <linux/iopoll.h> +#include <linux/pm_runtime.h> +#include <linux/slab.h> +#include <linux/uaccess.h>
+#include "panfrost_device.h" +#include "panfrost_features.h" +#include "panfrost_gem.h" +#include "panfrost_issues.h" +#include "panfrost_job.h" +#include "panfrost_mmu.h" +#include "panfrost_regs.h"
+#define COUNTERS_PER_BLOCK 64 +#define BYTES_PER_COUNTER 4 +#define BLOCKS_PER_COREGROUP 8 +#define V4_SHADERS_PER_COREGROUP 4
+struct panfrost_perfcnt {
struct panfrost_gem_object *bo;
size_t bosize;
void *buf;
bool enabled;
struct mutex lock;
struct completion dump_comp;
+};
+void panfrost_perfcnt_clean_cache_done(struct panfrost_device *pfdev) +{
complete(&pfdev->perfcnt->dump_comp);
+}
+void panfrost_perfcnt_sample_done(struct panfrost_device *pfdev) +{
gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_CACHES);
+}
+static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{
u32 cfg;
/*
* Always use address space 0 for now.
* FIXME: this needs to be updated when we start using different
* address space.
*/
cfg = GPU_PERFCNT_CFG_AS(0);
if (panfrost_model_cmp(pfdev, 0x1000) >= 0)
You've got a couple of these. Perhaps we should add either a model_is_bifrost() helper or an is_bifrost variable to use.
cfg |= GPU_PERFCNT_CFG_SETSEL(1);
mali_kbase only sets this if CONFIG_MALI_PRFCNT_SET_SECONDARY is defined - i.e. this selects a different selection of counters. At at the very least this should be an optional flag that can be set, but I suspect the primary set are the ones you are interested in.
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
if (!pfdev->perfcnt->enabled)
return;
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
/*
* Due to PRLAM-8186 we need to disable the Tiler before we enable HW
* counters.
Seems like you could just always apply the work-around? It doesn't look like it would be much different.
Technically the workaround causes the driver to perform the operation in the wrong order below (write TILER_EN when the mode is MANUAL) - this is a workaround for the hardware issue (and therefore works on that hardware). But I wouldn't like to say it works on other hardware which doesn't have the issue.
*/
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
else
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
+}
+static int panfrost_perfcnt_dump(struct panfrost_device *pfdev) +{
u64 gpuva;
int ret;
reinit_completion(&pfdev->perfcnt->dump_comp);
gpuva = pfdev->perfcnt->bo->node.start << PAGE_SHIFT;
gpu_write(pfdev, GPU_PERFCNT_BASE_LO, gpuva);
gpu_write(pfdev, GPU_PERFCNT_BASE_HI, gpuva >> 32);
gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_SAMPLE);
ret = wait_for_completion_interruptible_timeout(&pfdev->perfcnt->dump_comp,
msecs_to_jiffies(1000));
if (!ret)
ret = -ETIMEDOUT;
else if (ret > 0)
ret = 0;
return ret;
+}
+void panfrost_perfcnt_resume(struct panfrost_device *pfdev)
No suspend handling needed?
+{
if (pfdev->perfcnt)
panfrost_perfcnt_setup(pfdev);
+}
+static ssize_t panfrost_perfcnt_enable_read(struct file *file,
char __user *user_buf,
size_t count, loff_t *ppos)
+{
struct panfrost_device *pfdev = file->private_data;
ssize_t ret;
mutex_lock(&pfdev->perfcnt->lock);
ret = simple_read_from_buffer(user_buf, count, ppos,
pfdev->perfcnt->enabled ? "Y\n" : "N\n",
2);
mutex_unlock(&pfdev->perfcnt->lock);
return ret;
+}
+static ssize_t panfrost_perfcnt_enable_write(struct file *file,
const char __user *user_buf,
size_t count, loff_t *ppos)
+{
struct panfrost_device *pfdev = file->private_data;
bool enable;
int ret;
ret = kstrtobool_from_user(user_buf, count, &enable);
if (ret)
return ret;
ret = pm_runtime_get_sync(pfdev->dev);
if (ret < 0)
return ret;
mutex_lock(&pfdev->perfcnt->lock);
if (enable != pfdev->perfcnt->enabled) {
pfdev->perfcnt->enabled = enable;
panfrost_perfcnt_setup(pfdev);
}
mutex_unlock(&pfdev->perfcnt->lock);
pm_runtime_mark_last_busy(pfdev->dev);
pm_runtime_put_autosuspend(pfdev->dev);
return count;
+}
+static const struct file_operations panfrost_perfcnt_enable_fops = {
.read = panfrost_perfcnt_enable_read,
.write = panfrost_perfcnt_enable_write,
.open = simple_open,
.llseek = default_llseek,
+};
+static ssize_t panfrost_perfcnt_dump_read(struct file *file,
char __user *user_buf,
size_t count, loff_t *ppos)
+{
struct panfrost_device *pfdev = file->private_data;
ssize_t ret;
ret = pm_runtime_get_sync(pfdev->dev);
if (ret < 0)
return ret;
mutex_lock(&pfdev->perfcnt->lock);
if (!pfdev->perfcnt->enabled) {
ret = -EINVAL;
goto out;
}
ret = panfrost_perfcnt_dump(pfdev);
if (ret)
goto out;
ret = simple_read_from_buffer(user_buf, count, ppos,
pfdev->perfcnt->buf,
pfdev->perfcnt->bosize);
+out:
mutex_unlock(&pfdev->perfcnt->lock);
pm_runtime_mark_last_busy(pfdev->dev);
pm_runtime_put_autosuspend(pfdev->dev);
return ret;
+}
+static const struct file_operations panfrost_perfcnt_dump_fops = {
.read = panfrost_perfcnt_dump_read,
.open = simple_open,
.llseek = default_llseek,
+};
+int panfrost_perfcnt_debugfs_init(struct drm_minor *minor) +{
struct panfrost_device *pfdev = to_panfrost_device(minor->dev);
struct dentry *file, *dir;
dir = debugfs_create_dir("perfcnt", minor->debugfs_root);
if (IS_ERR(dir))
return PTR_ERR(dir);
file = debugfs_create_file("dump", 0400, dir, pfdev,
&panfrost_perfcnt_dump_fops);
if (IS_ERR(file))
return PTR_ERR(file);
file = debugfs_create_file("enable", 0600, dir, pfdev,
&panfrost_perfcnt_enable_fops);
if (IS_ERR(file))
return PTR_ERR(file);
return 0;
+}
+int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{
struct panfrost_perfcnt *perfcnt;
struct drm_gem_shmem_object *bo;
size_t size;
u32 status;
int ret;
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
unsigned int ncoregroups;
ncoregroups = hweight64(pfdev->features.l2_present);
size = ncoregroups * BLOCKS_PER_COREGROUP *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
} else {
unsigned int nl2c, ncores;
/*
* TODO: define a macro to extract the number of l2 caches from
* mem_features.
*/
nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
/*
* The ARM driver is grouping cores per core group and then
* only using the number of cores in group 0 to calculate the
* size. Not sure why this is done like that, but I guess
* shader_present will only show cores in the first group
* anyway.
*/
I'm not sure I understand this comment. I'm not sure which version of the driver you are looking at, but shader_present should contain all the cores (in every core group).
The kbase driver (in panfrost's git tree[1]), contains:
base_gpu_props *props = &kbdev->gpu_props.props; u32 nr_l2 = props->l2_props.num_l2_slices; u64 core_mask = props->coherency_info.group[0].core_mask;
u32 nr_blocks = fls64(core_mask);
/* JM and tiler counter blocks are always present */ dump_size = (2 + nr_l2 + nr_blocks) * NR_CNT_PER_BLOCK * NR_BYTES_PER_CNT;
[1] https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/produc...
i.e. the number of cores + number of L2 caches (plus 2 for JM/tiler) multiplied by the block size.
Also another difference with the below is that fls64 and hweight64 don't return the same numbers. If the core mask is non-contiguous this will return different values.
ncores = hweight64(pfdev->features.shader_present);
/*
* There's always one JM and one Tiler block, hence the '+ 2'
* here.
*/
size = (nl2c + ncores + 2) *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
}
perfcnt = devm_kzalloc(pfdev->dev, sizeof(*perfcnt), GFP_KERNEL);
if (!perfcnt)
return -ENOMEM;
[...]
Steve
Hi Steve,
On Thu, 16 May 2019 17:21:39 +0100 Steven Price steven.price@arm.com wrote:
+static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{
u32 cfg;
/*
* Always use address space 0 for now.
* FIXME: this needs to be updated when we start using different
* address space.
*/
cfg = GPU_PERFCNT_CFG_AS(0);
if (panfrost_model_cmp(pfdev, 0x1000) >= 0)
You've got a couple of these. Perhaps we should add either a model_is_bifrost() helper or an is_bifrost variable to use.
cfg |= GPU_PERFCNT_CFG_SETSEL(1);
mali_kbase only sets this if CONFIG_MALI_PRFCNT_SET_SECONDARY is defined
- i.e. this selects a different selection of counters. At at the very
least this should be an optional flag that can be set, but I suspect the primary set are the ones you are interested in.
I'll add a param to pass the set of counters to monitor.
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
if (!pfdev->perfcnt->enabled)
return;
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
/*
* Due to PRLAM-8186 we need to disable the Tiler before we enable HW
* counters.
Seems like you could just always apply the work-around? It doesn't look like it would be much different.
Technically the workaround causes the driver to perform the operation in the wrong order below (write TILER_EN when the mode is MANUAL) - this is a workaround for the hardware issue (and therefore works on that hardware). But I wouldn't like to say it works on other hardware which doesn't have the issue.
Okay, I'll keep the workaround only for HW that are impacted by the bug.
*/
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
else
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
+}
[...]
+int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{
struct panfrost_perfcnt *perfcnt;
struct drm_gem_shmem_object *bo;
size_t size;
u32 status;
int ret;
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
unsigned int ncoregroups;
ncoregroups = hweight64(pfdev->features.l2_present);
size = ncoregroups * BLOCKS_PER_COREGROUP *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
} else {
unsigned int nl2c, ncores;
/*
* TODO: define a macro to extract the number of l2 caches from
* mem_features.
*/
nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
/*
* The ARM driver is grouping cores per core group and then
* only using the number of cores in group 0 to calculate the
* size. Not sure why this is done like that, but I guess
* shader_present will only show cores in the first group
* anyway.
*/
I'm not sure I understand this comment. I'm not sure which version of the driver you are looking at, but shader_present should contain all the cores (in every core group).
The kbase driver (in panfrost's git tree[1]), contains:
base_gpu_props *props = &kbdev->gpu_props.props; u32 nr_l2 = props->l2_props.num_l2_slices; u64 core_mask = props->coherency_info.group[0].core_mask;
u32 nr_blocks = fls64(core_mask);
/* JM and tiler counter blocks are always present */ dump_size = (2 + nr_l2 + nr_blocks) * NR_CNT_PER_BLOCK * NR_BYTES_PER_CNT;
[1] https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/produc...
i.e. the number of cores + number of L2 caches (plus 2 for JM/tiler) multiplied by the block size.
Actually, I was referring to what's done in kbase_gpuprops_construct_coherent_groups() [2].
Also another difference with the below is that fls64 and hweight64 don't return the same numbers. If the core mask is non-contiguous this will return different values.
Right, I should use fls64() not hweight64().
Regards,
Boris
[2]https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/produc...
On Tue, 28 May 2019 15:53:48 +0200 Boris Brezillon boris.brezillon@collabora.com wrote:
Hi Steve,
On Thu, 16 May 2019 17:21:39 +0100 Steven Price steven.price@arm.com wrote:
+static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{
u32 cfg;
/*
* Always use address space 0 for now.
* FIXME: this needs to be updated when we start using different
* address space.
*/
cfg = GPU_PERFCNT_CFG_AS(0);
if (panfrost_model_cmp(pfdev, 0x1000) >= 0)
You've got a couple of these. Perhaps we should add either a model_is_bifrost() helper or an is_bifrost variable to use.
cfg |= GPU_PERFCNT_CFG_SETSEL(1);
mali_kbase only sets this if CONFIG_MALI_PRFCNT_SET_SECONDARY is defined
- i.e. this selects a different selection of counters. At at the very
least this should be an optional flag that can be set, but I suspect the primary set are the ones you are interested in.
I'll add a param to pass the set of counters to monitor.
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
if (!pfdev->perfcnt->enabled)
return;
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
/*
* Due to PRLAM-8186 we need to disable the Tiler before we enable HW
* counters.
Seems like you could just always apply the work-around? It doesn't look like it would be much different.
Technically the workaround causes the driver to perform the operation in the wrong order below (write TILER_EN when the mode is MANUAL) - this is a workaround for the hardware issue (and therefore works on that hardware). But I wouldn't like to say it works on other hardware which doesn't have the issue.
Okay, I'll keep the workaround only for HW that are impacted by the bug.
*/
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
else
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
+}
[...]
+int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{
struct panfrost_perfcnt *perfcnt;
struct drm_gem_shmem_object *bo;
size_t size;
u32 status;
int ret;
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
unsigned int ncoregroups;
ncoregroups = hweight64(pfdev->features.l2_present);
size = ncoregroups * BLOCKS_PER_COREGROUP *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
} else {
unsigned int nl2c, ncores;
/*
* TODO: define a macro to extract the number of l2 caches from
* mem_features.
*/
nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
/*
* The ARM driver is grouping cores per core group and then
* only using the number of cores in group 0 to calculate the
* size. Not sure why this is done like that, but I guess
* shader_present will only show cores in the first group
* anyway.
*/
I'm not sure I understand this comment. I'm not sure which version of the driver you are looking at, but shader_present should contain all the cores (in every core group).
The kbase driver (in panfrost's git tree[1]), contains:
base_gpu_props *props = &kbdev->gpu_props.props; u32 nr_l2 = props->l2_props.num_l2_slices; u64 core_mask = props->coherency_info.group[0].core_mask;
u32 nr_blocks = fls64(core_mask);
/* JM and tiler counter blocks are always present */ dump_size = (2 + nr_l2 + nr_blocks) * NR_CNT_PER_BLOCK * NR_BYTES_PER_CNT;
[1] https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/produc...
i.e. the number of cores + number of L2 caches (plus 2 for JM/tiler) multiplied by the block size.
Actually, I was referring to what's done in kbase_gpuprops_construct_coherent_groups() [2].
Also another difference with the below is that fls64 and hweight64 don't return the same numbers. If the core mask is non-contiguous this will return different values.
Right, I should use fls64() not hweight64().
Oops, forgot to ask the extra question I had. Looks like when the GPU is busy and I enable/dump/disable perfcnt too quickly, I sometime gets {JM,SHADER,MMU_L2,TILER}_COUNTER[2] = 0 while I'd expect it to be set to the {JM,SHADER,MMU_L2,TILER}_EN value (0xff or 0xffff depending on the block). I made sure I was receiving the SAMPLE_DONE+CLEAN_DONE events. Any idea what could cause this?
Note that when the GPU is idle, {JM,SHADER,MMU_L2,TILER}_COUNTER[2] are set to the value I expect (0xff or 0xffff depending on the block).
On Tue, 28 May 2019 16:01:15 +0200 Boris Brezillon boris.brezillon@collabora.com wrote:
On Tue, 28 May 2019 15:53:48 +0200 Boris Brezillon boris.brezillon@collabora.com wrote:
Hi Steve,
On Thu, 16 May 2019 17:21:39 +0100 Steven Price steven.price@arm.com wrote:
+static void panfrost_perfcnt_setup(struct panfrost_device *pfdev) +{
u32 cfg;
/*
* Always use address space 0 for now.
* FIXME: this needs to be updated when we start using different
* address space.
*/
cfg = GPU_PERFCNT_CFG_AS(0);
if (panfrost_model_cmp(pfdev, 0x1000) >= 0)
You've got a couple of these. Perhaps we should add either a model_is_bifrost() helper or an is_bifrost variable to use.
cfg |= GPU_PERFCNT_CFG_SETSEL(1);
mali_kbase only sets this if CONFIG_MALI_PRFCNT_SET_SECONDARY is defined
- i.e. this selects a different selection of counters. At at the very
least this should be an optional flag that can be set, but I suspect the primary set are the ones you are interested in.
I'll add a param to pass the set of counters to monitor.
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
if (!pfdev->perfcnt->enabled)
return;
gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
/*
* Due to PRLAM-8186 we need to disable the Tiler before we enable HW
* counters.
Seems like you could just always apply the work-around? It doesn't look like it would be much different.
Technically the workaround causes the driver to perform the operation in the wrong order below (write TILER_EN when the mode is MANUAL) - this is a workaround for the hardware issue (and therefore works on that hardware). But I wouldn't like to say it works on other hardware which doesn't have the issue.
Okay, I'll keep the workaround only for HW that are impacted by the bug.
*/
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
else
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
gpu_write(pfdev, GPU_PERFCNT_CFG,
cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
+}
[...]
+int panfrost_perfcnt_init(struct panfrost_device *pfdev) +{
struct panfrost_perfcnt *perfcnt;
struct drm_gem_shmem_object *bo;
size_t size;
u32 status;
int ret;
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
unsigned int ncoregroups;
ncoregroups = hweight64(pfdev->features.l2_present);
size = ncoregroups * BLOCKS_PER_COREGROUP *
COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
} else {
unsigned int nl2c, ncores;
/*
* TODO: define a macro to extract the number of l2 caches from
* mem_features.
*/
nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
/*
* The ARM driver is grouping cores per core group and then
* only using the number of cores in group 0 to calculate the
* size. Not sure why this is done like that, but I guess
* shader_present will only show cores in the first group
* anyway.
*/
I'm not sure I understand this comment. I'm not sure which version of the driver you are looking at, but shader_present should contain all the cores (in every core group).
The kbase driver (in panfrost's git tree[1]), contains:
base_gpu_props *props = &kbdev->gpu_props.props; u32 nr_l2 = props->l2_props.num_l2_slices; u64 core_mask = props->coherency_info.group[0].core_mask;
u32 nr_blocks = fls64(core_mask);
/* JM and tiler counter blocks are always present */ dump_size = (2 + nr_l2 + nr_blocks) * NR_CNT_PER_BLOCK * NR_BYTES_PER_CNT;
[1] https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/produc...
i.e. the number of cores + number of L2 caches (plus 2 for JM/tiler) multiplied by the block size.
Actually, I was referring to what's done in kbase_gpuprops_construct_coherent_groups() [2].
Also another difference with the below is that fls64 and hweight64 don't return the same numbers. If the core mask is non-contiguous this will return different values.
Right, I should use fls64() not hweight64().
Oops, forgot to ask the extra question I had. Looks like when the GPU is busy and I enable/dump/disable perfcnt too quickly, I sometime gets {JM,SHADER,MMU_L2,TILER}_COUNTER[2] = 0 while I'd expect it to be set to the {JM,SHADER,MMU_L2,TILER}_EN value (0xff or 0xffff depending on the block). I made sure I was receiving the SAMPLE_DONE+CLEAN_DONE events. Any idea what could cause this?
Note that when the GPU is idle, {JM,SHADER,MMU_L2,TILER}_COUNTER[2] are set to the value I expect (0xff or 0xffff depending on the block).
I think I found the issue. Looks like drm_gem_shmem_vmap() is mapping the BO in cached mode and there's no helper to do cache maintenance operation on this BO (dma_map/unmap_sg()). Mapping it writecombine seems to do the trick (at least I no longer have those inconsistencies on {JM,SHADER,MMU_L2,TILER}_COUNTER[2]).
dri-devel@lists.freedesktop.org