This is a follow-up to the RFC I made previously to introduce a cgroup controller for the GPU/DRM subsystem [v1,v2,v3]. The goal is to provide resource management for GPU resources using mechanisms such as containers.
With this RFC v4, I am hoping to reach some consensus on a merge plan. I believe the GEM-related resources (drm.buffer.*) introduced in the previous RFC and, hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are uncontroversial and ready to move out of RFC and into a more formal review. I will continue to work on the memory backend resources (drm.memory.*).
The cover letter from v1 is copied below for reference.
[v1]: https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
[v2]: https://www.spinics.net/lists/cgroups/msg22074.html
[v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html
v4:
Unchanged (no review needed):
* drm.memory.*/ttm resources (Patch 9-13; I am still working on memory
  bandwidth and shrinker)
Based on feedback on v3:
* updated nomenclature to drmcg
* embedded per-device drmcg properties into drm_device
* split GEM buffer related commits into stats and limit
* renamed functions to align with convention
* combined buffer accounting and check into a try_charge function
* support buffer stats without limit enforcement
* removed GEM buffer sharing limitation
* updated documentation
New features:
* introduced the logical GPU concept
* example implementation with AMD KFD
v3:
Based on feedback on v2:
* removed .help type file from v2
* conform to cgroup convention for default and max handling
* conform to cgroup convention for addressing device specific limits
  (with major:minor)
New functions:
* adopted memparse for memory size related attributes
* added macro to marshal drmcgrp cftype private data (DRMCG_CTF_PRIV, etc.)
* added ttm buffer usage stats (per cgroup, for system, tt, vram)
* added ttm buffer usage limit (per cgroup, for vram)
* added per-cgroup bandwidth stats and limiting (burst and average
  bandwidth)
v2:
* removed the vendoring concepts
* added limit to total buffer allocation
* added limit to the maximum size of a buffer allocation
v1: cover letter
The purpose of this patch series is to start a discussion for a generic cgroup controller for the drm subsystem. The design proposed here is a very early one. We are hoping to engage the community as we develop the idea.
Background
==========
Control Groups/cgroup provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour, such as accounting for and limiting the resources which processes in a cgroup can access [1]. Weights, limits, protections, and allocations are the main resource distribution models. Existing cgroup controllers include cpu, memory, io, rdma, and more. cgroup is one of the foundational technologies that enables the popular container application deployment and management method.
Direct Rendering Manager/drm contains code intended to support the needs of complex graphics devices. Graphics drivers in the kernel may make use of DRM functions to make tasks like memory management, interrupt handling and DMA easier, and to provide a uniform interface to applications. DRM has also grown beyond traditional graphics applications to support compute/GPGPU applications.
Motivation
==========
As GPUs grow beyond the realm of desktop/workstation graphics into areas like data center clusters and IoT, there is an increasing need to monitor and regulate the GPU as a resource, like cpu, memory and io.
Matt Roper from Intel began working on a similar idea in early 2018 [2] for the purpose of managing GPU priority using the cgroup hierarchy. While that particular use case may not warrant a standalone drm cgroup controller, there are other use cases where having one can be useful [3]. Monitoring GPU resources such as VRAM, buffers, CUs (compute units [AMD's nomenclature])/EUs (execution units [Intel's nomenclature]), and GPU job scheduling [4] can help sysadmins get a better understanding of an application's usage profile. Further regulation of the aforementioned resources can also help sysadmins optimize workload deployment on limited GPU resources.
With the increased importance of machine learning, data science and other cloud-based applications, GPUs are already in production use in data centers today [5,6,7]. Existing GPU resource management is very coarse-grained, however, as sysadmins are only able to distribute workloads on a per-GPU basis [8]. An alternative is to use GPU virtualization (with or without SR-IOV), but it generally acts on the entire GPU instead of specific resources within a GPU. With a drm cgroup controller, we can enable alternative, fine-grained, sub-GPU resource management (in addition to what may be available via GPU virtualization).
In addition to production use, the DRM cgroup can also help with testing graphics application robustness by providing a means to artificially limit the DRM resources available to the applications.
Challenges
==========
While there is common infrastructure in DRM that is shared across many vendors (the scheduler [4], for example), there are also aspects of DRM that are vendor specific. To accommodate this, we borrowed the mechanism the cgroup subsystem uses to handle different kinds of cgroup controllers.
Resources in DRM are also often device (GPU) specific rather than system specific, and a system may contain more than one GPU. For this, we borrowed some ideas from the RDMA cgroup controller.
Approach
========
To experiment with the idea of a DRM cgroup, we would like to start with basic accounting and statistics, then continue to iterate and add regulating mechanisms into the driver.
[1] https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
[2] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
[3] https://www.spinics.net/lists/cgroups/msg20720.html
[4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler
[5] https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
[6] https://blog.openshift.com/gpu-accelerated-sql-queries-with-postgresql-pg-st...
[7] https://github.com/RadeonOpenCompute/k8s-device-plugin
[8] https://github.com/kubernetes/kubernetes/issues/52757
Kenny Ho (16):
  drm: Add drm_minor_for_each
  cgroup: Introduce cgroup for drm subsystem
  drm, cgroup: Initialize drmcg properties
  drm, cgroup: Add total GEM buffer allocation stats
  drm, cgroup: Add peak GEM buffer allocation stats
  drm, cgroup: Add GEM buffer allocation count stats
  drm, cgroup: Add total GEM buffer allocation limit
  drm, cgroup: Add peak GEM buffer allocation limit
  drm, cgroup: Add TTM buffer allocation stats
  drm, cgroup: Add TTM buffer peak usage stats
  drm, cgroup: Add per cgroup bw measure and control
  drm, cgroup: Add soft VRAM limit
  drm, cgroup: Allow more aggressive memory reclaim
  drm, cgroup: Introduce lgpu as DRM cgroup resource
  drm, cgroup: add update trigger after limit change
  drm/amdgpu: Integrate with DRM cgroup
 Documentation/admin-guide/cgroup-v2.rst        |  163 +-
 Documentation/cgroup-v1/drm.rst                |    1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h     |    4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |   29 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c     |    6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c        |    3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c       |    6 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h          |    3 +
 .../amd/amdkfd/kfd_process_queue_manager.c     |  140 ++
 drivers/gpu/drm/drm_drv.c                      |   26 +
 drivers/gpu/drm/drm_gem.c                      |   16 +-
 drivers/gpu/drm/drm_internal.h                 |    4 -
 drivers/gpu/drm/ttm/ttm_bo.c                   |   93 ++
 drivers/gpu/drm/ttm/ttm_bo_util.c              |    4 +
 include/drm/drm_cgroup.h                       |  122 ++
 include/drm/drm_device.h                       |    7 +
 include/drm/drm_drv.h                          |   23 +
 include/drm/drm_gem.h                          |   13 +-
 include/drm/ttm/ttm_bo_api.h                   |    2 +
 include/drm/ttm/ttm_bo_driver.h                |   10 +
 include/linux/cgroup_drm.h                     |  151 ++
 include/linux/cgroup_subsys.h                  |    4 +
 init/Kconfig                                   |    5 +
 kernel/cgroup/Makefile                         |    1 +
 kernel/cgroup/drm.c                            | 1367 +++++++++++++++++
 25 files changed, 2193 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/cgroup-v1/drm.rst
 create mode 100644 include/drm/drm_cgroup.h
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c
To allow other subsystems to iterate through all stored DRM minors and act upon them.
Also expose drm_minor_acquire and drm_minor_release so that other subsystems can handle drm_minor. The DRM cgroup controller is the initial consumer of this new functionality.
Change-Id: I7c4b67ce6b31f06d1037b03435386ff5b8144ca5
Signed-off-by: Kenny Ho <Kenny.Ho@amd.com>
---
 drivers/gpu/drm/drm_drv.c      | 19 +++++++++++++++++++
 drivers/gpu/drm/drm_internal.h |  4 ----
 include/drm/drm_drv.h          |  4 ++++
 3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 862621494a93..000cddabd970 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -254,11 +254,13 @@ struct drm_minor *drm_minor_acquire(unsigned int minor_id)
 
 	return minor;
 }
+EXPORT_SYMBOL(drm_minor_acquire);
 
 void drm_minor_release(struct drm_minor *minor)
 {
 	drm_dev_put(minor->dev);
 }
+EXPORT_SYMBOL(drm_minor_release);
 
 /**
  * DOC: driver instance overview
@@ -1078,6 +1080,23 @@ int drm_dev_set_unique(struct drm_device *dev, const char *name)
 }
 EXPORT_SYMBOL(drm_dev_set_unique);
 
+/**
+ * drm_minor_for_each - Iterate through all stored DRM minors
+ * @fn: Function to be called for each pointer.
+ * @data: Data passed to callback function.
+ *
+ * The callback function will be called for each @drm_minor entry, passing
+ * the minor, the entry and @data.
+ *
+ * If @fn returns anything other than %0, the iteration stops and that
+ * value is returned from this function.
+ */
+int drm_minor_for_each(int (*fn)(int id, void *p, void *data), void *data)
+{
+	return idr_for_each(&drm_minors_idr, fn, data);
+}
+EXPORT_SYMBOL(drm_minor_for_each);
+
 /*
  * DRM Core
  * The DRM core module initializes all global DRM objects and makes them
diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
index e19ac7ca602d..6bfad76f8e78 100644
--- a/drivers/gpu/drm/drm_internal.h
+++ b/drivers/gpu/drm/drm_internal.h
@@ -54,10 +54,6 @@ void drm_prime_destroy_file_private(struct drm_prime_file_private *prime_fpriv)
 void drm_prime_remove_buf_handle_locked(struct drm_prime_file_private *prime_fpriv,
 					struct dma_buf *dma_buf);
 
-/* drm_drv.c */
-struct drm_minor *drm_minor_acquire(unsigned int minor_id);
-void drm_minor_release(struct drm_minor *minor);
-
 /* drm_vblank.c */
 void drm_vblank_disable_and_save(struct drm_device *dev, unsigned int pipe);
 void drm_vblank_cleanup(struct drm_device *dev);
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 68ca736c548d..24f8d054c570 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -799,5 +799,9 @@ static inline bool drm_drv_uses_atomic_modeset(struct drm_device *dev)
 
 int drm_dev_set_unique(struct drm_device *dev, const char *name);
 
+int drm_minor_for_each(int (*fn)(int id, void *p, void *data), void *data);
+
+struct drm_minor *drm_minor_acquire(unsigned int minor_id);
+void drm_minor_release(struct drm_minor *minor);
 
 #endif
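For readers unfamiliar with idr_for_each-style iteration, here is a minimal sketch of how a consumer such as the proposed cgroup controller might use this helper (the drmcg_* names below are illustrative, not part of this patch):

/* Sketch: visit every registered DRM minor and do per-device cgroup
 * setup.  drmcg_init_device()/drmcg_activate() are hypothetical names. */
static int drmcg_init_device(int id, void *ptr, void *data)
{
	struct drm_minor *minor = ptr;	/* each idr entry is a drm_minor */

	drmcg_activate(minor->dev);	/* minor->dev is the owning drm_device */
	return 0;			/* non-zero would stop the iteration */
}

	/* ... during controller initialization ... */
	drm_minor_for_each(&drmcg_init_device, NULL);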
On Thu, Aug 29, 2019 at 02:05:18AM -0400, Kenny Ho wrote:
> To allow other subsystems to iterate through all stored DRM minors and
> act upon them.
>
> Also expose drm_minor_acquire and drm_minor_release so that other
> subsystems can handle drm_minor. The DRM cgroup controller is the
> initial consumer of this new functionality.
>
> Change-Id: I7c4b67ce6b31f06d1037b03435386ff5b8144ca5
> Signed-off-by: Kenny Ho <Kenny.Ho@amd.com>
Iterating over minors for cgroups sounds very, very wrong. Why do we care whether a buffer was allocated through kms dumb vs render nodes?
I'd expect all the cgroup stuff to only work on drm_device, if it does care about devices.
(I didn't look through the patch series to find out where exactly you're using this, so maybe I'm off the rails here).
-Daniel
On Tue, Sep 3, 2019 at 3:57 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, Aug 29, 2019 at 02:05:18AM -0400, Kenny Ho wrote:
> > To allow other subsystems to iterate through all stored DRM minors
> > and act upon them.
> >
> > Also expose drm_minor_acquire and drm_minor_release so that other
> > subsystems can handle drm_minor. The DRM cgroup controller is the
> > initial consumer of this new functionality.
>
> Iterating over minors for cgroups sounds very, very wrong. Why do we
> care whether a buffer was allocated through kms dumb vs render nodes?
>
> I'd expect all the cgroup stuff to only work on drm_device, if it does
> care about devices.
>
> (I didn't look through the patch series to find out where exactly
> you're using this, so maybe I'm off the rails here).
I am exposing this to remove the need to keep track of a separate list of available drm_device in the system (to remove the registering and unregistering of drm_device to the cgroup subsystem and just use drm_minor as the single source of truth). I am only filtering out the render node minors because they point to the same drm_device and it is confusing.
Perhaps I missed an obvious way to list the drm devices without iterating through the drm_minors? (I probably jumped to the minors because $major:$minor is the convention to address devices in cgroup.)
Kenny
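[To illustrate the filtering described above, a sketch only; the callback name is hypothetical. The per-minor callback can skip everything but the primary minor so each drm_device is visited exactly once:]

static int drmcg_minor_cb(int id, void *ptr, void *data)
{
	struct drm_minor *minor = ptr;

	/* Render (and control) minors alias the same drm_device as the
	 * primary minor; act only on the primary to avoid visiting a
	 * device more than once. */
	if (minor->type != DRM_MINOR_PRIMARY)
		return 0;

	/* ... per-drm_device cgroup work on minor->dev ... */
	return 0;
}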
On Tue, Sep 3, 2019 at 9:45 PM Kenny Ho <y2kenny@gmail.com> wrote:
> On Tue, Sep 3, 2019 at 3:57 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > Iterating over minors for cgroups sounds very, very wrong. Why do we
> > care whether a buffer was allocated through kms dumb vs render nodes?
> >
> > I'd expect all the cgroup stuff to only work on drm_device, if it
> > does care about devices.
> >
> > (I didn't look through the patch series to find out where exactly
> > you're using this, so maybe I'm off the rails here).
>
> I am exposing this to remove the need to keep track of a separate list
> of available drm_device in the system (to remove the registering and
> unregistering of drm_device to the cgroup subsystem and just use
> drm_minor as the single source of truth). I am only filtering out the
> render node minors because they point to the same drm_device and it is
> confusing.
>
> Perhaps I missed an obvious way to list the drm devices without
> iterating through the drm_minors? (I probably jumped to the minors
> because $major:$minor is the convention to address devices in cgroup.)
Create your own if there's nothing, because you need to anyway:
- You need special locking anyway, we can't just block on the idr lock
  for everything.
- This needs to refcount drm_device, not the minors.
Iterating over stuff still feels kinda wrong, because normally the way we register/unregister userspace api (and cgroups isn't anything else from a drm driver pov) is by adding more calls to drm_dev_register/unregister. If you put a drm_cg_register/unregister call in there we have a clean separation, and you can track all the currently active devices however you want. Iterating over objects that can be hotunplugged any time tends to get really complicated really quickly.
-Daniel
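[To make the suggested shape concrete, a rough sketch follows; drm_cg_register/unregister are placeholder names from the paragraph above, and the exact hook placement is an assumption, not code from this series:]

/* Sketch: cgroup hookup driven by device registration, mirroring how
 * other userspace API pieces are wired up in drm_dev_register(). */
int drm_dev_register(struct drm_device *dev, unsigned long flags)
{
	/* ... existing minor/driver registration ... */

	drm_cg_register(dev);	/* start cgroup tracking for this device */
	return 0;
}

void drm_dev_unregister(struct drm_device *dev)
{
	drm_cg_unregister(dev);	/* stop tracking before teardown */

	/* ... existing unregistration ... */
}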
On Tue, Sep 3, 2019 at 4:12 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> On Tue, Sep 3, 2019 at 9:45 PM Kenny Ho <y2kenny@gmail.com> wrote:
> > I am exposing this to remove the need to keep track of a separate
> > list of available drm_device in the system (to remove the registering
> > and unregistering of drm_device to the cgroup subsystem and just use
> > drm_minor as the single source of truth). I am only filtering out the
> > render node minors because they point to the same drm_device and it
> > is confusing.
> >
> > Perhaps I missed an obvious way to list the drm devices without
> > iterating through the drm_minors? (I probably jumped to the minors
> > because $major:$minor is the convention to address devices in cgroup.)
>
> Create your own if there's nothing, because you need to anyway:
> - You need special locking anyway, we can't just block on the idr lock
>   for everything.
> - This needs to refcount drm_device, not the minors.
>
> Iterating over stuff still feels kinda wrong, because normally the way
> we register/unregister userspace api (and cgroups isn't anything else
> from a drm driver pov) is by adding more calls to
> drm_dev_register/unregister. If you put a drm_cg_register/unregister
> call in there we have a clean separation, and you can track all the
> currently active devices however you want. Iterating over objects that
> can be hotunplugged any time tends to get really complicated really
> quickly.
Um... I thought this is what I had previously. Did I misunderstand your feedback on v3? Doesn't drm_minor already include all these facilities, so isn't creating my own kind of reinventing the wheel (as I did previously)? drm_minor_register is called inside drm_dev_register, so isn't leveraging the existing drm_minor facilities a much better solution?
Kenny
On Tue, Sep 03, 2019 at 04:43:45PM -0400, Kenny Ho wrote:
> On Tue, Sep 3, 2019 at 4:12 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Tue, Sep 3, 2019 at 9:45 PM Kenny Ho <y2kenny@gmail.com> wrote:
> > > Perhaps I missed an obvious way to list the drm devices without
> > > iterating through the drm_minors? (I probably jumped to the minors
> > > because $major:$minor is the convention to address devices in
> > > cgroup.)
> >
> > Create your own if there's nothing, because you need to anyway:
> > - You need special locking anyway, we can't just block on the idr
> >   lock for everything.
> > - This needs to refcount drm_device, not the minors.
> >
> > Iterating over stuff still feels kinda wrong, because normally the
> > way we register/unregister userspace api (and cgroups isn't anything
> > else from a drm driver pov) is by adding more calls to
> > drm_dev_register/unregister. If you put a drm_cg_register/unregister
> > call in there we have a clean separation, and you can track all the
> > currently active devices however you want. Iterating over objects
> > that can be hotunplugged any time tends to get really complicated
> > really quickly.
>
> Um... I thought this is what I had previously. Did I misunderstand
> your feedback on v3? Doesn't drm_minor already include all these
> facilities, so isn't creating my own kind of reinventing the wheel (as
> I did previously)? drm_minor_register is called inside
> drm_dev_register, so isn't leveraging the existing drm_minor
> facilities a much better solution?
Hm the previous version already dropped out of my inbox, so it's hard to find it again. And I couldn't find this in the archives. Do you have pointers?

I thought the previous version did cgroup init separately from drm_device setup, and I guess I suggested that it should be moved into drm_dev_register/unregister?

Anyway, I don't think reusing the drm_minor registration makes sense, since we want to be on the drm_device, not on the minor. Which is a bit awkward for cgroups, which wants to identify devices using major:minor pairs. But I guess drm is the first subsystem where 1 device can be exposed through multiple minors ...

Tejun, any suggestions on this?

Anyway, I think just leveraging existing code because it can be abused to make it fit for us doesn't make sense. E.g. for the kms side we also don't piggy-back on top of drm_minor_register (it would be technically possible), but instead we have drm_modeset_register_all().
-Daniel
(resent in plain text mode)
Hi Daniel,
This is the previous patch relevant to this discussion: https://patchwork.freedesktop.org/patch/314343/
So before I refactored the code to leverage drm_minor, I kept my own list of "known" drm_device inside the controller and had explicit register and unregister functions to init per-device cgroup defaults. For v4, I refactored the per-device cgroup properties and embedded them into drm_device, and continue to use only the primary minor as a way to index the device, as in v3.
Regards, Kenny
On Wed, Sep 4, 2019 at 4:54 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> Hm the previous version already dropped out of my inbox, so it's hard
> to find it again. And I couldn't find this in the archives. Do you
> have pointers?
>
> I thought the previous version did cgroup init separately from
> drm_device setup, and I guess I suggested that it should be moved into
> drm_dev_register/unregister?
>
> Anyway, I don't think reusing the drm_minor registration makes sense,
> since we want to be on the drm_device, not on the minor. Which is a
> bit awkward for cgroups, which wants to identify devices using
> major:minor pairs. But I guess drm is the first subsystem where 1
> device can be exposed through multiple minors ...
>
> Tejun, any suggestions on this?
>
> Anyway, I think just leveraging existing code because it can be abused
> to make it fit for us doesn't make sense. E.g. for the kms side we
> also don't piggy-back on top of drm_minor_register (it would be
> technically possible), but instead we have drm_modeset_register_all().
> -Daniel
On Thu, Sep 5, 2019 at 8:28 PM Kenny Ho <y2kenny@gmail.com> wrote:
> (resent in plain text mode)
>
> Hi Daniel,
>
> This is the previous patch relevant to this discussion:
> https://patchwork.freedesktop.org/patch/314343/
Ah yes, thanks for finding that.
> So before I refactored the code to leverage drm_minor, I kept my own
> list of "known" drm_device inside the controller and had explicit
> register and unregister functions to init per-device cgroup defaults.
> For v4, I refactored the per-device cgroup properties and embedded them
> into drm_device, and continue to use only the primary minor as a way to
> index the device, as in v3.
I didn't really like the explicit registration step, at least for the basic cgroup controls (like gem buffer limits), and suggested that it should happen automatically at drm_dev_register/unregister time. I also talked about picking a consistent minor (if we have to use minors; I would still like Tejun to confirm what we should do here), but that was an unrelated comment. So doing auto-registration on drm_minor was one step too far.
Just doing a drm_cg_register/unregister pair that's called from drm_dev_register/unregister, and then, if you want, looking up the right minor (I think always picking the render node makes sense for this, and skipping if there's no render node) would make the most sense. At least for the basic cgroup controllers which are generic across drivers.
-Daniel
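[A minimal sketch of the "pick the render node, skip if absent" suggestion; drmcg_register is a placeholder name, and while struct drm_device does carry a render minor pointer, the body here is an assumption:]

/* Sketch: identify a device to the cgroup controller by its render
 * node's $major:$minor, skipping devices without a render node. */
static void drmcg_register(struct drm_device *dev)
{
	struct drm_minor *minor = dev->render;
	dev_t devt;

	if (!minor)	/* no render node: skip for now */
		return;

	devt = MKDEV(DRM_MAJOR, minor->index);	/* cgroup-facing identity */
	/* ... register devt with the cgroup controller ... */
}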
On Thu, Sep 5, 2019 at 4:06 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, Sep 5, 2019 at 8:28 PM Kenny Ho <y2kenny@gmail.com> wrote:
> > This is the previous patch relevant to this discussion:
> > https://patchwork.freedesktop.org/patch/314343/
>
> Ah yes, thanks for finding that.
>
> > So before I refactored the code to leverage drm_minor, I kept my own
> > list of "known" drm_device inside the controller and had explicit
> > register and unregister functions to init per-device cgroup defaults.
> > For v4, I refactored the per-device cgroup properties and embedded
> > them into drm_device, and continue to use only the primary minor as a
> > way to index the device, as in v3.
>
> I didn't really like the explicit registration step, at least for the
> basic cgroup controls (like gem buffer limits), and suggested that it
> should happen automatically at drm_dev_register/unregister time. I also
> talked about picking a consistent minor (if we have to use minors; I
> would still like Tejun to confirm what we should do here), but that was
> an unrelated comment. So doing auto-registration on drm_minor was one
> step too far.
How about your comments on embedding properties into drm_device? I am actually still not clear on the downside of using drm_minor this way. With this implementation in v4, there isn't additional state that can go out of sync with the ground truth of drm_device from the perspective of drm_minor. Wouldn't the issue with hotplugging drm devices you described earlier get worse if the cgroup controller kept its own list?
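[For reference, a sketch of what "embedding the per-device drmcg properties into drm_device" from the v4 changelog can look like; the struct and member names here are illustrative rather than quoted from the series:]

/* Sketch: per-device cgroup state living inside drm_device, so its
 * lifetime exactly matches the device and no side list is needed. */
struct drmcg_props {
	bool limit_enforced;			/* driver opted into limits? */
	s64 bo_limits_total_allocated_default;	/* default GEM budget */
};

struct drm_device {
	/* ... existing members ... */
	struct drmcg_props drmcg_props;
};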
> Just doing a drm_cg_register/unregister pair that's called from
> drm_dev_register/unregister, and then, if you want, looking up the
> right minor (I think always picking the render node makes sense for
> this, and skipping if there's no render node) would make the most
> sense. At least for the basic cgroup controllers which are generic
> across drivers.
Why do we want to skip drm devices that do not have a render node rather than just using the primary node instead?
Kenny
-Daniel
Regards, Kenny
On Wed, Sep 4, 2019 at 4:54 AM Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Sep 03, 2019 at 04:43:45PM -0400, Kenny Ho wrote:
On Tue, Sep 3, 2019 at 4:12 PM Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Sep 3, 2019 at 9:45 PM Kenny Ho y2kenny@gmail.com wrote:
On Tue, Sep 3, 2019 at 3:57 AM Daniel Vetter daniel@ffwll.ch wrote: > Iterating over minors for cgroups sounds very, very wrong. Why do we care > whether a buffer was allocated through kms dumb vs render nodes? > > I'd expect all the cgroup stuff to only work on drm_device, if it does > care about devices. > > (I didn't look through the patch series to find out where exactly you're > using this, so maybe I'm off the rails here).
I am exposing this to remove the need to keep track of a separate list of available drm_device in the system (i.e. to remove the registering and unregistering of drm_device with the cgroup subsystem and just use drm_minor as the single source of truth.) I am only filtering out the render-node minors because they point to the same drm_device and are confusing.
Perhaps I missed an obvious way to list the drm devices without iterating through the drm_minors? (I probably jumped to the minors because $major:$minor is the convention to address devices in cgroup.)
Create your own if there's nothing, because you need to anyway:
- You need special locking anyway; we can't just block on the idr lock for everything.
- This needs to refcount drm_device, not the minors.
Iterating over stuff still feels kinda wrong, because normally the way we register/unregister userspace api (and cgroups isn't anything else from a drm driver pov) is by adding more calls to drm_dev_register/unregister. If you put a drm_cg_register/unregister call in there we have a clean separation, and you can track all the currently active devices however you want. Iterating over objects that can be hotunplugged at any time tends to get really complicated really quickly.
Um... I thought this is what I had previously. Did I misunderstand your feedback from v3? Doesn't drm_minor already include all these facilities, so isn't creating my own kind of reinventing the wheel (as I did previously)? drm_minor_register is called inside drm_dev_register, so isn't leveraging the existing drm_minor facilities a much better solution?
Hm the previous version already dropped out of my inbox, so it's hard to find it again. And I couldn't find this in the archives. Do you have pointers?
I thought the previous version did cgroup init separately from drm_device setup, and I guess I suggested that it should be moved into drm_dev_register/unregister?
Anyway, I don't think reusing the drm_minor registration makes sense, since we want to be on the drm_device, not on the minor. Which is a bit awkward for cgroups, which wants to identify devices using major.minor pairs. But I guess drm is the first subsystem where 1 device can be exposed through multiple minors ...
Tejun, any suggestions on this?
Anyway, I don't think leveraging existing code just because it can be bent to fit makes sense. E.g. for the kms side we also don't piggy-back on top of drm_minor_register (it would be technically possible), but instead we have drm_modeset_register_all(). -Daniel
Kenny
On Thu, Sep 5, 2019 at 10:21 PM Kenny Ho y2kenny@gmail.com wrote:
How about your comments on embedding properties into drm_device? I am actually still not clear on the downside of using drm_minor this way. With this implementation in v4, there isn't additional state that can go out of sync with the ground truth of drm_device from the perspective of drm_minor. Wouldn't the issue you described earlier with hotplugging drm devices get worse if the cgroup controller kept its own list?
drm_dev_unregister gets called on hotunplug, so your cgroup-internal tracking won't get out of sync any more than the drm_minor list gets out of sync with drm_devices. The trouble with drm_minor is just that cgroup doesn't track allocations on drm_minor (that's just the uapi flavour), but on the underlying drm_device. So really doesn't make much sense to attach cgroup tracking to the drm_minor.
Why do we want to skip drm devices that do not have a render node instead of just using the primary node?
I guess we could also take the primary node, but drivers with only a primary node are generally display-only drm drivers. Not sure we want cgroups on those (but I guess it can't hurt, and it's more consistent). But then we'd always need to pick the primary node for cgroup identification purposes. -Daniel
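(For concreteness, a rough sketch of the register/unregister pairing described above; the drmcg_register_device()/drmcg_unregister_device() names, the device list, and the drmcg_node field in drm_device are all hypothetical, not part of the posted series:)

    #include <linux/list.h>
    #include <linux/mutex.h>
    #include <drm/drm_device.h>

    /* Sketch only: cgroup hooks driven by drm_device registration, so
     * the controller never has to iterate over minors. The controller
     * keeps its own list of live devices, under its own lock.
     */
    static LIST_HEAD(drmcg_devices);          /* devices known to drmcg */
    static DEFINE_MUTEX(drmcg_devices_mutex);

    /* called near the end of drm_dev_register() */
    void drmcg_register_device(struct drm_device *dev)
    {
            mutex_lock(&drmcg_devices_mutex);
            list_add_tail(&dev->drmcg_node, &drmcg_devices); /* hypothetical field */
            mutex_unlock(&drmcg_devices_mutex);
    }

    /* called at the start of drm_dev_unregister() */
    void drmcg_unregister_device(struct drm_device *dev)
    {
            mutex_lock(&drmcg_devices_mutex);
            list_del(&dev->drmcg_node);
            mutex_unlock(&drmcg_devices_mutex);
    }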
On Thu, Sep 5, 2019 at 4:32 PM Daniel Vetter daniel@ffwll.ch wrote:
*snip*
drm_dev_unregister gets called on hotunplug, so your cgroup-internal tracking won't get out of sync any more than the drm_minor list gets out of sync with drm_devices. The trouble with drm_minor is just that cgroup doesn't track allocations on drm_minor (that's just the uapi flavour), but on the underlying drm_device. So really doesn't make much sense to attach cgroup tracking to the drm_minor.
Um... I think I get what you are saying, but isn't this a matter of the cgroup controller doing a drm_dev_get when using the drm_minor? Or would that not work because it's possible to have a valid drm_minor with an invalid drm_device in it? I understand it's an extra level of indirection, but since the convention for addressing devices in cgroup is $major:$minor, I don't see a way to escape this. (Tejun actually already made a comment on my earlier RFC where I didn't follow the major:minor convention strictly.)
Kenny
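(For reference, the acquire/release pattern under discussion, sketched with the drm_minor_acquire()/drm_minor_release() exports from the patch quoted earlier; the drmcg_with_device() wrapper is a hypothetical name:)

    #include <linux/err.h>
    #include <drm/drm_drv.h>
    #include <drm/drm_device.h>

    /* Sketch: resolve a minor index to its drm_device while holding a
     * reference. drm_minor_acquire() takes a reference on minor->dev,
     * so the device stays valid until drm_minor_release() drops it.
     */
    static int drmcg_with_device(unsigned int minor_id,
                                 int (*fn)(struct drm_device *dev, void *data),
                                 void *data)
    {
            struct drm_minor *minor = drm_minor_acquire(minor_id);
            int ret;

            if (IS_ERR(minor))
                    return PTR_ERR(minor);

            ret = fn(minor->dev, data);     /* device is pinned here */

            drm_minor_release(minor);       /* pairs with the get in acquire */
            return ret;
    }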
On Thu, Sep 05, 2019 at 05:26:08PM -0400, Kenny Ho wrote:
Um... I think I get what you are saying, but isn't this a matter of the cgroup controller doing a drm_dev_get when using the drm_minor? Or would that not work because it's possible to have a valid drm_minor with an invalid drm_device in it? I understand it's an extra level of indirection, but since the convention for addressing devices in cgroup is $major:$minor, I don't see a way to escape this. (Tejun actually already made a comment on my earlier RFC where I didn't follow the major:minor convention strictly.)
drm_device is the object that controls lifetime and everything, that's why you need to do a drm_dev_get and all that in some places. Going through the minor really feels like a distraction.
And yes we have a bit of a mess between cgroups insisting on using the minor, and drm_device having more than 1 minor for the same underlying physical resource. Just because the uapi is a bit of a mess in that regard doesn't mean we should pull that mess into the kernel implementation imo. -Daniel
Hello,
On Wed, Sep 04, 2019 at 10:54:34AM +0200, Daniel Vetter wrote:
Anyway, I don't think reusing the drm_minor registration makes sense, since we want to be on the drm_device, not on the minor. Which is a bit awkward for cgroups, which wants to identify devices using major.minor pairs. But I guess drm is the first subsystem where 1 device can be exposed through multiple minors ...
Tejun, any suggestions on this?
I'm not extremely attached to maj:min. It's nice in that it'd be consistent with blkcg, but it already isn't the nicest of identifiers for block devices. If using maj:min is reasonably straightforward for gpus even if not perfect, I'd prefer going with maj:min. Otherwise, please feel free to use the ID best for GPUs - hopefully something which is easy to understand, consistent with IDs used elsewhere and easy to build tooling around.
Thanks.
On Fri, Sep 6, 2019 at 5:29 PM Tejun Heo tj@kernel.org wrote:
Block devices are a great example I think. How do you handle the partitions on that? For drm we also have a main minor interface, and then the render-only interface on drivers that support it. So if blkcg handles that by only exposing the primary maj:min pair, I think we can go with that and it's all nicely consistent. -Daniel
Hello, Daniel.
On Fri, Sep 06, 2019 at 05:36:02PM +0200, Daniel Vetter wrote:
Block devices are a great example I think. How do you handle the partitions on that? For drm we also have a main minor interface, and
cgroup IO controllers only distribute hardware IO capacity and are blind to partitions. As there's always the whole device MAJ:MIN for block devices, we only use that.
then the render-only interface on drivers that support it. So if blkcg handles that by only exposing the primary maj:min pair, I think we can go with that and it's all nicely consistent.
Ah yeah, that sounds equivalent. Great.
Thanks.
With the increased importance of machine learning, data science and other cloud-based applications, GPUs are already in production use in data centers today. Existing GPU resource management is very coarse-grained, however, as sysadmins are only able to distribute workloads on a per-GPU basis. An alternative is to use GPU virtualization (with or without SRIOV), but it generally acts on the entire GPU instead of specific resources within a GPU. With a drm cgroup controller, we can enable alternate, fine-grained, sub-GPU resource management (in addition to what may be available via GPU virtualization.)
Change-Id: I6830d3990f63f0c13abeba29b1d330cf28882831
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 Documentation/admin-guide/cgroup-v2.rst | 18 ++++-
 Documentation/cgroup-v1/drm.rst         |  1 +
 include/linux/cgroup_drm.h              | 92 +++++++++++++++++++++++++
 include/linux/cgroup_subsys.h           |  4 ++
 init/Kconfig                            |  5 ++
 kernel/cgroup/Makefile                  |  1 +
 kernel/cgroup/drm.c                     | 42 +++++++++++
 7 files changed, 161 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/cgroup-v1/drm.rst
 create mode 100644 include/linux/cgroup_drm.h
 create mode 100644 kernel/cgroup/drm.c
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 88e746074252..2936423a3fd5 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -61,8 +61,10 @@ v1 is available under Documentation/cgroup-v1/. 5-6. Device 5-7. RDMA 5-7-1. RDMA Interface Files - 5-8. Misc - 5-8-1. perf_event + 5-8. DRM + 5-8-1. DRM Interface Files + 5-9. Misc + 5-9-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -1889,6 +1891,18 @@ RDMA Interface Files ocrdma1 hca_handle=1 hca_object=23
+DRM +--- + +The "drm" controller regulates the distribution and accounting +of DRM (Direct Rendering Manager) and GPU-related resources. + +DRM Interface Files +~~~~~~~~~~~~~~~~~~~~ + +TODO + + Misc ----
diff --git a/Documentation/cgroup-v1/drm.rst b/Documentation/cgroup-v1/drm.rst new file mode 100644 index 000000000000..5f5658e1f5ed --- /dev/null +++ b/Documentation/cgroup-v1/drm.rst @@ -0,0 +1 @@ +Please see ../cgroup-v2.rst for details diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h new file mode 100644 index 000000000000..971166f9dd78 --- /dev/null +++ b/include/linux/cgroup_drm.h @@ -0,0 +1,92 @@ +/* SPDX-License-Identifier: MIT + * Copyright 2019 Advanced Micro Devices, Inc. + */ +#ifndef _CGROUP_DRM_H +#define _CGROUP_DRM_H + +#ifdef CONFIG_CGROUP_DRM + +#include <linux/cgroup.h> + +/** + * The DRM cgroup controller data structure. + */ +struct drmcg { + struct cgroup_subsys_state css; +}; + +/** + * css_to_drmcg - get the corresponding drmcg ref from a cgroup_subsys_state + * @css: the target cgroup_subsys_state + * + * Return: DRM cgroup that contains the @css + */ +static inline struct drmcg *css_to_drmcg(struct cgroup_subsys_state *css) +{ + return css ? container_of(css, struct drmcg, css) : NULL; +} + +/** + * drmcg_get - get the drmcg reference that a task belongs to + * @task: the target task + * + * This increase the reference count of the css that the @task belongs to + * + * Return: reference to the DRM cgroup the task belongs to + */ +static inline struct drmcg *drmcg_get(struct task_struct *task) +{ + return css_to_drmcg(task_get_css(task, drm_cgrp_id)); +} + +/** + * drmcg_put - put a drmcg reference + * @drmcg: the target drmcg + * + * Put a reference obtained via drmcg_get + */ +static inline void drmcg_put(struct drmcg *drmcg) +{ + if (drmcg) + css_put(&drmcg->css); +} + +/** + * drmcg_parent - find the parent of a drm cgroup + * @cg: the target drmcg + * + * This does not increase the reference count of the parent cgroup + * + * Return: parent DRM cgroup of @cg + */ +static inline struct drmcg *drmcg_parent(struct drmcg *cg) +{ + return css_to_drmcg(cg->css.parent); +} + +#else /* CONFIG_CGROUP_DRM */ + +struct drmcg { +}; + +static inline struct drmcg *css_to_drmcg(struct cgroup_subsys_state *css) +{ + return NULL; +} + +static inline struct drmcg *drmcg_get(struct task_struct *task) +{ + return NULL; +} + +static inline void drmcg_put(struct drmcg *drmcg) +{ +} + +static inline struct drmcg *drmcg_parent(struct drmcg *cg) +{ + return NULL; +} + +#endif /* CONFIG_CGROUP_DRM */ +#endif /* _CGROUP_DRM_H */ diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h index acb77dcff3b4..ddedad809e8b 100644 --- a/include/linux/cgroup_subsys.h +++ b/include/linux/cgroup_subsys.h @@ -61,6 +61,10 @@ SUBSYS(pids) SUBSYS(rdma) #endif
+#if IS_ENABLED(CONFIG_CGROUP_DRM) +SUBSYS(drm) +#endif + /* * The following subsystems are not supported on the default hierarchy. */ diff --git a/init/Kconfig b/init/Kconfig index 8b9ffe236e4f..01d3453f6e04 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -876,6 +876,11 @@ config CGROUP_RDMA Attaching processes with active RDMA resources to the cgroup hierarchy is allowed even if can cross the hierarchy's limit.
+config CGROUP_DRM + bool "DRM controller (EXPERIMENTAL)" + help + Provides accounting and enforcement of resources in the DRM subsystem. + config CGROUP_FREEZER bool "Freezer controller" help diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile index 5d7a76bfbbb7..31f186f58121 100644 --- a/kernel/cgroup/Makefile +++ b/kernel/cgroup/Makefile @@ -4,5 +4,6 @@ obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o freezer.o obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o obj-$(CONFIG_CGROUP_PIDS) += pids.o obj-$(CONFIG_CGROUP_RDMA) += rdma.o +obj-$(CONFIG_CGROUP_DRM) += drm.o obj-$(CONFIG_CPUSETS) += cpuset.o obj-$(CONFIG_CGROUP_DEBUG) += debug.o diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c new file mode 100644 index 000000000000..e97861b3cb30 --- /dev/null +++ b/kernel/cgroup/drm.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: MIT +// Copyright 2019 Advanced Micro Devices, Inc. +#include <linux/slab.h> +#include <linux/cgroup.h> +#include <linux/cgroup_drm.h> + +static struct drmcg *root_drmcg __read_mostly; + +static void drmcg_css_free(struct cgroup_subsys_state *css) +{ + struct drmcg *drmcg = css_to_drmcg(css); + + kfree(drmcg); +} + +static struct cgroup_subsys_state * +drmcg_css_alloc(struct cgroup_subsys_state *parent_css) +{ + struct drmcg *parent = css_to_drmcg(parent_css); + struct drmcg *drmcg; + + drmcg = kzalloc(sizeof(struct drmcg), GFP_KERNEL); + if (!drmcg) + return ERR_PTR(-ENOMEM); + + if (!parent) + root_drmcg = drmcg; + + return &drmcg->css; +} + +struct cftype files[] = { + { } /* terminate */ +}; + +struct cgroup_subsys drm_cgrp_subsys = { + .css_alloc = drmcg_css_alloc, + .css_free = drmcg_css_free, + .early_init = false, + .legacy_cftypes = files, + .dfl_cftypes = files, +};
Hi.
On Thu, Aug 29, 2019 at 02:05:19AM -0400, Kenny Ho Kenny.Ho@amd.com wrote:
> +struct cgroup_subsys drm_cgrp_subsys = {
> +	.css_alloc = drmcg_css_alloc,
> +	.css_free = drmcg_css_free,
> +	.early_init = false,
> +	.legacy_cftypes = files,
Do you really want to expose the DRM controller on v1 hierarchies (where threads of one process can be in different cgroups, or children cgroups compete with their parents)?
> +	.dfl_cftypes = files,
> +};
Just asking, Michal
On Tue, Oct 1, 2019 at 10:31 AM Michal Koutný mkoutny@suse.com wrote:
Do you really want to expose the DRM controller on v1 hierarchies (where threads of one process can be in different cgroups, or children cgroups compete with their parents)?
(Sorry for the delay, I have been distracted by something else.) Yes, I am hoping to make the functionality as widely available as possible since the ecosystem is still transitioning to v2. Do you see an inherent problem with this approach?
Regards, Kenny
On Fri, Nov 29, 2019 at 01:00:36AM -0500, Kenny Ho wrote:
(Sorry for the delay, I have been distracted by something else.) Yes, I am hoping to make the functionality as widely available as possible since the ecosystem is still transitioning to v2. Do you see an inherent problem with this approach?
Integrating with memcg could be more challenging on cgroup1. That's one of the reasons why e.g. cgroup-aware pagecache writeback is only on cgroup2.
drmcg initialization involves allocating a per cgroup, per device data structure and setting the defaults. There are two entry points for drmcg init:
1) When struct drmcg is created via css_alloc, initialization is done for each device
2) When DRM devices are created after drmcgs are created:
   a) A per-device drmcg data structure is allocated at the beginning of DRM device creation, so that drmcg can begin tracking usage statistics.
   b) At the end of DRM device creation, drmcg_device_update is called in case device-specific defaults need to be applied.
Entry point #2 usually applies to the root cgroup, since the root cgroup can be created before DRM devices are available. The drmcg controller will go through all existing drm cgroups and initialize them with the new device accordingly.
Change-Id: I908ee6975ea0585e4c30eafde4599f87094d8c65
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 drivers/gpu/drm/drm_drv.c  |   7 +++
 include/drm/drm_cgroup.h   |  27 ++++++++
 include/drm/drm_device.h   |   7 +++
 include/drm/drm_drv.h      |   9 +++
 include/linux/cgroup_drm.h |  13 ++++
 kernel/cgroup/drm.c        | 123 +++++++++++++++++++++++++++++++++++++
 6 files changed, 186 insertions(+)
 create mode 100644 include/drm/drm_cgroup.h
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c index 000cddabd970..94265eba68ca 100644 --- a/drivers/gpu/drm/drm_drv.c +++ b/drivers/gpu/drm/drm_drv.c @@ -37,6 +37,7 @@ #include <drm/drm_client.h> #include <drm/drm_drv.h> #include <drm/drmP.h> +#include <drm/drm_cgroup.h>
#include "drm_crtc_internal.h" #include "drm_legacy.h" @@ -672,6 +673,7 @@ int drm_dev_init(struct drm_device *dev, mutex_init(&dev->filelist_mutex); mutex_init(&dev->clientlist_mutex); mutex_init(&dev->master_mutex); + mutex_init(&dev->drmcg_mutex);
dev->anon_inode = drm_fs_inode_new(); if (IS_ERR(dev->anon_inode)) { @@ -708,6 +710,7 @@ int drm_dev_init(struct drm_device *dev, if (ret) goto err_setunique;
+ drmcg_device_early_init(dev); return 0;
err_setunique: @@ -722,6 +725,7 @@ int drm_dev_init(struct drm_device *dev, drm_fs_inode_free(dev->anon_inode); err_free: put_device(dev->dev); + mutex_destroy(&dev->drmcg_mutex); mutex_destroy(&dev->master_mutex); mutex_destroy(&dev->clientlist_mutex); mutex_destroy(&dev->filelist_mutex); @@ -798,6 +802,7 @@ void drm_dev_fini(struct drm_device *dev)
put_device(dev->dev);
+ mutex_destroy(&dev->drmcg_mutex); mutex_destroy(&dev->master_mutex); mutex_destroy(&dev->clientlist_mutex); mutex_destroy(&dev->filelist_mutex); @@ -1008,6 +1013,8 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags) dev->dev ? dev_name(dev->dev) : "virtual device", dev->primary->index);
+ drmcg_device_update(dev); + goto out_unlock;
err_minors: diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h new file mode 100644 index 000000000000..bef9f9245924 --- /dev/null +++ b/include/drm/drm_cgroup.h @@ -0,0 +1,27 @@ +/* SPDX-License-Identifier: MIT + * Copyright 2019 Advanced Micro Devices, Inc. + */ +#ifndef __DRM_CGROUP_H__ +#define __DRM_CGROUP_H__ + +/** + * Per DRM device properties for DRM cgroup controller for the purpose + * of storing per device defaults + */ +struct drmcg_props { +}; + +#ifdef CONFIG_CGROUP_DRM + +void drmcg_device_update(struct drm_device *device); +void drmcg_device_early_init(struct drm_device *device); +#else +static inline void drmcg_device_update(struct drm_device *device) +{ +} + +static inline void drmcg_device_early_init(struct drm_device *device) +{ +} +#endif /* CONFIG_CGROUP_DRM */ +#endif /* __DRM_CGROUP_H__ */ diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h index 7f9ef709b2b6..5d7d779a5083 100644 --- a/include/drm/drm_device.h +++ b/include/drm/drm_device.h @@ -8,6 +8,7 @@
#include <drm/drm_hashtab.h> #include <drm/drm_mode_config.h> +#include <drm/drm_cgroup.h>
struct drm_driver; struct drm_minor; @@ -304,6 +305,12 @@ struct drm_device { */ struct drm_fb_helper *fb_helper;
+ /** \name DRM Cgroup */ + /*@{ */ + struct mutex drmcg_mutex; + struct drmcg_props drmcg_props; + /*@} */ + /* Everything below here is for legacy driver, never use! */ /* private: */ #if IS_ENABLED(CONFIG_DRM_LEGACY) diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h index 24f8d054c570..c8a37a08d98d 100644 --- a/include/drm/drm_drv.h +++ b/include/drm/drm_drv.h @@ -660,6 +660,15 @@ struct drm_driver { struct drm_device *dev, uint32_t handle);
+ /** + * @drmcg_custom_init + * + * Optional callback used to initialize drm cgroup per device properties + * such as resource limit defaults. + */ + void (*drmcg_custom_init)(struct drm_device *dev, + struct drmcg_props *props); + /** * @gem_vm_ops: Driver private ops for this object */ diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 971166f9dd78..4ecd44f2ac27 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -6,13 +6,26 @@
#ifdef CONFIG_CGROUP_DRM
+#include <linux/mutex.h> #include <linux/cgroup.h> +#include <drm/drm_file.h> + +/* limit defined per the way drm_minor_alloc operates */ +#define MAX_DRM_DEV (64 * DRM_MINOR_RENDER) + +/** + * Per DRM cgroup, per device resources (such as statistics and limits) + */ +struct drmcg_device_resource { + /* for per device stats */ +};
/** * The DRM cgroup controller data structure. */ struct drmcg { struct cgroup_subsys_state css; + struct drmcg_device_resource *dev_resources[MAX_DRM_DEV]; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index e97861b3cb30..135fdcdc4b51 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -1,28 +1,103 @@ // SPDX-License-Identifier: MIT // Copyright 2019 Advanced Micro Devices, Inc. +#include <linux/export.h> #include <linux/slab.h> #include <linux/cgroup.h> +#include <linux/fs.h> +#include <linux/seq_file.h> +#include <linux/mutex.h> #include <linux/cgroup_drm.h> +#include <linux/kernel.h> +#include <drm/drm_file.h> +#include <drm/drm_drv.h> +#include <drm/drm_device.h> +#include <drm/drm_cgroup.h> + +/* global mutex for drmcg across all devices */ +static DEFINE_MUTEX(drmcg_mutex);
static struct drmcg *root_drmcg __read_mostly;
+static int drmcg_css_free_fn(int id, void *ptr, void *data) +{ + struct drm_minor *minor = ptr; + struct drmcg *drmcg = data; + + if (minor->type != DRM_MINOR_PRIMARY) + return 0; + + kfree(drmcg->dev_resources[minor->index]); + + return 0; +} + static void drmcg_css_free(struct cgroup_subsys_state *css) { struct drmcg *drmcg = css_to_drmcg(css);
+ drm_minor_for_each(&drmcg_css_free_fn, drmcg); + kfree(drmcg); }
+static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) +{ + int minor = dev->primary->index; + struct drmcg_device_resource *ddr = drmcg->dev_resources[minor]; + + if (ddr == NULL) { + ddr = kzalloc(sizeof(struct drmcg_device_resource), + GFP_KERNEL); + + if (!ddr) + return -ENOMEM; + } + + mutex_lock(&dev->drmcg_mutex); + drmcg->dev_resources[minor] = ddr; + + /* set defaults here */ + + mutex_unlock(&dev->drmcg_mutex); + return 0; +} + +static int init_drmcg_fn(int id, void *ptr, void *data) +{ + struct drm_minor *minor = ptr; + struct drmcg *drmcg = data; + + if (minor->type != DRM_MINOR_PRIMARY) + return 0; + + return init_drmcg_single(drmcg, minor->dev); +} + +static inline int init_drmcg(struct drmcg *drmcg, struct drm_device *dev) +{ + if (dev != NULL) + return init_drmcg_single(drmcg, dev); + + return drm_minor_for_each(&init_drmcg_fn, drmcg); +} + static struct cgroup_subsys_state * drmcg_css_alloc(struct cgroup_subsys_state *parent_css) { struct drmcg *parent = css_to_drmcg(parent_css); struct drmcg *drmcg; + int rc;
drmcg = kzalloc(sizeof(struct drmcg), GFP_KERNEL); if (!drmcg) return ERR_PTR(-ENOMEM);
+ rc = init_drmcg(drmcg, NULL); + if (rc) { + drmcg_css_free(&drmcg->css); + return ERR_PTR(rc); + } + if (!parent) root_drmcg = drmcg;
@@ -40,3 +115,51 @@ struct cgroup_subsys drm_cgrp_subsys = { .legacy_cftypes = files, .dfl_cftypes = files, }; + +static inline void drmcg_update_cg_tree(struct drm_device *dev) +{ + /* init cgroups created before registration (i.e. root cgroup) */ + if (root_drmcg != NULL) { + struct cgroup_subsys_state *pos; + struct drmcg *child; + + rcu_read_lock(); + css_for_each_descendant_pre(pos, &root_drmcg->css) { + child = css_to_drmcg(pos); + init_drmcg(child, dev); + } + rcu_read_unlock(); + } +} + +/** + * drmcg_device_update - update DRM cgroups defaults + * @dev: the target DRM device + * + * If @dev has a drmcg_custom_init for the DRM cgroup controller, it will be called + * to set device specific defaults and set the initial values for all existing + * cgroups created prior to @dev become available. + */ +void drmcg_device_update(struct drm_device *dev) +{ + if (dev->driver->drmcg_custom_init) + { + dev->driver->drmcg_custom_init(dev, &dev->drmcg_props); + + drmcg_update_cg_tree(dev); + } +} +EXPORT_SYMBOL(drmcg_device_update); + +/** + * drmcg_device_early_init - initialize device specific resources for DRM cgroups + * @dev: the target DRM device + * + * Allocate and initialize device specific resources for existing DRM cgroups. + * Typically only the root cgroup exists before the initialization of @dev. + */ +void drmcg_device_early_init(struct drm_device *dev) +{ + drmcg_update_cg_tree(dev); +} +EXPORT_SYMBOL(drmcg_device_early_init);
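(To illustrate the driver-facing side of this patch, a sketch of how a driver might wire up the new callback; the foo_* names are hypothetical, and since struct drmcg_props has no members yet at this point in the series, the default shown is only a placeholder in a comment:)

    #include <drm/drm_drv.h>
    #include <drm/drm_cgroup.h>

    /* Hypothetical driver hook, invoked from drmcg_device_update() when
     * the device is registered, to fill in per-device cgroup defaults.
     */
    static void foo_drmcg_custom_init(struct drm_device *dev,
                                      struct drmcg_props *props)
    {
            /* e.g. props->some_default = foo_query_hw_limit(dev);
             * (later patches in the series add the actual fields)
             */
    }

    static struct drm_driver foo_driver = {
            /* ... other driver ops ... */
            .drmcg_custom_init = foo_drmcg_custom_init,
    };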
The drm resources being measured here are the GEM buffer objects. User applications allocate and free these buffers. In addition, a process can allocate a buffer and share it with another process. The consumer of a shared buffer can also outlive the allocator of the buffer.
For the purpose of cgroup accounting and limiting, ownership of a buffer is deemed to be the cgroup to which the allocating process belongs. There is one set of cgroup stats per drm device. Each allocation is charged to the owning cgroup as well as all its ancestors.
Similar to the memory cgroup, migrating a process to a different cgroup does not move the GEM buffer usage that the process accumulated while in the previous cgroup over to the new cgroup.
The following is an example to illustrate some of the operations. Given the following cgroup hierarchy (The letters are cgroup names with R being the root cgroup. The numbers in brackets are processes. The processes are placed with cgroup's 'No Internal Process Constraint' in mind, so no process is placed in cgroup B.)
R (4, 5) ------ A (6)
                 \
                  B ---- C (7,8)
                   \
                    D (9)
Here is a list of operations and the associated effect on the sizes tracked by the cgroups (for simplicity, each buffer is 1 unit in size.)
== == == == == ===================================================
R  A  B  C  D  Ops
== == == == == ===================================================
1  0  0  0  0  4 allocated a buffer
1  0  0  0  0  4 shared a buffer with 5
1  0  0  0  0  4 shared a buffer with 9
2  0  1  0  1  9 allocated a buffer
3  0  2  1  1  7 allocated a buffer
3  0  2  1  1  7 shared a buffer with 8
3  0  2  1  1  7 sharing with 9
3  0  2  1  1  7 release a buffer
3  0  2  1  1  7 migrate to cgroup D
3  0  2  1  1  9 release a buffer from 7
2  0  1  0  1  8 release a buffer from 7 (last ref to shared buf)
== == == == == ===================================================
drm.buffer.stats
	A read-only flat-keyed file which exists on all cgroups. Each entry is keyed by the drm device's major:minor.
Total GEM buffer allocation in bytes.
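For illustration, reading the file yields one line per device, keyed by DRM_MAJOR (226) and the primary minor index; the byte values below are made up:

	226:0 8388608
	226:1 4194304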
Change-Id: I9d662ec50d64bb40a37dbf47f018b2f3a1c033ad
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 Documentation/admin-guide/cgroup-v2.rst |  50 +++++++++-
 drivers/gpu/drm/drm_gem.c               |   9 ++
 include/drm/drm_cgroup.h                |  16 +++
 include/drm/drm_gem.h                   |  11 +++
 include/linux/cgroup_drm.h              |   6 ++
 kernel/cgroup/drm.c                     | 126 ++++++++++++++++++++++++
 6 files changed, 217 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 2936423a3fd5..0e29d136e2f9 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -63,6 +63,7 @@ v1 is available under Documentation/cgroup-v1/. 5-7-1. RDMA Interface Files 5-8. DRM 5-8-1. DRM Interface Files + 5-8-2. GEM Buffer Ownership 5-9. Misc 5-9-1. perf_event 5-N. Non-normative information @@ -1900,7 +1901,54 @@ of DRM (Direct Rendering Manager) and GPU-related resources. DRM Interface Files ~~~~~~~~~~~~~~~~~~~~
-TODO + drm.buffer.stats + A read-only flat-keyed file which exists on all cgroups. Each + entry is keyed by the drm device's major:minor. + + Total GEM buffer allocation in bytes. + +GEM Buffer Ownership +~~~~~~~~~~~~~~~~~~~~ + +For the purpose of cgroup accounting and limiting, ownership of the +buffer is deemed to be the cgroup for which the allocating process +belongs to. There is one cgroup stats per drm device. Each allocation +is charged to the owning cgroup as well as all its ancestors. + +Similar to the memory cgroup, migrating a process to a different cgroup +does not move the GEM buffer usages that the process started while in +previous cgroup, to the new cgroup. + +The following is an example to illustrate some of the operations. Given +the following cgroup hierarchy (The letters are cgroup names with R +being the root cgroup. The numbers in brackets are processes. The +processes are placed with cgroup's 'No Internal Process Constraint' in +mind, so no process is placed in cgroup B.) + +R (4, 5) ------ A (6) + \ + B ---- C (7,8) + \ + D (9) + +Here is a list of operation and the associated effect on the size +track by the cgroups (for simplicity, each buffer is 1 unit in size.) + +== == == == == =================================================== +R A B C D Ops +== == == == == =================================================== +1 0 0 0 0 4 allocated a buffer +1 0 0 0 0 4 shared a buffer with 5 +1 0 0 0 0 4 shared a buffer with 9 +2 0 1 0 1 9 allocated a buffer +3 0 2 1 1 7 allocated a buffer +3 0 2 1 1 7 shared a buffer with 8 +3 0 2 1 1 7 sharing with 9 +3 0 2 1 1 7 release a buffer +3 0 2 1 1 7 migrate to cgroup D +3 0 2 1 1 9 release a buffer from 7 +2 0 1 0 1 8 release a buffer from 7 (last ref to shared buf) +== == == == == ===================================================
Misc diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 50de138c89e0..517b71a6f4d4 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -38,10 +38,12 @@ #include <linux/dma-buf.h> #include <linux/mem_encrypt.h> #include <linux/pagevec.h> +#include <linux/cgroup_drm.h> #include <drm/drmP.h> #include <drm/drm_vma_manager.h> #include <drm/drm_gem.h> #include <drm/drm_print.h> +#include <drm/drm_cgroup.h> #include "drm_internal.h"
/** @file drm_gem.c @@ -159,6 +161,9 @@ void drm_gem_private_object_init(struct drm_device *dev, obj->resv = &obj->_resv;
drm_vma_node_reset(&obj->vma_node); + + obj->drmcg = drmcg_get(current); + drmcg_chg_bo_alloc(obj->drmcg, dev, size); } EXPORT_SYMBOL(drm_gem_private_object_init);
@@ -950,6 +955,10 @@ drm_gem_object_release(struct drm_gem_object *obj) fput(obj->filp);
reservation_object_fini(&obj->_resv); + + drmcg_unchg_bo_alloc(obj->drmcg, obj->dev, obj->size); + drmcg_put(obj->drmcg); + drm_gem_free_mmap_offset(obj); } EXPORT_SYMBOL(drm_gem_object_release); diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index bef9f9245924..1fa37d1ad44c 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -4,6 +4,8 @@ #ifndef __DRM_CGROUP_H__ #define __DRM_CGROUP_H__
+#include <linux/cgroup_drm.h> + /** * Per DRM device properties for DRM cgroup controller for the purpose * of storing per device defaults @@ -15,6 +17,10 @@ struct drmcg_props {
void drmcg_device_update(struct drm_device *device); void drmcg_device_early_init(struct drm_device *device); +void drmcg_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, + size_t size); +void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, + size_t size); #else static inline void drmcg_device_update(struct drm_device *device) { @@ -23,5 +29,15 @@ static inline void drmcg_device_update(struct drm_device *device) static inline void drmcg_device_early_init(struct drm_device *device) { } + +static inline void drmcg_chg_bo_alloc(struct drmcg *drmcg, + struct drm_device *dev, size_t size) +{ +} + +static inline void drmcg_unchg_bo_alloc(struct drmcg *drmcg, + struct drm_device *dev, size_t size) +{ +} #endif /* CONFIG_CGROUP_DRM */ #endif /* __DRM_CGROUP_H__ */ diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index 5047c7ee25f5..6047968bdd17 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -291,6 +291,17 @@ struct drm_gem_object { * */ const struct drm_gem_object_funcs *funcs; + + /** + * @drmcg: + * + * DRM cgroup this GEM object belongs to. + * + * This is used to track and limit the amount of GEM objects a user + * can allocate. Since GEM objects can be shared, this is also used + * to ensure GEM objects are only shared within the same cgroup. + */ + struct drmcg *drmcg; };
/** diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 4ecd44f2ac27..1d8a7f2cdb4e 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -13,11 +13,17 @@ /* limit defined per the way drm_minor_alloc operates */ #define MAX_DRM_DEV (64 * DRM_MINOR_RENDER)
+enum drmcg_res_type { + DRMCG_TYPE_BO_TOTAL, + __DRMCG_TYPE_LAST, +}; + /** * Per DRM cgroup, per device resources (such as statistics and limits) */ struct drmcg_device_resource { /* for per device stats */ + s64 bo_stats_total_allocated; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 135fdcdc4b51..87ae9164d8d8 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -11,11 +11,24 @@ #include <drm/drm_file.h> #include <drm/drm_drv.h> #include <drm/drm_device.h> +#include <drm/drm_ioctl.h> #include <drm/drm_cgroup.h>
/* global mutex for drmcg across all devices */ static DEFINE_MUTEX(drmcg_mutex);
+#define DRMCG_CTF_PRIV_SIZE 3 +#define DRMCG_CTF_PRIV_MASK GENMASK((DRMCG_CTF_PRIV_SIZE - 1), 0) +#define DRMCG_CTF_PRIV(res_type, f_type) ((res_type) <<\ + DRMCG_CTF_PRIV_SIZE | (f_type)) +#define DRMCG_CTF_PRIV2RESTYPE(priv) ((priv) >> DRMCG_CTF_PRIV_SIZE) +#define DRMCG_CTF_PRIV2FTYPE(priv) ((priv) & DRMCG_CTF_PRIV_MASK) + + +enum drmcg_file_type { + DRMCG_FTYPE_STATS, +}; + static struct drmcg *root_drmcg __read_mostly;
static int drmcg_css_free_fn(int id, void *ptr, void *data) @@ -104,7 +117,66 @@ drmcg_css_alloc(struct cgroup_subsys_state *parent_css) return &drmcg->css; }
+static void drmcg_print_stats(struct drmcg_device_resource *ddr, + struct seq_file *sf, enum drmcg_res_type type) +{ + if (ddr == NULL) { + seq_puts(sf, "\n"); + return; + } + + switch (type) { + case DRMCG_TYPE_BO_TOTAL: + seq_printf(sf, "%lld\n", ddr->bo_stats_total_allocated); + break; + default: + seq_puts(sf, "\n"); + break; + } +} + +static int drmcg_seq_show_fn(int id, void *ptr, void *data) +{ + struct drm_minor *minor = ptr; + struct seq_file *sf = data; + struct drmcg *drmcg = css_to_drmcg(seq_css(sf)); + enum drmcg_file_type f_type = + DRMCG_CTF_PRIV2FTYPE(seq_cft(sf)->private); + enum drmcg_res_type type = + DRMCG_CTF_PRIV2RESTYPE(seq_cft(sf)->private); + struct drmcg_device_resource *ddr; + + if (minor->type != DRM_MINOR_PRIMARY) + return 0; + + ddr = drmcg->dev_resources[minor->index]; + + seq_printf(sf, "%d:%d ", DRM_MAJOR, minor->index); + + switch (f_type) { + case DRMCG_FTYPE_STATS: + drmcg_print_stats(ddr, sf, type); + break; + default: + seq_puts(sf, "\n"); + break; + } + + return 0; +} + +int drmcg_seq_show(struct seq_file *sf, void *v) +{ + return drm_minor_for_each(&drmcg_seq_show_fn, sf); +} + struct cftype files[] = { + { + .name = "buffer.total.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_TOTAL, + DRMCG_FTYPE_STATS), + }, { } /* terminate */ };
@@ -163,3 +235,57 @@ void drmcg_device_early_init(struct drm_device *dev) drmcg_update_cg_tree(dev); } EXPORT_SYMBOL(drmcg_device_early_init); + +/** + * drmcg_chg_bo_alloc - charge GEM buffer usage for a device and cgroup + * @drmcg: the DRM cgroup to be charged to + * @dev: the device the usage should be charged to + * @size: size of the GEM buffer to be accounted for + * + * This function should be called when a new GEM buffer is allocated to account + * for the utilization. This should not be called when the buffer is shared ( + * the GEM buffer's reference count being incremented.) + */ +void drmcg_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, + size_t size) +{ + struct drmcg_device_resource *ddr; + int devIdx = dev->primary->index; + + if (drmcg == NULL) + return; + + mutex_lock(&dev->drmcg_mutex); + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + + ddr->bo_stats_total_allocated += (s64)size; + } + mutex_unlock(&dev->drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_chg_bo_alloc); + +/** + * drmcg_unchg_bo_alloc - + * @drmcg: the DRM cgroup to uncharge from + * @dev: the device the usage should be removed from + * @size: size of the GEM buffer to be accounted for + * + * This function should be called when the GEM buffer is about to be freed ( + * not simply when the GEM buffer's reference count is being decremented.) + */ +void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, + size_t size) +{ + int devIdx = dev->primary->index; + + if (drmcg == NULL) + return; + + mutex_lock(&dev->drmcg_mutex); + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) + drmcg->dev_resources[devIdx]->bo_stats_total_allocated + -= (s64)size; + mutex_unlock(&dev->drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_unchg_bo_alloc);
drm.buffer.peak.stats
	A read-only flat-keyed file which exists on all cgroups. Each entry is keyed by the drm device's major:minor.
Largest (high water mark) GEM buffer allocated in bytes.
Change-Id: I79e56222151a3d33a76a61ba0097fe93ebb3449f
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++++++
 include/linux/cgroup_drm.h              |  3 +++
 kernel/cgroup/drm.c                     | 12 ++++++++++++
 3 files changed, 21 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 0e29d136e2f9..8588a0ffc69d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1907,6 +1907,12 @@ DRM Interface Files
Total GEM buffer allocation in bytes.
+ drm.buffer.peak.stats + A read-only flat-keyed file which exists on all cgroups. Each + entry is keyed by the drm device's major:minor. + + Largest (high water mark) GEM buffer allocated in bytes. + GEM Buffer Ownership ~~~~~~~~~~~~~~~~~~~~
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 1d8a7f2cdb4e..974d390cfa4f 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -15,6 +15,7 @@
enum drmcg_res_type { DRMCG_TYPE_BO_TOTAL, + DRMCG_TYPE_BO_PEAK, __DRMCG_TYPE_LAST, };
@@ -24,6 +25,8 @@ enum drmcg_res_type { struct drmcg_device_resource { /* for per device stats */ s64 bo_stats_total_allocated; + + s64 bo_stats_peak_allocated; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 87ae9164d8d8..0bf5b95668c4 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -129,6 +129,9 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, case DRMCG_TYPE_BO_TOTAL: seq_printf(sf, "%lld\n", ddr->bo_stats_total_allocated); break; + case DRMCG_TYPE_BO_PEAK: + seq_printf(sf, "%lld\n", ddr->bo_stats_peak_allocated); + break; default: seq_puts(sf, "\n"); break; @@ -177,6 +180,12 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_TOTAL, DRMCG_FTYPE_STATS), }, + { + .name = "buffer.peak.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_PEAK, + DRMCG_FTYPE_STATS), + }, { } /* terminate */ };
@@ -260,6 +269,9 @@ void drmcg_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, ddr = drmcg->dev_resources[devIdx];
ddr->bo_stats_total_allocated += (s64)size; + + if (ddr->bo_stats_peak_allocated < (s64)size) + ddr->bo_stats_peak_allocated = (s64)size; } mutex_unlock(&dev->drmcg_mutex); }
drm.buffer.count.stats
	A read-only flat-keyed file which exists on all cgroups. Each entry is keyed by the drm device's major:minor.
Total number of GEM buffers allocated.
Change-Id: Id3e1809d5fee8562e47a7d2b961688956d844ec6
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++++++
 include/linux/cgroup_drm.h              |  3 +++
 kernel/cgroup/drm.c                     | 22 +++++++++++++++++++---
 3 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 8588a0ffc69d..4dc72339a9b6 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1913,6 +1913,12 @@ DRM Interface Files
Largest (high water mark) GEM buffer allocated in bytes.
+ drm.buffer.count.stats + A read-only flat-keyed file which exists on all cgroups. Each + entry is keyed by the drm device's major:minor. + + Total number of GEM buffer allocated. + GEM Buffer Ownership ~~~~~~~~~~~~~~~~~~~~
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 974d390cfa4f..972f7aa975b5 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -16,6 +16,7 @@ enum drmcg_res_type { DRMCG_TYPE_BO_TOTAL, DRMCG_TYPE_BO_PEAK, + DRMCG_TYPE_BO_COUNT, __DRMCG_TYPE_LAST, };
@@ -27,6 +28,8 @@ struct drmcg_device_resource { s64 bo_stats_total_allocated;
s64 bo_stats_peak_allocated; + + s64 bo_stats_count_allocated; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 0bf5b95668c4..85e46ece4a82 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -132,6 +132,9 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, case DRMCG_TYPE_BO_PEAK: seq_printf(sf, "%lld\n", ddr->bo_stats_peak_allocated); break; + case DRMCG_TYPE_BO_COUNT: + seq_printf(sf, "%lld\n", ddr->bo_stats_count_allocated); + break; default: seq_puts(sf, "\n"); break; @@ -186,6 +189,12 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_PEAK, DRMCG_FTYPE_STATS), }, + { + .name = "buffer.count.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_COUNT, + DRMCG_FTYPE_STATS), + }, { } /* terminate */ };
@@ -272,6 +281,8 @@ void drmcg_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev,
if (ddr->bo_stats_peak_allocated < (s64)size) ddr->bo_stats_peak_allocated = (s64)size; + + ddr->bo_stats_count_allocated++; } mutex_unlock(&dev->drmcg_mutex); } @@ -289,15 +300,20 @@ EXPORT_SYMBOL(drmcg_chg_bo_alloc); void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size) { + struct drmcg_device_resource *ddr; int devIdx = dev->primary->index;
if (drmcg == NULL) return;
mutex_lock(&dev->drmcg_mutex); - for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) - drmcg->dev_resources[devIdx]->bo_stats_total_allocated - -= (s64)size; + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + + ddr->bo_stats_total_allocated -= (s64)size; + + ddr->bo_stats_count_allocated--; + } mutex_unlock(&dev->drmcg_mutex); } EXPORT_SYMBOL(drmcg_unchg_bo_alloc);
The drm resources being limited here are the GEM buffer objects. User applications allocate and free these buffers. In addition, a process can allocate a buffer and share it with another process, and the consumer of a shared buffer can outlive the allocator of the buffer.

For the purpose of cgroup accounting and limiting, ownership of the buffer is deemed to be the cgroup to which the allocating process belongs. There is one cgroup limit per drm device.

The limiting functionality is added to the previous stats collection function. drm_gem_private_object_init is modified to return a value so that the allocation can fail when the cgroup limit is reached.

The try_chg function only fails if the DRM cgroup properties have limit_enforced set to true for the DRM device. This allows the DRM cgroup controller to collect usage stats without enforcing the limits; drivers opt in as sketched below.
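For drivers, opting in is a one-line hook. A minimal sketch, assuming a hypothetical "mydrv" driver (the amdgpu hunk below does the same thing for real):

	/* Hypothetical driver opting into drmcg limit enforcement.
	 * Drivers that leave limit_enforced false still get usage
	 * stats, but drmcg_try_chg_bo_alloc() never fails for them. */
	static void mydrv_drmcg_custom_init(struct drm_device *dev,
					    struct drmcg_props *props)
	{
		props->limit_enforced = true;
	}

	/* hooked up via struct drm_driver:
	 *	.drmcg_custom_init = mydrv_drmcg_custom_init,
	 */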
drm.buffer.total.default
    A read-only flat-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor.

    Default limits on the total GEM buffer allocation in bytes.

drm.buffer.total.max
    A read-write flat-keyed file which exists on all cgroups. Each
    entry is keyed by the drm device's major:minor.

    Per device limits on the total GEM buffer allocation in bytes.
    This is a hard limit. Attempts to allocate beyond the cgroup
    limit will result in ENOMEM. Shorthand understood by memparse
    (such as k, m, g) can be used.

    Set allocation limit for /dev/dri/card1 to 1GB
      echo "226:1 1g" > drm.buffer.total.max

    Set allocation limit for /dev/dri/card0 to 512MB
      echo "226:0 512m" > drm.buffer.total.max
Change-Id: I96e0b7add4d331ed8bb267b3c9243d360c6e9903 Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- Documentation/admin-guide/cgroup-v2.rst | 21 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 + drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +- drivers/gpu/drm/drm_gem.c | 11 +- include/drm/drm_cgroup.h | 7 +- include/drm/drm_gem.h | 2 +- include/linux/cgroup_drm.h | 1 + kernel/cgroup/drm.c | 221 ++++++++++++++++++++- 8 files changed, 260 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 4dc72339a9b6..e8fac2684179 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1919,6 +1919,27 @@ DRM Interface Files
Total number of GEM buffers allocated.
+ drm.buffer.total.default + A read-only flat-keyed file which exists on the root cgroup. + Each entry is keyed by the drm device's major:minor. + + Default limits on the total GEM buffer allocation in bytes. + + drm.buffer.total.max + A read-write flat-keyed file which exists on all cgroups. Each + entry is keyed by the drm device's major:minor. + + Per device limits on the total GEM buffer allocation in bytes. + This is a hard limit. Attempts to allocate beyond the cgroup + limit will result in ENOMEM. Shorthand understood by memparse + (such as k, m, g) can be used. + + Set allocation limit for /dev/dri/card1 to 1GB + echo "226:1 1g" > drm.buffer.total.max + + Set allocation limit for /dev/dri/card0 to 512MB + echo "226:0 512m" > drm.buffer.total.max + GEM Buffer Ownership ~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index c0bbd3aa0558..163a4fbf0611 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -1395,6 +1395,12 @@ amdgpu_get_crtc_scanout_position(struct drm_device *dev, unsigned int pipe, stime, etime, mode); }
+static void amdgpu_drmcg_custom_init(struct drm_device *dev, + struct drmcg_props *props) +{ + props->limit_enforced = true; +} + static struct drm_driver kms_driver = { .driver_features = DRIVER_USE_AGP | DRIVER_ATOMIC | @@ -1431,6 +1437,8 @@ static struct drm_driver kms_driver = { .gem_prime_vunmap = amdgpu_gem_prime_vunmap, .gem_prime_mmap = amdgpu_gem_prime_mmap,
+ .drmcg_custom_init = amdgpu_drmcg_custom_init, + .name = DRIVER_NAME, .desc = DRIVER_DESC, .date = DRIVER_DATE, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 989b7b55cb2e..b1bd66be3e1a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -34,6 +34,7 @@ #include <drm/drmP.h> #include <drm/amdgpu_drm.h> #include <drm/drm_cache.h> +#include <drm/drm_cgroup.h> #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h" @@ -454,7 +455,10 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev, bo = kzalloc(sizeof(struct amdgpu_bo), GFP_KERNEL); if (bo == NULL) return -ENOMEM; - drm_gem_private_object_init(adev->ddev, &bo->gem_base, size); + if (!drm_gem_private_object_init(adev->ddev, &bo->gem_base, size)) { + kfree(bo); + return -ENOMEM; + } INIT_LIST_HEAD(&bo->shadow_list); bo->vm_bo = NULL; bo->preferred_domains = bp->preferred_domain ? bp->preferred_domain : diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 517b71a6f4d4..7887f153ab83 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -145,11 +145,17 @@ EXPORT_SYMBOL(drm_gem_object_init); * no GEM provided backing store. Instead the caller is responsible for * backing the object and handling it. */ -void drm_gem_private_object_init(struct drm_device *dev, +bool drm_gem_private_object_init(struct drm_device *dev, struct drm_gem_object *obj, size_t size) { BUG_ON((size & (PAGE_SIZE - 1)) != 0);
+ obj->drmcg = drmcg_get(current); + if (!drmcg_try_chg_bo_alloc(obj->drmcg, dev, size)) { + drmcg_put(obj->drmcg); + obj->drmcg = NULL; + return false; + } obj->dev = dev; obj->filp = NULL;
@@ -162,8 +168,7 @@ void drm_gem_private_object_init(struct drm_device *dev,
drm_vma_node_reset(&obj->vma_node);
- obj->drmcg = drmcg_get(current); - drmcg_chg_bo_alloc(obj->drmcg, dev, size); + return true; } EXPORT_SYMBOL(drm_gem_private_object_init);
diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index 1fa37d1ad44c..49c5d35ff6e1 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -11,13 +11,16 @@ * of storing per device defaults */ struct drmcg_props { + bool limit_enforced; + + s64 bo_limits_total_allocated_default; };
#ifdef CONFIG_CGROUP_DRM
void drmcg_device_update(struct drm_device *device); void drmcg_device_early_init(struct drm_device *device); -void drmcg_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, +bool drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size); void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size); @@ -30,7 +33,7 @@ static inline void drmcg_device_early_init(struct drm_device *device) { }
-static inline void drmcg_chg_bo_alloc(struct drmcg *drmcg, +static inline bool drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size) { + return true; } diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index 6047968bdd17..2bf0c0962ddf 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -334,7 +334,7 @@ void drm_gem_object_release(struct drm_gem_object *obj); void drm_gem_object_free(struct kref *kref); int drm_gem_object_init(struct drm_device *dev, struct drm_gem_object *obj, size_t size); -void drm_gem_private_object_init(struct drm_device *dev, +bool drm_gem_private_object_init(struct drm_device *dev, struct drm_gem_object *obj, size_t size); void drm_gem_vm_open(struct vm_area_struct *vma); void drm_gem_vm_close(struct vm_area_struct *vma); diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 972f7aa975b5..eb54e56f20ae 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -26,6 +26,7 @@ enum drmcg_res_type { struct drmcg_device_resource { /* for per device stats */ s64 bo_stats_total_allocated; + s64 bo_limits_total_allocated;
s64 bo_stats_peak_allocated;
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 85e46ece4a82..7161fa40e156 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -27,6 +27,8 @@ static DEFINE_MUTEX(drmcg_mutex);
enum drmcg_file_type { DRMCG_FTYPE_STATS, + DRMCG_FTYPE_LIMIT, + DRMCG_FTYPE_DEFAULT, };
static struct drmcg *root_drmcg __read_mostly; @@ -70,6 +72,8 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) drmcg->dev_resources[minor] = ddr;
/* set defaults here */ + ddr->bo_limits_total_allocated = + dev->drmcg_props.bo_limits_total_allocated_default;
mutex_unlock(&dev->drmcg_mutex); return 0; @@ -141,6 +145,38 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, } }
+static void drmcg_print_limits(struct drmcg_device_resource *ddr, + struct seq_file *sf, enum drmcg_res_type type) +{ + if (ddr == NULL) { + seq_puts(sf, "\n"); + return; + } + + switch (type) { + case DRMCG_TYPE_BO_TOTAL: + seq_printf(sf, "%lld\n", ddr->bo_limits_total_allocated); + break; + default: + seq_puts(sf, "\n"); + break; + } +} + +static void drmcg_print_default(struct drmcg_props *props, + struct seq_file *sf, enum drmcg_res_type type) +{ + switch (type) { + case DRMCG_TYPE_BO_TOTAL: + seq_printf(sf, "%lld\n", + props->bo_limits_total_allocated_default); + break; + default: + seq_puts(sf, "\n"); + break; + } +} + static int drmcg_seq_show_fn(int id, void *ptr, void *data) { struct drm_minor *minor = ptr; @@ -163,6 +199,12 @@ static int drmcg_seq_show_fn(int id, void *ptr, void *data) case DRMCG_FTYPE_STATS: drmcg_print_stats(ddr, sf, type); break; + case DRMCG_FTYPE_LIMIT: + drmcg_print_limits(ddr, sf, type); + break; + case DRMCG_FTYPE_DEFAULT: + drmcg_print_default(&minor->dev->drmcg_props, sf, type); + break; default: seq_puts(sf, "\n"); break; @@ -176,6 +218,124 @@ int drmcg_seq_show(struct seq_file *sf, void *v) return drm_minor_for_each(&drmcg_seq_show_fn, sf); }
+static void drmcg_pr_cft_err(const struct drmcg *drmcg, + int rc, const char *cft_name, int minor) +{ + pr_err("drmcg: error parsing %s, minor %d, rc %d ", + cft_name, minor, rc); + pr_cont_cgroup_name(drmcg->css.cgroup); + pr_cont("\n"); +} + +static int drmcg_process_limit_s64_val(char *sval, bool is_mem, + s64 def_val, s64 max_val, s64 *ret_val) +{ + int rc = strcmp("max", sval); + + + if (!rc) + *ret_val = max_val; + else { + rc = strcmp("default", sval); + + if (!rc) + *ret_val = def_val; + } + + if (rc) { + if (is_mem) { + *ret_val = memparse(sval, NULL); + rc = 0; + } else { + rc = kstrtoll(sval, 0, ret_val); + } + } + + if (*ret_val > max_val) + rc = -EINVAL; + + return rc; +} + +static void drmcg_value_apply(struct drm_device *dev, s64 *dst, s64 val) +{ + mutex_lock(&dev->drmcg_mutex); + *dst = val; + mutex_unlock(&dev->drmcg_mutex); +} + +static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, + size_t nbytes, loff_t off) +{ + struct drmcg *drmcg = css_to_drmcg(of_css(of)); + struct drmcg *parent = drmcg_parent(drmcg); + enum drmcg_res_type type = + DRMCG_CTF_PRIV2RESTYPE(of_cft(of)->private); + char *cft_name = of_cft(of)->name; + char *limits = strstrip(buf); + struct drmcg_device_resource *ddr; + struct drmcg_props *props; + struct drm_minor *dm; + char *line; + char sattr[256]; + s64 val; + s64 p_max; + int rc; + int minor; + + while (limits != NULL) { + line = strsep(&limits, "\n"); + + if (sscanf(line, + __stringify(DRM_MAJOR)":%u %255[^\t\n]", + &minor, sattr) != 2) { + pr_err("drmcg: error parsing %s ", cft_name); + pr_cont_cgroup_name(drmcg->css.cgroup); + pr_cont("\n"); + + continue; + } + + dm = drm_minor_acquire(minor); + if (IS_ERR(dm)) { + pr_err("drmcg: invalid minor %d for %s ", + minor, cft_name); + pr_cont_cgroup_name(drmcg->css.cgroup); + pr_cont("\n"); + + continue; + } + + ddr = drmcg->dev_resources[minor]; + props = &dm->dev->drmcg_props; + switch (type) { + case DRMCG_TYPE_BO_TOTAL: + p_max = parent == NULL ? S64_MAX : + parent->dev_resources[minor]-> + bo_limits_total_allocated; + + rc = drmcg_process_limit_s64_val(sattr, true, + props->bo_limits_total_allocated_default, + p_max, + &val); + + if (rc || val < 0) { + drmcg_pr_cft_err(drmcg, rc, cft_name, minor); + break; + } + + drmcg_value_apply(dm->dev, + &ddr->bo_limits_total_allocated, val); + break; + default: + break; + } + drm_dev_put(dm->dev); /* release from drm_minor_acquire */ + } + + return nbytes; +} + struct cftype files[] = { { .name = "buffer.total.stats", @@ -183,6 +343,20 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_TOTAL, DRMCG_FTYPE_STATS), }, + { + .name = "buffer.total.default", + .seq_show = drmcg_seq_show, + .flags = CFTYPE_ONLY_ON_ROOT, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_TOTAL, + DRMCG_FTYPE_DEFAULT), + }, + { + .name = "buffer.total.max", + .write = drmcg_limit_write, + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_TOTAL, + DRMCG_FTYPE_LIMIT), + }, { .name = "buffer.peak.stats", .seq_show = drmcg_seq_show, @@ -250,12 +424,16 @@ EXPORT_SYMBOL(drmcg_device_update); */ void drmcg_device_early_init(struct drm_device *dev) { + dev->drmcg_props.limit_enforced = false; + + dev->drmcg_props.bo_limits_total_allocated_default = S64_MAX; + drmcg_update_cg_tree(dev); } EXPORT_SYMBOL(drmcg_device_early_init);
/** - * drmcg_chg_bo_alloc - charge GEM buffer usage for a device and cgroup + * drmcg_try_chg_bo_alloc - charge GEM buffer usage for a device and cgroup * @drmcg: the DRM cgroup to be charged to * @dev: the device the usage should be charged to * @size: size of the GEM buffer to be accounted for @@ -264,29 +442,52 @@ EXPORT_SYMBOL(drmcg_device_early_init); * for the utilization. This should not be called when the buffer is shared ( * the GEM buffer's reference count being incremented.) */ -void drmcg_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, +bool drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size) { struct drmcg_device_resource *ddr; int devIdx = dev->primary->index; + struct drmcg_props *props = &dev->drmcg_props; + struct drmcg *drmcg_cur = drmcg; + bool result = true; + s64 delta = 0;
if (drmcg == NULL) - return; + return true;
mutex_lock(&dev->drmcg_mutex); - for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { - ddr = drmcg->dev_resources[devIdx]; + if (props->limit_enforced) { + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + delta = ddr->bo_limits_total_allocated - + ddr->bo_stats_total_allocated; + + if (delta <= 0 || size > delta) { + result = false; + break; + } + } + } + + drmcg = drmcg_cur; + + if (result || !props->limit_enforced) { + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx];
- ddr->bo_stats_total_allocated += (s64)size; + ddr->bo_stats_total_allocated += (s64)size;
- if (ddr->bo_stats_peak_allocated < (s64)size) - ddr->bo_stats_peak_allocated = (s64)size; + if (ddr->bo_stats_peak_allocated < (s64)size) + ddr->bo_stats_peak_allocated = (s64)size;
- ddr->bo_stats_count_allocated++; + ddr->bo_stats_count_allocated++; + } } mutex_unlock(&dev->drmcg_mutex); + + return result; } -EXPORT_SYMBOL(drmcg_chg_bo_alloc); +EXPORT_SYMBOL(drmcg_try_chg_bo_alloc);
/** * drmcg_unchg_bo_alloc -
Hello.
On Thu, Aug 29, 2019 at 02:05:24AM -0400, Kenny Ho Kenny.Ho@amd.com wrote:
drm.buffer.total.default
    A read-only flat-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor.

    Default limits on the total GEM buffer allocation in bytes.
What is the purpose of this attribute (and similar attributes for other resources)? I can't see it being set to anything but S64_MAX in drmcg_device_early_init.
+static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, [...]
switch (type) {
case DRMCG_TYPE_BO_TOTAL:
p_max = parent == NULL ? S64_MAX :
parent->dev_resources[minor]->
bo_limits_total_allocated;
rc = drmcg_process_limit_s64_val(sattr, true,
props->bo_limits_total_allocated_default,
p_max,
&val);
IIUC, this allows initializing the particular limit value based either on the parent or the default per-device value. This is alas rather an antipattern. The most stringent limit on the path from a cgroup to the root should be applied at the charging time. However, the child should not inherit the verbatim value from the parent (may race with parent and it won't be updated upon parent change). You already do the appropriate hierarchical check in drmcg_try_chg_bo_alloc, so the parent propagation could be simply dropped if I'm not mistaken.
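For illustration, the charge-time check in question amounts to roughly the following sketch (field names as in the patch; locking and overflow handling omitted):

	/* Illustrative only: the most stringent limit on the path to
	 * the root wins at charge time, so a child never has to
	 * inherit the parent's configured value. */
	for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) {
		ddr = drmcg->dev_resources[devIdx];
		if (ddr->bo_stats_total_allocated + size >
				ddr->bo_limits_total_allocated)
			return false;	/* some ancestor's limit binds */
	}
	return true;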
Also, I can't find how the read of parent->dev_resources[minor]->bo_limits_total_allocated and its concurrent update are synchronized (i.e. someone writing buffer.total.max for parent and child in parallel). (It may just be my oversight.)
I'm posting this to the buffer knobs patch, but similar concerns apply to the lgpu resource controls as well.
HTH, Michal
On Tue, Oct 1, 2019 at 10:30 AM Michal Koutný mkoutny@suse.com wrote:
On Thu, Aug 29, 2019 at 02:05:24AM -0400, Kenny Ho Kenny.Ho@amd.com wrote:
drm.buffer.total.default
    A read-only flat-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor.

    Default limits on the total GEM buffer allocation in bytes.
What is the purpose of this attribute (and similar attributes for other resources)? I can't see it being set to anything but S64_MAX in drmcg_device_early_init.
cgroup has a number of conventions, one of which is the idea of a default. The idea here is to allow for device specific defaults. For this specific resource, I can probably avoid exposing it since it's not particularly useful, but for other resources (such as the lgpu resource) the concept of a default is useful (for example, different devices can have a different number of lgpus.)
+static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, [...]
switch (type) {
case DRMCG_TYPE_BO_TOTAL:
p_max = parent == NULL ? S64_MAX :
parent->dev_resources[minor]->
bo_limits_total_allocated;
rc = drmcg_process_limit_s64_val(sattr, true,
props->bo_limits_total_allocated_default,
p_max,
&val);
IIUC, this allows initializing the particular limit value based either on the parent or the default per-device value. This is alas rather an antipattern. The most stringent limit on the path from a cgroup to the root should be applied at the charging time. However, the child should not inherit the verbatim value from the parent (may race with parent and it won't be updated upon parent change).
I think this was a mistake during one of my refactors and I shrunk the critical section protected by a mutex a bit too much. But you are right in the sense that I don't propagate the limits downward to the children when the parent's limit is updated. From the user interface perspective, though, wouldn't this be confusing? When a sysadmin sets a limit using the 'max' keyword, the value shown would be a global one even though the actual allowable maximum for the particular cgroup is smaller because of the ancestor cgroups (for example, a child set to 'max' under a parent limited to 1g is still effectively capped at 1g at charge time.) (If this is the established norm, I am ok to go along with it, but it seems confusing to me.) I am probably missing something, because as I implemented this, the 'max' and 'default' semantics have been confusing to me, especially for the children cgroups, due to the context of the ancestors.
You already do the appropriate hierarchical check in drmcg_try_chg_bo_alloc, so the parent propagation could be simply dropped if I'm not mistaken.
I will need to double check. But I think interaction between the parent and children (or perhaps between siblings) will be needed eventually because there seems to be a desire to implement "weight" type resources. Also, from a performance perspective, wouldn't it make more sense to make sure the limits are set correctly during configuration than to have to check all the cgroups up through the parents? I don't have comprehensive knowledge of the implementation of other cgroup controllers, so if more experienced folks can comment that would be great. (Although, I probably should just do one approach instead of doing both... or 1.5.)
Also, I can't find how the read of parent->dev_resources[minor]->bo_limits_total_allocated and its concurrent update are synchronized (i.e. someone writing buffer.total.max for parent and child in parallel). (It may just be my oversight.)
This is probably the refactor mistake I mentioned earlier.
Regards, Kenny
drm.buffer.peak.default
    A read-only flat-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor.

    Default limits on the largest GEM buffer allocation in bytes.

drm.buffer.peak.max
    A read-write flat-keyed file which exists on all cgroups. Each
    entry is keyed by the drm device's major:minor.

    Per device limits on the largest GEM buffer allocation in
    bytes. This is a hard limit. Attempts to allocate beyond the
    cgroup limit will result in ENOMEM. Shorthand understood by
    memparse (such as k, m, g) can be used.

    Set largest allocation for /dev/dri/card1 to 4MB
      echo "226:1 4m" > drm.buffer.peak.max
Change-Id: I0830d56775568e1cf215b56cc892d5e7945e9f25 Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- Documentation/admin-guide/cgroup-v2.rst | 18 ++++++++++ include/drm/drm_cgroup.h | 1 + include/linux/cgroup_drm.h | 1 + kernel/cgroup/drm.c | 48 +++++++++++++++++++++++++ 4 files changed, 68 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index e8fac2684179..87a195133eaa 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1940,6 +1940,24 @@ DRM Interface Files Set allocation limit for /dev/dri/card0 to 512MB echo "226:0 512m" > drm.buffer.total.max
+ drm.buffer.peak.default + A read-only flat-keyed file which exists on the root cgroup. + Each entry is keyed by the drm device's major:minor. + + Default limits on the largest GEM buffer allocation in bytes. + + drm.buffer.peak.max + A read-write flat-keyed file which exists on all cgroups. Each + entry is keyed by the drm device's major:minor. + + Per device limits on the largest GEM buffer allocation in bytes. + This is a hard limit. Attempts to allocate beyond the cgroup + limit will result in ENOMEM. Shorthand understood by memparse + (such as k, m, g) can be used. + + Set largest allocation for /dev/dri/card1 to 4MB + echo "226:1 4m" > drm.buffer.peak.max + GEM Buffer Ownership ~~~~~~~~~~~~~~~~~~~~
diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index 49c5d35ff6e1..d61b90beded5 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -14,6 +14,7 @@ struct drmcg_props { bool limit_enforced;
s64 bo_limits_total_allocated_default; + s64 bo_limits_peak_allocated_default; };
#ifdef CONFIG_CGROUP_DRM diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index eb54e56f20ae..87a2566c9fdd 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -29,6 +29,7 @@ struct drmcg_device_resource { s64 bo_limits_total_allocated;
s64 bo_stats_peak_allocated; + s64 bo_limits_peak_allocated;
s64 bo_stats_count_allocated; }; diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 7161fa40e156..2f54bff291e5 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -75,6 +75,9 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) ddr->bo_limits_total_allocated = dev->drmcg_props.bo_limits_total_allocated_default;
+ ddr->bo_limits_peak_allocated = + dev->drmcg_props.bo_limits_peak_allocated_default; + mutex_unlock(&dev->drmcg_mutex); return 0; } @@ -157,6 +160,9 @@ static void drmcg_print_limits(struct drmcg_device_resource *ddr, case DRMCG_TYPE_BO_TOTAL: seq_printf(sf, "%lld\n", ddr->bo_limits_total_allocated); break; + case DRMCG_TYPE_BO_PEAK: + seq_printf(sf, "%lld\n", ddr->bo_limits_peak_allocated); + break; default: seq_puts(sf, "\n"); break; @@ -171,6 +177,10 @@ static void drmcg_print_default(struct drmcg_props *props, seq_printf(sf, "%lld\n", props->bo_limits_total_allocated_default); break; + case DRMCG_TYPE_BO_PEAK: + seq_printf(sf, "%lld\n", + props->bo_limits_peak_allocated_default); + break; default: seq_puts(sf, "\n"); break; @@ -327,6 +337,24 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, drmcg_value_apply(dm->dev, &ddr->bo_limits_total_allocated, val); break; + case DRMCG_TYPE_BO_PEAK: + p_max = parent == NULL ? S64_MAX : + parent->dev_resources[minor]-> + bo_limits_peak_allocated; + + rc = drmcg_process_limit_s64_val(sattr, true, + props->bo_limits_peak_allocated_default, + p_max, + &val); + + if (rc || val < 0) { + drmcg_pr_cft_err(drmcg, rc, cft_name, minor); + break; + } + + drmcg_value_apply(dm->dev, + &ddr->bo_limits_peak_allocated, val); + break; default: break; } @@ -363,6 +391,20 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_PEAK, DRMCG_FTYPE_STATS), }, + { + .name = "buffer.peak.default", + .seq_show = drmcg_seq_show, + .flags = CFTYPE_ONLY_ON_ROOT, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_PEAK, + DRMCG_FTYPE_DEFAULT), + }, + { + .name = "buffer.peak.max", + .write = drmcg_limit_write, + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_PEAK, + DRMCG_FTYPE_LIMIT), + }, { .name = "buffer.count.stats", .seq_show = drmcg_seq_show, @@ -427,6 +469,7 @@ void drmcg_device_early_init(struct drm_device *dev) dev->drmcg_props.limit_enforced = false;
dev->drmcg_props.bo_limits_total_allocated_default = S64_MAX; + dev->drmcg_props.bo_limits_peak_allocated_default = S64_MAX;
drmcg_update_cg_tree(dev); } @@ -466,6 +509,11 @@ bool drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, result = false; break; } + + if (ddr->bo_limits_peak_allocated < size) { + result = false; + break; + } } }
The drm resources being measured here are the TTM (Translation Table Manager) buffers. TTM manages the different types of memory that a GPU might access. These memory types include dedicated Video RAM (VRAM) and host/system memory accessible through IOMMU (GART/GTT). TTM is currently used by multiple drm drivers (amd, ast, bochs, cirrus, hisilicon, mgag200, nouveau, qxl, virtio, vmwgfx.)
drm.memory.stats
    A read-only nested-keyed file which exists on all cgroups.
    Each entry is keyed by the drm device's major:minor. The
    following nested keys are defined.

      ====== =============================================
      system Host/system memory
      tt     Host memory used by the drm device (GTT/GART)
      vram   Video RAM used by the drm device
      priv   Other drm device, vendor specific memory
      ====== =============================================
Reading returns the following::
226:0 system=0 tt=0 vram=0 priv=0
226:1 system=0 tt=9035776 vram=17768448 priv=16809984
226:2 system=0 tt=9035776 vram=17768448 priv=16809984
drm.memory.evict.stats
    A read-only flat-keyed file which exists on all cgroups. Each
    entry is keyed by the drm device's major:minor.

    Total number of evictions.
Change-Id: Ice2c4cc845051229549bebeb6aa2d7d6153bdf6a Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 3 +- drivers/gpu/drm/ttm/ttm_bo.c | 30 +++++++ drivers/gpu/drm/ttm/ttm_bo_util.c | 4 + include/drm/drm_cgroup.h | 19 +++++ include/drm/ttm/ttm_bo_api.h | 2 + include/drm/ttm/ttm_bo_driver.h | 8 ++ include/linux/cgroup_drm.h | 6 ++ kernel/cgroup/drm.c | 108 ++++++++++++++++++++++++ 8 files changed, 179 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index cfcbbdc39656..463e015e8694 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -1720,8 +1720,9 @@ int amdgpu_ttm_init(struct amdgpu_device *adev) mutex_init(&adev->mman.gtt_window_lock);
/* No others user of address space so set it to 0 */ - r = ttm_bo_device_init(&adev->mman.bdev, + r = ttm_bo_device_init_tmp(&adev->mman.bdev, &amdgpu_bo_driver, + adev->ddev, adev->ddev->anon_inode->i_mapping, adev->need_dma32); if (r) { diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 58c403eda04e..a0e9ce46baf3 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -34,6 +34,7 @@ #include <drm/ttm/ttm_module.h> #include <drm/ttm/ttm_bo_driver.h> #include <drm/ttm/ttm_placement.h> +#include <drm/drm_cgroup.h> #include <linux/jiffies.h> #include <linux/slab.h> #include <linux/sched.h> @@ -42,6 +43,7 @@ #include <linux/module.h> #include <linux/atomic.h> #include <linux/reservation.h> +#include <linux/cgroup_drm.h>
static void ttm_bo_global_kobj_release(struct kobject *kobj);
@@ -151,6 +153,10 @@ static void ttm_bo_release_list(struct kref *list_kref) struct ttm_bo_device *bdev = bo->bdev; size_t acc_size = bo->acc_size;
+ if (bo->bdev->ddev != NULL) // TODO: remove after ddev initialized for all + drmcg_unchg_mem(bo); + drmcg_put(bo->drmcg); + BUG_ON(kref_read(&bo->list_kref)); BUG_ON(kref_read(&bo->kref)); BUG_ON(atomic_read(&bo->cpu_writers)); @@ -360,6 +366,8 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo, if (bo->mem.mem_type == TTM_PL_SYSTEM) { if (bdev->driver->move_notify) bdev->driver->move_notify(bo, evict, mem); + if (bo->bdev->ddev != NULL) // TODO: remove after ddev initialized for all + drmcg_mem_track_move(bo, evict, mem); bo->mem = *mem; mem->mm_node = NULL; goto moved; @@ -368,6 +376,8 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo,
if (bdev->driver->move_notify) bdev->driver->move_notify(bo, evict, mem); + if (bo->bdev->ddev != NULL) // TODO: remove after ddev initialized for all + drmcg_mem_track_move(bo, evict, mem);
if (!(old_man->flags & TTM_MEMTYPE_FLAG_FIXED) && !(new_man->flags & TTM_MEMTYPE_FLAG_FIXED)) @@ -381,6 +391,8 @@ static int ttm_bo_handle_move_mem(struct ttm_buffer_object *bo, if (bdev->driver->move_notify) { swap(*mem, bo->mem); bdev->driver->move_notify(bo, false, mem); + if (bo->bdev->ddev != NULL) // TODO: remove after ddev initialized for all + drmcg_mem_track_move(bo, evict, mem); swap(*mem, bo->mem); }
@@ -1355,6 +1367,10 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev, WARN_ON(!locked); }
+ bo->drmcg = drmcg_get(current); + if (bo->bdev->ddev != NULL) // TODO: remove after ddev initialized for all + drmcg_chg_mem(bo); + if (likely(!ret)) ret = ttm_bo_validate(bo, placement, ctx);
@@ -1747,6 +1763,20 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev, } EXPORT_SYMBOL(ttm_bo_device_init);
+/* TODO merge with official function when implementation finalized*/ +int ttm_bo_device_init_tmp(struct ttm_bo_device *bdev, + struct ttm_bo_driver *driver, + struct drm_device *ddev, + struct address_space *mapping, + bool need_dma32) +{ + int ret = ttm_bo_device_init(bdev, driver, mapping, need_dma32); + + bdev->ddev = ddev; + return ret; +} +EXPORT_SYMBOL(ttm_bo_device_init_tmp); + /* * buffer object vm functions. */ diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 895d77d799e4..15acd2c0720e 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -32,6 +32,7 @@ #include <drm/ttm/ttm_bo_driver.h> #include <drm/ttm/ttm_placement.h> #include <drm/drm_vma_manager.h> +#include <drm/drm_cgroup.h> #include <linux/io.h> #include <linux/highmem.h> #include <linux/wait.h> @@ -522,6 +523,9 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, ret = reservation_object_trylock(fbo->base.resv); WARN_ON(!ret);
+ if (bo->bdev->ddev != NULL) // TODO: remove after ddev initialized for all + drmcg_chg_mem(bo); + *new_obj = &fbo->base; return 0; } diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index d61b90beded5..7d63f73a5375 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -5,6 +5,7 @@ #define __DRM_CGROUP_H__
#include <linux/cgroup_drm.h> +#include <drm/ttm/ttm_bo_api.h>
/** * Per DRM device properties for DRM cgroup controller for the purpose @@ -25,6 +26,11 @@ bool drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size); void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size); +void drmcg_chg_mem(struct ttm_buffer_object *tbo); +void drmcg_unchg_mem(struct ttm_buffer_object *tbo); +void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, + struct ttm_mem_reg *new_mem); + #else static inline void drmcg_device_update(struct drm_device *device) { @@ -43,5 +49,18 @@ static inline void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size) { } + +static inline void drmcg_chg_mem(struct ttm_buffer_object *tbo) +{ +} + +static inline void drmcg_unchg_mem(struct ttm_buffer_object *tbo) +{ +} + +static inline void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, + bool evict, struct ttm_mem_reg *new_mem) +{ +} #endif /* CONFIG_CGROUP_DRM */ #endif /* __DRM_CGROUP_H__ */ diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h index 49d9cdfc58f2..839936ab358c 100644 --- a/include/drm/ttm/ttm_bo_api.h +++ b/include/drm/ttm/ttm_bo_api.h @@ -128,6 +128,7 @@ struct ttm_tt; * struct ttm_buffer_object * * @bdev: Pointer to the buffer object device structure. + * @drmcg: DRM cgroup this object belongs to. * @type: The bo type. * @destroy: Destruction function. If NULL, kfree is used. * @num_pages: Actual number of pages. @@ -174,6 +175,7 @@ struct ttm_buffer_object { */
struct ttm_bo_device *bdev; + struct drmcg *drmcg; enum ttm_bo_type type; void (*destroy) (struct ttm_buffer_object *); unsigned long num_pages; diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index c9b8ba492f24..e1a805d65b83 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -30,6 +30,7 @@ #ifndef _TTM_BO_DRIVER_H_ #define _TTM_BO_DRIVER_H_
+#include <drm/drm_device.h> #include <drm/drm_mm.h> #include <drm/drm_vma_manager.h> #include <linux/workqueue.h> @@ -442,6 +443,7 @@ extern struct ttm_bo_global { * @driver: Pointer to a struct ttm_bo_driver struct setup by the driver. * @man: An array of mem_type_managers. * @vma_manager: Address space manager + * @ddev: Pointer to struct drm_device that this ttm_bo_device belongs to * lru_lock: Spinlock that protects the buffer+device lru lists and * ddestroy lists. * @dev_mapping: A pointer to the struct address_space representing the @@ -460,6 +462,7 @@ struct ttm_bo_device { struct ttm_bo_global *glob; struct ttm_bo_driver *driver; struct ttm_mem_type_manager man[TTM_NUM_MEM_TYPES]; + struct drm_device *ddev;
/* * Protected by internal locks. @@ -598,6 +601,11 @@ int ttm_bo_device_init(struct ttm_bo_device *bdev, struct address_space *mapping, bool need_dma32);
+int ttm_bo_device_init_tmp(struct ttm_bo_device *bdev, + struct ttm_bo_driver *driver, + struct drm_device *ddev, + struct address_space *mapping, + bool need_dma32); /** * ttm_bo_unmap_virtual * diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 87a2566c9fdd..4c2794c9333d 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -9,6 +9,7 @@ #include <linux/mutex.h> #include <linux/cgroup.h> #include <drm/drm_file.h> +#include <drm/ttm/ttm_placement.h>
/* limit defined per the way drm_minor_alloc operates */ #define MAX_DRM_DEV (64 * DRM_MINOR_RENDER) @@ -17,6 +18,8 @@ enum drmcg_res_type { DRMCG_TYPE_BO_TOTAL, DRMCG_TYPE_BO_PEAK, DRMCG_TYPE_BO_COUNT, + DRMCG_TYPE_MEM, + DRMCG_TYPE_MEM_EVICT, __DRMCG_TYPE_LAST, };
@@ -32,6 +35,9 @@ struct drmcg_device_resource { s64 bo_limits_peak_allocated;
s64 bo_stats_count_allocated; + + s64 mem_stats[TTM_PL_PRIV+1]; + s64 mem_stats_evict; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 2f54bff291e5..4960a8d1e8f4 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -10,6 +10,8 @@ #include <linux/kernel.h> #include <drm/drm_file.h> #include <drm/drm_drv.h> +#include <drm/ttm/ttm_bo_api.h> +#include <drm/ttm/ttm_bo_driver.h> #include <drm/drm_device.h> #include <drm/drm_ioctl.h> #include <drm/drm_cgroup.h> @@ -31,6 +33,13 @@ enum drmcg_file_type { DRMCG_FTYPE_DEFAULT, };
+static char const *ttm_placement_names[] = { + [TTM_PL_SYSTEM] = "system", + [TTM_PL_TT] = "tt", + [TTM_PL_VRAM] = "vram", + [TTM_PL_PRIV] = "priv", +}; + static struct drmcg *root_drmcg __read_mostly;
static int drmcg_css_free_fn(int id, void *ptr, void *data) @@ -127,6 +136,7 @@ drmcg_css_alloc(struct cgroup_subsys_state *parent_css) static void drmcg_print_stats(struct drmcg_device_resource *ddr, struct seq_file *sf, enum drmcg_res_type type) { + int i; if (ddr == NULL) { seq_puts(sf, "\n"); return; @@ -142,6 +152,16 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, case DRMCG_TYPE_BO_COUNT: seq_printf(sf, "%lld\n", ddr->bo_stats_count_allocated); break; + case DRMCG_TYPE_MEM: + for (i = 0; i <= TTM_PL_PRIV; i++) { + seq_printf(sf, "%s=%lld ", ttm_placement_names[i], + ddr->mem_stats[i]); + } + seq_puts(sf, "\n"); + break; + case DRMCG_TYPE_MEM_EVICT: + seq_printf(sf, "%lld\n", ddr->mem_stats_evict); + break; default: seq_puts(sf, "\n"); break; @@ -411,6 +431,18 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BO_COUNT, DRMCG_FTYPE_STATS), }, + { + .name = "memory.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM, + DRMCG_FTYPE_STATS), + }, + { + .name = "memory.evict.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM_EVICT, + DRMCG_FTYPE_STATS), + }, { } /* terminate */ };
@@ -566,3 +598,79 @@ void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, mutex_unlock(&dev->drmcg_mutex); } EXPORT_SYMBOL(drmcg_unchg_bo_alloc); + +void drmcg_chg_mem(struct ttm_buffer_object *tbo) +{ + struct drm_device *dev = tbo->bdev->ddev; + struct drmcg *drmcg = tbo->drmcg; + int devIdx = dev->primary->index; + s64 size = (s64)(tbo->mem.size); + int mem_type = tbo->mem.mem_type; + struct drmcg_device_resource *ddr; + + if (drmcg == NULL) + return; + + mem_type = mem_type > TTM_PL_PRIV ? TTM_PL_PRIV : mem_type; + + mutex_lock(&dev->drmcg_mutex); + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + ddr->mem_stats[mem_type] += size; + } + mutex_unlock(&dev->drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_chg_mem); + +void drmcg_unchg_mem(struct ttm_buffer_object *tbo) +{ + struct drm_device *dev = tbo->bdev->ddev; + struct drmcg *drmcg = tbo->drmcg; + int devIdx = dev->primary->index; + s64 size = (s64)(tbo->mem.size); + int mem_type = tbo->mem.mem_type; + struct drmcg_device_resource *ddr; + + if (drmcg == NULL) + return; + + mem_type = mem_type > TTM_PL_PRIV ? TTM_PL_PRIV : mem_type; + + mutex_lock(&dev->drmcg_mutex); + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + ddr->mem_stats[mem_type] -= size; + } + mutex_unlock(&dev->drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_unchg_mem); + +void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, + struct ttm_mem_reg *new_mem) +{ + struct drm_device *dev = old_bo->bdev->ddev; + struct drmcg *drmcg = old_bo->drmcg; + s64 move_in_bytes = (s64)(old_bo->mem.size); + int devIdx = dev->primary->index; + int old_mem_type = old_bo->mem.mem_type; + int new_mem_type = new_mem->mem_type; + struct drmcg_device_resource *ddr; + + if (drmcg == NULL) + return; + + old_mem_type = old_mem_type > TTM_PL_PRIV ? TTM_PL_PRIV : old_mem_type; + new_mem_type = new_mem_type > TTM_PL_PRIV ? TTM_PL_PRIV : new_mem_type; + + mutex_lock(&dev->drmcg_mutex); + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + ddr->mem_stats[old_mem_type] -= move_in_bytes; + ddr->mem_stats[new_mem_type] += move_in_bytes; + + if (evict) + ddr->mem_stats_evict++; + } + mutex_unlock(&dev->drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_mem_track_move);
drm.memory.peak.stats
    A read-only nested-keyed file which exists on all cgroups.
    Each entry is keyed by the drm device's major:minor. The
    following nested keys are defined.

      ====== ==============================================
      system Peak host memory used
      tt     Peak host memory used by the device (GTT/GART)
      vram   Peak Video RAM used by the drm device
      priv   Other drm device specific memory peak usage
      ====== ==============================================
Reading returns the following::
226:0 system=0 tt=0 vram=0 priv=0
226:1 system=0 tt=9035776 vram=17768448 priv=16809984
226:2 system=0 tt=9035776 vram=17768448 priv=16809984
Change-Id: I986e44533848f66411465bdd52105e78105a709a Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- include/linux/cgroup_drm.h | 2 ++ kernel/cgroup/drm.c | 19 +++++++++++++++++++ 2 files changed, 21 insertions(+)
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 4c2794c9333d..9579e2a0b71d 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -20,6 +20,7 @@ enum drmcg_res_type { DRMCG_TYPE_BO_COUNT, DRMCG_TYPE_MEM, DRMCG_TYPE_MEM_EVICT, + DRMCG_TYPE_MEM_PEAK, __DRMCG_TYPE_LAST, };
@@ -37,6 +38,7 @@ struct drmcg_device_resource { s64 bo_stats_count_allocated;
s64 mem_stats[TTM_PL_PRIV+1]; + s64 mem_peaks[TTM_PL_PRIV+1]; s64 mem_stats_evict; };
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 4960a8d1e8f4..899dc44722c3 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -162,6 +162,13 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, case DRMCG_TYPE_MEM_EVICT: seq_printf(sf, "%lld\n", ddr->mem_stats_evict); break; + case DRMCG_TYPE_MEM_PEAK: + for (i = 0; i <= TTM_PL_PRIV; i++) { + seq_printf(sf, "%s=%lld ", ttm_placement_names[i], + ddr->mem_peaks[i]); + } + seq_puts(sf, "\n"); + break; default: seq_puts(sf, "\n"); break; @@ -443,6 +450,12 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM_EVICT, DRMCG_FTYPE_STATS), }, + { + .name = "memory.peak.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM_PEAK, + DRMCG_FTYPE_STATS), + }, { } /* terminate */ };
@@ -617,6 +630,8 @@ void drmcg_chg_mem(struct ttm_buffer_object *tbo) for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { ddr = drmcg->dev_resources[devIdx]; ddr->mem_stats[mem_type] += size; + ddr->mem_peaks[mem_type] = max(ddr->mem_peaks[mem_type], + ddr->mem_stats[mem_type]); } mutex_unlock(&dev->drmcg_mutex); } @@ -668,6 +683,10 @@ void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, ddr->mem_stats[old_mem_type] -= move_in_bytes; ddr->mem_stats[new_mem_type] += move_in_bytes;
+ ddr->mem_peaks[new_mem_type] = max( + ddr->mem_peaks[new_mem_type], + ddr->mem_stats[new_mem_type]); + if (evict) ddr->mem_stats_evict++; }
The bandwidth is measured by keeping track of the number of bytes moved by ttm within a time period. We define two types of bandwidth: burst and average. Average bandwidth is calculated by dividing the total number of bytes moved within a cgroup by the lifetime of the cgroup (avg_bytes_per_us = total_moved_byte / total_accum_us). Burst bandwidth is similar, except that the byte and time measurements are reset after a user-configurable period.
The bandwidth control is best effort since it is done on a per-move basis instead of per byte. The bandwidth is limited by delaying the move of a buffer, so the limit can be exceeded when the next move is larger than the remaining allowance.
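As an illustrative sketch of the accounting described above (field names as in the patch below; locking and the per-period reset of moved_byte are omitted), the per-move admission check is roughly:

	/* Sketch only: credit accrues at the average-bandwidth rate
	 * and is consumed by moves; a move is admitted while both the
	 * burst budget for the current period and the byte credit
	 * remain. */
	static bool bw_can_move_sketch(struct drmcg_device_resource *ddr,
				       s64 now_us)
	{
		s64 elapsed_us = now_us - ddr->mem_bw_stats_last_update_us;

		ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] +=
			ddr->mem_bw_limits_avg_bytes_per_us * elapsed_us;
		ddr->mem_bw_stats_last_update_us = now_us;

		return ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_MOVED] <
				ddr->mem_bw_limits_bytes_in_period &&
		       ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] > 0;
	}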
drm.burst_bw_period_in_us
    A read-write flat-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor.

    Length of the period used to measure burst bandwidth, in us.
    One period per device.

drm.burst_bw_period_in_us.default
    A read-only flat-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor.

    Default length of a period, in us (one per device.)
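An illustrative write in the style of the other knobs (the value is an example only; the write handler below rejects periods shorter than 2000 us):

    Set the burst measurement period for /dev/dri/card1 to 100ms
      echo "226:1 100000" > drm.burst_bw_period_in_us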
drm.bandwidth.stats
    A read-only nested-keyed file which exists on all cgroups. Each
    entry is keyed by the drm device's major:minor. The following
    nested keys are defined.

      ================= ======================================
      burst_byte_per_us Burst bandwidth
      avg_bytes_per_us  Average bandwidth
      moved_byte        Bytes moved within a period
      accum_us          Time accumulated in a period
      total_moved_byte  Bytes moved within the cgroup lifetime
      total_accum_us    Cgroup lifetime in us
      byte_credit       Available byte credit to limit avg bw
      ================= ======================================
Reading returns the following::

    226:1 burst_byte_per_us=23 avg_bytes_per_us=0 moved_byte=2244608 accum_us=95575 total_moved_byte=45899776 total_accum_us=201634590 byte_credit=13214278590464
    226:2 burst_byte_per_us=10 avg_bytes_per_us=219 moved_byte=430080 accum_us=39350 total_moved_byte=65518026752 total_accum_us=298337721 byte_credit=9223372036854644735
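As a worked example from the 226:2 line above, avg_bytes_per_us is total_moved_byte / total_accum_us = 65518026752 / 298337721 ≈ 219; likewise, burst_byte_per_us for 226:1 is moved_byte / accum_us = 2244608 / 95575 ≈ 23.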
drm.bandwidth.high
    A read-write nested-keyed file which exists on all cgroups.
    Each entry is keyed by the drm device's major:minor. The
    following nested keys are defined.

      ================ =======================================
      bytes_in_period  Burst limit per period in bytes
      avg_bytes_per_us Average bandwidth limit in bytes per us
      ================ =======================================
Reading returns the following::
226:1 bytes_in_period=9223372036854775807 avg_bytes_per_us=65536
226:2 bytes_in_period=9223372036854775807 avg_bytes_per_us=65536
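An illustrative write (the values are examples only; memparse shorthand applies to both keys):

    Limit /dev/dri/card1 to a 4MB burst per period and 32768 bytes/us average
      echo "226:1 bytes_in_period=4m avg_bytes_per_us=32768" > drm.bandwidth.high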
drm.bandwidth.default
    A read-only nested-keyed file which exists on the root cgroup.
    Each entry is keyed by the drm device's major:minor. The
    following nested keys are defined.

      ================ ========================================
      bytes_in_period  Default burst limit per period in bytes
      avg_bytes_per_us Default average bw limit in bytes per us
      ================ ========================================
Reading returns the following::
226:1 bytes_in_period=9223372036854775807 avg_bytes_per_us=65536
226:2 bytes_in_period=9223372036854775807 avg_bytes_per_us=65536
Change-Id: Ie573491325ccc16535bb943e7857f43bd0962add Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- drivers/gpu/drm/ttm/ttm_bo.c | 7 + include/drm/drm_cgroup.h | 19 +++ include/linux/cgroup_drm.h | 16 ++ kernel/cgroup/drm.c | 319 ++++++++++++++++++++++++++++++++++- 4 files changed, 359 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index a0e9ce46baf3..32eee85f3641 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -36,6 +36,7 @@ #include <drm/ttm/ttm_placement.h> #include <drm/drm_cgroup.h> #include <linux/jiffies.h> +#include <linux/delay.h> #include <linux/slab.h> #include <linux/sched.h> #include <linux/mm.h> @@ -1256,6 +1257,12 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, * Check whether we need to move buffer. */ if (!ttm_bo_mem_compat(placement, &bo->mem, &new_flags)) { + unsigned int move_delay = drmcg_get_mem_bw_period_in_us(bo); + + move_delay /= 2000; /* check every half period in ms*/ + while (bo->bdev->ddev != NULL && !drmcg_mem_can_move(bo)) + msleep(move_delay); + ret = ttm_bo_move_buffer(bo, placement, ctx); if (ret) return ret; diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index 7d63f73a5375..9ce0d54e6bd8 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -16,6 +16,12 @@ struct drmcg_props {
s64 bo_limits_total_allocated_default; s64 bo_limits_peak_allocated_default; + + s64 mem_bw_limits_period_in_us; + s64 mem_bw_limits_period_in_us_default; + + s64 mem_bw_bytes_in_period_default; + s64 mem_bw_avg_bytes_per_us_default; };
#ifdef CONFIG_CGROUP_DRM @@ -30,6 +36,8 @@ void drmcg_chg_mem(struct ttm_buffer_object *tbo); void drmcg_unchg_mem(struct ttm_buffer_object *tbo); void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, struct ttm_mem_reg *new_mem); +unsigned int drmcg_get_mem_bw_period_in_us(struct ttm_buffer_object *tbo); +bool drmcg_mem_can_move(struct ttm_buffer_object *tbo);
#else static inline void drmcg_device_update(struct drm_device *device) @@ -62,5 +70,16 @@ static inline void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, struct ttm_mem_reg *new_mem) { } + +static inline unsigned int drmcg_get_mem_bw_period_in_us( + struct ttm_buffer_object *tbo) +{ + return 0; +} + +static inline bool drmcg_mem_can_move(struct ttm_buffer_object *tbo) +{ + return true; +} #endif /* CONFIG_CGROUP_DRM */ #endif /* __DRM_CGROUP_H__ */ diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 9579e2a0b71d..27809a583bf2 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -14,6 +14,15 @@ /* limit defined per the way drm_minor_alloc operates */ #define MAX_DRM_DEV (64 * DRM_MINOR_RENDER)
+enum drmcg_mem_bw_attr { + DRMCG_MEM_BW_ATTR_BYTE_MOVED, /* for calculating 'instantaneous' bw */ + DRMCG_MEM_BW_ATTR_ACCUM_US, /* for calculating 'instantaneous' bw */ + DRMCG_MEM_BW_ATTR_TOTAL_BYTE_MOVED, + DRMCG_MEM_BW_ATTR_TOTAL_ACCUM_US, + DRMCG_MEM_BW_ATTR_BYTE_CREDIT, + __DRMCG_MEM_BW_ATTR_LAST, +}; + enum drmcg_res_type { DRMCG_TYPE_BO_TOTAL, DRMCG_TYPE_BO_PEAK, @@ -21,6 +30,8 @@ enum drmcg_res_type { DRMCG_TYPE_MEM, DRMCG_TYPE_MEM_EVICT, DRMCG_TYPE_MEM_PEAK, + DRMCG_TYPE_BANDWIDTH, + DRMCG_TYPE_BANDWIDTH_PERIOD_BURST, __DRMCG_TYPE_LAST, };
@@ -40,6 +51,11 @@ struct drmcg_device_resource { s64 mem_stats[TTM_PL_PRIV+1]; s64 mem_peaks[TTM_PL_PRIV+1]; s64 mem_stats_evict; + + s64 mem_bw_stats_last_update_us; + s64 mem_bw_stats[__DRMCG_MEM_BW_ATTR_LAST]; + s64 mem_bw_limits_bytes_in_period; + s64 mem_bw_limits_avg_bytes_per_us; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 899dc44722c3..ab962a277e58 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -7,6 +7,7 @@ #include <linux/seq_file.h> #include <linux/mutex.h> #include <linux/cgroup_drm.h> +#include <linux/ktime.h> #include <linux/kernel.h> #include <drm/drm_file.h> #include <drm/drm_drv.h> @@ -40,6 +41,17 @@ static char const *ttm_placement_names[] = { [TTM_PL_PRIV] = "priv", };
+static char const *mem_bw_attr_names[] = { + [DRMCG_MEM_BW_ATTR_BYTE_MOVED] = "moved_byte", + [DRMCG_MEM_BW_ATTR_ACCUM_US] = "accum_us", + [DRMCG_MEM_BW_ATTR_TOTAL_BYTE_MOVED] = "total_moved_byte", + [DRMCG_MEM_BW_ATTR_TOTAL_ACCUM_US] = "total_accum_us", + [DRMCG_MEM_BW_ATTR_BYTE_CREDIT] = "byte_credit", +}; + +#define MEM_BW_LIMITS_NAME_AVG "avg_bytes_per_us" +#define MEM_BW_LIMITS_NAME_BURST "bytes_in_period" + static struct drmcg *root_drmcg __read_mostly;
static int drmcg_css_free_fn(int id, void *ptr, void *data) @@ -75,6 +87,9 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev)
if (!ddr) return -ENOMEM; + + ddr->mem_bw_stats_last_update_us = ktime_to_us(ktime_get()); + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_ACCUM_US] = 1; }
mutex_lock(&dev->drmcg_mutex); @@ -87,6 +102,12 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) ddr->bo_limits_peak_allocated = dev->drmcg_props.bo_limits_peak_allocated_default;
+ ddr->mem_bw_limits_bytes_in_period = + dev->drmcg_props.mem_bw_bytes_in_period_default; + + ddr->mem_bw_limits_avg_bytes_per_us = + dev->drmcg_props.mem_bw_avg_bytes_per_us_default; + mutex_unlock(&dev->drmcg_mutex); return 0; } @@ -133,6 +154,26 @@ drmcg_css_alloc(struct cgroup_subsys_state *parent_css) return &drmcg->css; }
+static inline void drmcg_mem_burst_bw_stats_reset(struct drm_device *dev) +{ + struct cgroup_subsys_state *pos; + struct drmcg *node; + struct drmcg_device_resource *ddr; + int devIdx; + + devIdx = dev->primary->index; + + rcu_read_lock(); + css_for_each_descendant_pre(pos, &root_drmcg->css) { + node = css_to_drmcg(pos); + ddr = node->dev_resources[devIdx]; + + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_ACCUM_US] = 1; + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_MOVED] = 0; + } + rcu_read_unlock(); +} + static void drmcg_print_stats(struct drmcg_device_resource *ddr, struct seq_file *sf, enum drmcg_res_type type) { @@ -169,6 +210,31 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, } seq_puts(sf, "\n"); break; + case DRMCG_TYPE_BANDWIDTH: + if (ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_ACCUM_US] == 0) + seq_puts(sf, "burst_byte_per_us=NaN "); + else + seq_printf(sf, "burst_byte_per_us=%lld ", + ddr->mem_bw_stats[ + DRMCG_MEM_BW_ATTR_BYTE_MOVED]/ + ddr->mem_bw_stats[ + DRMCG_MEM_BW_ATTR_ACCUM_US]); + + if (ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_TOTAL_ACCUM_US] == 0) + seq_puts(sf, "avg_bytes_per_us=NaN "); + else + seq_printf(sf, "avg_bytes_per_us=%lld ", + ddr->mem_bw_stats[ + DRMCG_MEM_BW_ATTR_TOTAL_BYTE_MOVED]/ + ddr->mem_bw_stats[ + DRMCG_MEM_BW_ATTR_TOTAL_ACCUM_US]); + + for (i = 0; i < __DRMCG_MEM_BW_ATTR_LAST; i++) { + seq_printf(sf, "%s=%lld ", mem_bw_attr_names[i], + ddr->mem_bw_stats[i]); + } + seq_puts(sf, "\n"); + break; default: seq_puts(sf, "\n"); break; @@ -176,7 +242,8 @@ static void drmcg_print_stats(struct drmcg_device_resource *ddr, }
static void drmcg_print_limits(struct drmcg_device_resource *ddr, - struct seq_file *sf, enum drmcg_res_type type) + struct seq_file *sf, enum drmcg_res_type type, + struct drm_device *dev) { if (ddr == NULL) { seq_puts(sf, "\n"); @@ -190,6 +257,17 @@ static void drmcg_print_limits(struct drmcg_device_resource *ddr, case DRMCG_TYPE_BO_PEAK: seq_printf(sf, "%lld\n", ddr->bo_limits_peak_allocated); break; + case DRMCG_TYPE_BANDWIDTH_PERIOD_BURST: + seq_printf(sf, "%lld\n", + dev->drmcg_props.mem_bw_limits_period_in_us); + break; + case DRMCG_TYPE_BANDWIDTH: + seq_printf(sf, "%s=%lld %s=%lld\n", + MEM_BW_LIMITS_NAME_BURST, + ddr->mem_bw_limits_bytes_in_period, + MEM_BW_LIMITS_NAME_AVG, + ddr->mem_bw_limits_avg_bytes_per_us); + break; default: seq_puts(sf, "\n"); break; @@ -208,6 +286,17 @@ static void drmcg_print_default(struct drmcg_props *props, seq_printf(sf, "%lld\n", props->bo_limits_peak_allocated_default); break; + case DRMCG_TYPE_BANDWIDTH_PERIOD_BURST: + seq_printf(sf, "%lld\n", + props->mem_bw_limits_period_in_us_default); + break; + case DRMCG_TYPE_BANDWIDTH: + seq_printf(sf, "%s=%lld %s=%lld\n", + MEM_BW_LIMITS_NAME_BURST, + props->mem_bw_bytes_in_period_default, + MEM_BW_LIMITS_NAME_AVG, + props->mem_bw_avg_bytes_per_us_default); + break; default: seq_puts(sf, "\n"); break; @@ -237,7 +326,7 @@ static int drmcg_seq_show_fn(int id, void *ptr, void *data) drmcg_print_stats(ddr, sf, type); break; case DRMCG_FTYPE_LIMIT: - drmcg_print_limits(ddr, sf, type); + drmcg_print_limits(ddr, sf, type, minor->dev); break; case DRMCG_FTYPE_DEFAULT: drmcg_print_default(&minor->dev->drmcg_props, sf, type); @@ -301,6 +390,83 @@ static void drmcg_value_apply(struct drm_device *dev, s64 *dst, s64 val) mutex_unlock(&dev->drmcg_mutex); }
+static void drmcg_nested_limit_parse(struct kernfs_open_file *of, + struct drm_device *dev, char *attrs) +{ + enum drmcg_res_type type = + DRMCG_CTF_PRIV2RESTYPE(of_cft(of)->private); + struct drmcg *drmcg = css_to_drmcg(of_css(of)); + struct drmcg *parent = drmcg_parent(drmcg); + struct drmcg_props *props = &dev->drmcg_props; + char *cft_name = of_cft(of)->name; + int minor = dev->primary->index; + char *nested = strstrip(attrs); + struct drmcg_device_resource *ddr = + drmcg->dev_resources[minor]; + char *attr; + char sname[256]; + char sval[256]; + s64 val; + s64 p_max; + int rc; + + while (nested != NULL) { + attr = strsep(&nested, " "); + + if (sscanf(attr, "%255[^=]=%255[^=]", sname, sval) != 2) + continue; + + switch (type) { + case DRMCG_TYPE_BANDWIDTH: + if (strncmp(sname, MEM_BW_LIMITS_NAME_BURST, 256) + == 0) { + p_max = parent == NULL ? S64_MAX : + parent->dev_resources[minor]-> + mem_bw_limits_bytes_in_period; + + rc = drmcg_process_limit_s64_val(sval, true, + props->mem_bw_bytes_in_period_default, + p_max, &val); + + if (rc || val < 0) { + drmcg_pr_cft_err(drmcg, rc, cft_name, + minor); + continue; + } + + drmcg_value_apply(dev, + &ddr->mem_bw_limits_bytes_in_period, + val); + continue; + } + + if (strncmp(sname, MEM_BW_LIMITS_NAME_AVG, 256) == 0) { + p_max = parent == NULL ? S64_MAX : + parent->dev_resources[minor]-> + mem_bw_limits_avg_bytes_per_us; + + rc = drmcg_process_limit_s64_val(sval, true, + props->mem_bw_avg_bytes_per_us_default, + p_max, &val); + + if (rc || val < 0) { + drmcg_pr_cft_err(drmcg, rc, cft_name, + minor); + continue; + } + + drmcg_value_apply(dev, + &ddr->mem_bw_limits_avg_bytes_per_us, + val); + continue; + } + break; /* DRMCG_TYPE_BANDWIDTH */ + default: + break; + } /* switch (type) */ + } +} + static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { @@ -382,6 +548,25 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, drmcg_value_apply(dm->dev, &ddr->bo_limits_peak_allocated, val); break; + case DRMCG_TYPE_BANDWIDTH_PERIOD_BURST: + rc = drmcg_process_limit_s64_val(sattr, false, + props->mem_bw_limits_period_in_us_default, + S64_MAX, + &val); + + if (rc || val < 2000) { + drmcg_pr_cft_err(drmcg, rc, cft_name, minor); + break; + } + + drmcg_value_apply(dm->dev, + &props->mem_bw_limits_period_in_us, + val); + drmcg_mem_burst_bw_stats_reset(dm->dev); + break; + case DRMCG_TYPE_BANDWIDTH: + drmcg_nested_limit_parse(of, dm->dev, sattr); + break; default: break; } @@ -456,6 +641,41 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM_PEAK, DRMCG_FTYPE_STATS), }, + { + .name = "burst_bw_period_in_us", + .write = drmcg_limit_write, + .seq_show = drmcg_seq_show, + .flags = CFTYPE_ONLY_ON_ROOT, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH_PERIOD_BURST, + DRMCG_FTYPE_LIMIT), + }, + { + .name = "burst_bw_period_in_us.default", + .seq_show = drmcg_seq_show, + .flags = CFTYPE_ONLY_ON_ROOT, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH_PERIOD_BURST, + DRMCG_FTYPE_DEFAULT), + }, + { + .name = "bandwidth.stats", + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH, + DRMCG_FTYPE_STATS), + }, + { + .name = "bandwidth.high", + .write = drmcg_limit_write, + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH, + DRMCG_FTYPE_LIMIT), + }, + { + .name = "bandwidth.default", + .seq_show = drmcg_seq_show, + .flags = CFTYPE_ONLY_ON_ROOT, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH, + DRMCG_FTYPE_DEFAULT), + }, { } /* terminate 
*/ };
@@ -515,6 +735,10 @@ void drmcg_device_early_init(struct drm_device *dev)
dev->drmcg_props.bo_limits_total_allocated_default = S64_MAX; dev->drmcg_props.bo_limits_peak_allocated_default = S64_MAX; + dev->drmcg_props.mem_bw_limits_period_in_us_default = 200000; + dev->drmcg_props.mem_bw_limits_period_in_us = 200000; + dev->drmcg_props.mem_bw_bytes_in_period_default = S64_MAX; + dev->drmcg_props.mem_bw_avg_bytes_per_us_default = 65536;
drmcg_update_cg_tree(dev); } @@ -660,6 +884,27 @@ void drmcg_unchg_mem(struct ttm_buffer_object *tbo) } EXPORT_SYMBOL(drmcg_unchg_mem);
+static inline void drmcg_mem_bw_accum(s64 time_us, + struct drmcg_device_resource *ddr) +{ + s64 increment_us = time_us - ddr->mem_bw_stats_last_update_us; + s64 new_credit = ddr->mem_bw_limits_avg_bytes_per_us * increment_us; + + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_ACCUM_US] + += increment_us; + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_TOTAL_ACCUM_US] + += increment_us; + + if ((S64_MAX - new_credit) > + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT]) + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] + += new_credit; + else + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] = S64_MAX; + + ddr->mem_bw_stats_last_update_us = time_us; +} + void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, struct ttm_mem_reg *new_mem) { @@ -669,6 +914,7 @@ void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, int devIdx = dev->primary->index; int old_mem_type = old_bo->mem.mem_type; int new_mem_type = new_mem->mem_type; + s64 time_us; struct drmcg_device_resource *ddr;
if (drmcg == NULL) @@ -677,6 +923,14 @@ void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, old_mem_type = old_mem_type > TTM_PL_PRIV ? TTM_PL_PRIV : old_mem_type; new_mem_type = new_mem_type > TTM_PL_PRIV ? TTM_PL_PRIV : new_mem_type;
+ if (root_drmcg->dev_resources[devIdx] != NULL && + root_drmcg->dev_resources[devIdx]-> + mem_bw_stats[DRMCG_MEM_BW_ATTR_ACCUM_US] >= + dev->drmcg_props.mem_bw_limits_period_in_us) + drmcg_mem_burst_bw_stats_reset(dev); + + time_us = ktime_to_us(ktime_get()); + mutex_lock(&dev->drmcg_mutex); for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { ddr = drmcg->dev_resources[devIdx]; @@ -689,7 +943,68 @@ void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict,
if (evict) ddr->mem_stats_evict++; + + drmcg_mem_bw_accum(time_us, ddr); + + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_MOVED] + += move_in_bytes; + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_TOTAL_BYTE_MOVED] + += move_in_bytes; + + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] + -= move_in_bytes; } mutex_unlock(&dev->drmcg_mutex); } EXPORT_SYMBOL(drmcg_mem_track_move); + +unsigned int drmcg_get_mem_bw_period_in_us(struct ttm_buffer_object *tbo) +{ + struct drmcg_props *props; + + //TODO replace with BUG_ON + if (tbo->bdev->ddev == NULL) + return 0; + + props = &tbo->bdev->ddev->drmcg_props; + + return (unsigned int) props->mem_bw_limits_period_in_us; +} +EXPORT_SYMBOL(drmcg_get_mem_bw_period_in_us); + +bool drmcg_mem_can_move(struct ttm_buffer_object *tbo) +{ + struct drm_device *dev = tbo->bdev->ddev; + struct drmcg *drmcg = tbo->drmcg; + int devIdx = dev->primary->index; + s64 time_us; + struct drmcg_device_resource *ddr; + bool result = true; + + if (root_drmcg->dev_resources[devIdx] != NULL && + root_drmcg->dev_resources[devIdx]-> + mem_bw_stats[DRMCG_MEM_BW_ATTR_ACCUM_US] >= + dev->drmcg_props.mem_bw_limits_period_in_us) + drmcg_mem_burst_bw_stats_reset(dev); + + time_us = ktime_to_us(ktime_get()); + + mutex_lock(&dev->drmcg_mutex); + for ( ; drmcg != NULL; drmcg = drmcg_parent(drmcg)) { + ddr = drmcg->dev_resources[devIdx]; + + drmcg_mem_bw_accum(time_us, ddr); + + if (result && + (ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_MOVED] + >= ddr->mem_bw_limits_bytes_in_period || + ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] + <= 0)) { + result = false; + } + } + mutex_unlock(&dev->drmcg_mutex); + + return result; +} +EXPORT_SYMBOL(drmcg_mem_can_move);
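The bandwidth control above is essentially a token bucket: drmcg_mem_bw_accum() grants byte credit at mem_bw_limits_avg_bytes_per_us for each elapsed microsecond (saturating at S64_MAX), each tracked move deducts its size, and drmcg_mem_can_move() refuses further moves once either the per-period burst budget is spent or the credit is exhausted (the burst counters are reset each period via drmcg_mem_burst_bw_stats_reset). A self-contained sketch of the same arithmetic, with hypothetical names and omitting locking and the per-period reset — illustrative only, not the kernel code:

#include <stdbool.h>
#include <stdint.h>

struct bw_bucket {
	int64_t avg_bytes_per_us;   /* refill rate; series default 65536   */
	int64_t burst_bytes;        /* bytes allowed per period            */
	int64_t credit;             /* accumulated byte credit             */
	int64_t moved_in_period;    /* bytes moved in the current period   */
	int64_t last_update_us;
};

/* Mirror of drmcg_mem_bw_accum(): grant credit for elapsed time. */
void bw_accum(struct bw_bucket *b, int64_t now_us)
{
	b->credit += b->avg_bytes_per_us * (now_us - b->last_update_us);
	b->last_update_us = now_us;
}

/* Mirror of the drmcg_mem_can_move() check: allow a move only while
 * both the burst budget and the average-rate credit remain.
 */
bool bw_can_move(struct bw_bucket *b, int64_t now_us)
{
	bw_accum(b, now_us);
	return b->moved_in_period < b->burst_bytes && b->credit > 0;
}

/* Mirror of the charging in drmcg_mem_track_move(). */
void bw_charge(struct bw_bucket *b, int64_t bytes)
{
	b->moved_in_period += bytes;
	b->credit -= bytes;
}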
Hi.
On Thu, Aug 29, 2019 at 02:05:28AM -0400, Kenny Ho Kenny.Ho@amd.com wrote:
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1256,6 +1257,12 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, [...]
move_delay /= 2000; /* check every half period in ms*/
[...] diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c [...] @@ -382,6 +548,25 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, [...]
if (rc || val < 2000) {
This just caught my eye and it may simply be caused by the RFC-ness of the series, but I'd suggest turning this into a constant with a descriptive name.
My 2 cents, Michal
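For illustration, the change Michal suggests might look like this (the constant name is hypothetical):

/* Minimum accepted burst-bandwidth sampling period, in microseconds
 * (hypothetical name, illustrating the suggestion above).
 */
#define DRMCG_MEM_BW_PERIOD_MIN_US	2000

		if (rc || val < DRMCG_MEM_BW_PERIOD_MIN_US) {
			drmcg_pr_cft_err(drmcg, rc, cft_name, minor);
			break;
		}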
The drm resources being limited are TTM (Translation Table Manager) buffers. TTM manages the different types of memory that a GPU might access. These memory types include dedicated Video RAM (VRAM) and host/system memory accessible through IOMMU (GART/GTT). TTM is currently used by multiple drm drivers (amd, ast, bochs, cirrus, hisilicon, mgag200, nouveau, qxl, virtio, vmwgfx).
Under memory pressure, TTM buffers belonging to drm cgroups that have exceeded their high limit will be selected for eviction first.
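For orientation, the memory types above map to TTM placement slots; the per-cgroup arrays in this series are sized TTM_PL_PRIV + 1 so each type gets one slot. A sketch of the relevant defines (values as in include/drm/ttm/ttm_placement.h of kernels from this era):

#define TTM_PL_SYSTEM	0	/* cacheable system memory            */
#define TTM_PL_TT	1	/* GPU-mappable system memory (GART/GTT) */
#define TTM_PL_VRAM	2	/* dedicated video RAM                */
#define TTM_PL_PRIV	3	/* first driver-private placement     */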
drm.memory.high
A read-write nested-keyed file which exists on all cgroups.
Each entry is keyed by the drm device's major:minor.
The following nested keys are defined.
==== =============================================
vram Video RAM soft limit for a drm device in byte
==== =============================================
Reading returns the following::
226:0 vram=0
226:1 vram=17768448
226:2 vram=17768448
drm.memory.default
A read-only nested-keyed file which exists on the root cgroup.
Each entry is keyed by the drm device's major:minor.
The following nested keys are defined.
==== ===============================
vram Video RAM default limit in byte
==== ===============================
Reading returns the following::
226:0 vram=0
226:1 vram=17768448
226:2 vram=17768448
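A hypothetical usage sketch (device numbers and sizes illustrative), matching the interface described above:

# Cap VRAM for this cgroup on /dev/dri/card0 (major:minor 226:0)
# at 256 MiB, expressed in bytes.
echo "226:0 vram=268435456" > drm.memory.high

# Read back the per-device soft limits
cat drm.memory.high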
Change-Id: I7988e28a453b53140b40a28c176239acbc81d491 Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- drivers/gpu/drm/ttm/ttm_bo.c | 7 ++ include/drm/drm_cgroup.h | 17 +++++ include/linux/cgroup_drm.h | 2 + kernel/cgroup/drm.c | 135 +++++++++++++++++++++++++++++++++++ 4 files changed, 161 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 32eee85f3641..d7e3d3128ebb 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -853,14 +853,21 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev, struct ttm_bo_global *glob = bdev->glob; struct ttm_mem_type_manager *man = &bdev->man[mem_type]; bool locked = false; + bool check_drmcg; unsigned i; int ret;
+ check_drmcg = drmcg_mem_pressure_scan(bdev, mem_type); + spin_lock(&glob->lru_lock); for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) { list_for_each_entry(bo, &man->lru[i], lru) { bool busy;
+ if (check_drmcg && + !drmcg_mem_should_evict(bo, mem_type)) + continue; + if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked, &busy)) { if (busy && !busy_bo && diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index 9ce0d54e6bd8..c11df388fdf2 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -6,6 +6,7 @@
#include <linux/cgroup_drm.h> #include <drm/ttm/ttm_bo_api.h> +#include <drm/ttm/ttm_bo_driver.h>
/** * Per DRM device properties for DRM cgroup controller for the purpose @@ -22,6 +23,8 @@ struct drmcg_props {
s64 mem_bw_bytes_in_period_default; s64 mem_bw_avg_bytes_per_us_default; + + s64 mem_highs_default[TTM_PL_PRIV+1]; };
#ifdef CONFIG_CGROUP_DRM @@ -38,6 +41,8 @@ void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict, struct ttm_mem_reg *new_mem); unsigned int drmcg_get_mem_bw_period_in_us(struct ttm_buffer_object *tbo); bool drmcg_mem_can_move(struct ttm_buffer_object *tbo); +bool drmcg_mem_pressure_scan(struct ttm_bo_device *bdev, unsigned int type); +bool drmcg_mem_should_evict(struct ttm_buffer_object *tbo, unsigned int type);
#else static inline void drmcg_device_update(struct drm_device *device) @@ -81,5 +86,17 @@ static inline bool drmcg_mem_can_move(struct ttm_buffer_object *tbo) { return true; } + +static inline bool drmcg_mem_pressure_scan(struct ttm_bo_device *bdev, + unsigned int type) +{ + return false; +} + +static inline bool drmcg_mem_should_evict(struct ttm_buffer_object *tbo, + unsigned int type) +{ + return true; +} #endif /* CONFIG_CGROUP_DRM */ #endif /* __DRM_CGROUP_H__ */ diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index 27809a583bf2..c56cfe74d1a6 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -50,6 +50,8 @@ struct drmcg_device_resource {
s64 mem_stats[TTM_PL_PRIV+1]; s64 mem_peaks[TTM_PL_PRIV+1]; + s64 mem_highs[TTM_PL_PRIV+1]; + bool mem_pressure[TTM_PL_PRIV+1]; s64 mem_stats_evict;
s64 mem_bw_stats_last_update_us; diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index ab962a277e58..04fb9a398740 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -80,6 +80,7 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) { int minor = dev->primary->index; struct drmcg_device_resource *ddr = drmcg->dev_resources[minor]; + int i;
if (ddr == NULL) { ddr = kzalloc(sizeof(struct drmcg_device_resource), @@ -108,6 +109,12 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) ddr->mem_bw_limits_avg_bytes_per_us = dev->drmcg_props.mem_bw_avg_bytes_per_us_default;
+ ddr->mem_bw_limits_avg_bytes_per_us = + dev->drmcg_props.mem_bw_avg_bytes_per_us_default; + + for (i = 0; i <= TTM_PL_PRIV; i++) + ddr->mem_highs[i] = dev->drmcg_props.mem_highs_default[i]; + mutex_unlock(&dev->drmcg_mutex); return 0; } @@ -257,6 +264,11 @@ static void drmcg_print_limits(struct drmcg_device_resource *ddr, case DRMCG_TYPE_BO_PEAK: seq_printf(sf, "%lld\n", ddr->bo_limits_peak_allocated); break; + case DRMCG_TYPE_MEM: + seq_printf(sf, "%s=%lld\n", + ttm_placement_names[TTM_PL_VRAM], + ddr->mem_highs[TTM_PL_VRAM]); + break; case DRMCG_TYPE_BANDWIDTH_PERIOD_BURST: seq_printf(sf, "%lld\n", dev->drmcg_props.mem_bw_limits_period_in_us); @@ -286,6 +298,11 @@ static void drmcg_print_default(struct drmcg_props *props, seq_printf(sf, "%lld\n", props->bo_limits_peak_allocated_default); break; + case DRMCG_TYPE_MEM: + seq_printf(sf, "%s=%lld\n", + ttm_placement_names[TTM_PL_VRAM], + props->mem_highs_default[TTM_PL_VRAM]); + break; case DRMCG_TYPE_BANDWIDTH_PERIOD_BURST: seq_printf(sf, "%lld\n", props->mem_bw_limits_period_in_us_default); @@ -461,6 +478,29 @@ static void drmcg_nested_limit_parse(struct kernfs_open_file *of, continue; } break; /* DRMCG_TYPE_BANDWIDTH */ + case DRMCG_TYPE_MEM: + if (strncmp(sname, ttm_placement_names[TTM_PL_VRAM], + 256) == 0) { + p_max = parent == NULL ? S64_MAX : + parent->dev_resources[minor]-> + mem_highs[TTM_PL_VRAM]; + + rc = drmcg_process_limit_s64_val(sval, true, + props->mem_highs_default[TTM_PL_VRAM], + p_max, &val); + + if (rc || val < 0) { + drmcg_pr_cft_err(drmcg, rc, cft_name, + minor); + continue; + } + + drmcg_value_apply(dev, + &ddr->mem_highs[TTM_PL_VRAM], + val); + continue; + } + break; /* DRMCG_TYPE_MEM */ default: break; } /* switch (type) */ @@ -565,6 +605,7 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf, drmcg_mem_burst_bw_stats_reset(dm->dev); break; case DRMCG_TYPE_BANDWIDTH: + case DRMCG_TYPE_MEM: drmcg_nested_limit_parse(of, dm->dev, sattr); break; default: @@ -641,6 +682,20 @@ struct cftype files[] = { .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM_PEAK, DRMCG_FTYPE_STATS), }, + { + .name = "memory.default", + .seq_show = drmcg_seq_show, + .flags = CFTYPE_ONLY_ON_ROOT, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM, + DRMCG_FTYPE_DEFAULT), + }, + { + .name = "memory.high", + .write = drmcg_limit_write, + .seq_show = drmcg_seq_show, + .private = DRMCG_CTF_PRIV(DRMCG_TYPE_MEM, + DRMCG_FTYPE_LIMIT), + }, { .name = "burst_bw_period_in_us", .write = drmcg_limit_write, @@ -731,6 +786,8 @@ EXPORT_SYMBOL(drmcg_device_update); */ void drmcg_device_early_init(struct drm_device *dev) { + int i; + dev->drmcg_props.limit_enforced = false;
dev->drmcg_props.bo_limits_total_allocated_default = S64_MAX; @@ -740,6 +797,9 @@ void drmcg_device_early_init(struct drm_device *dev) dev->drmcg_props.mem_bw_bytes_in_period_default = S64_MAX; dev->drmcg_props.mem_bw_avg_bytes_per_us_default = 65536;
+ for (i = 0; i <= TTM_PL_PRIV; i++) + dev->drmcg_props.mem_highs_default[i] = S64_MAX; + drmcg_update_cg_tree(dev); } EXPORT_SYMBOL(drmcg_device_early_init); @@ -1008,3 +1068,78 @@ bool drmcg_mem_can_move(struct ttm_buffer_object *tbo) return result; } EXPORT_SYMBOL(drmcg_mem_can_move); + +static inline void drmcg_mem_set_pressure(struct drmcg *drmcg, + int devIdx, unsigned int mem_type, bool pressure_val) +{ + struct drmcg_device_resource *ddr; + struct cgroup_subsys_state *pos; + struct drmcg *node; + + css_for_each_descendant_pre(pos, &drmcg->css) { + node = css_to_drmcg(pos); + ddr = node->dev_resources[devIdx]; + ddr->mem_pressure[mem_type] = pressure_val; + } +} + +static inline bool drmcg_mem_check(struct drmcg *drmcg, int devIdx, + unsigned int mem_type) +{ + struct drmcg_device_resource *ddr = drmcg->dev_resources[devIdx]; + + /* already under pressure, no need to check and set */ + if (ddr->mem_pressure[mem_type]) + return true; + + if (ddr->mem_stats[mem_type] >= ddr->mem_highs[mem_type]) { + drmcg_mem_set_pressure(drmcg, devIdx, mem_type, true); + return true; + } + + return false; +} + +bool drmcg_mem_pressure_scan(struct ttm_bo_device *bdev, unsigned int type) +{ + struct drm_device *dev = bdev->ddev; + struct cgroup_subsys_state *pos; + struct drmcg *node; + int devIdx; + bool result = false; + + //TODO replace with BUG_ON + if (dev == NULL || type != TTM_PL_VRAM) /* only vram limit for now */ + return false; + + devIdx = dev->primary->index; + + type = type > TTM_PL_PRIV ? TTM_PL_PRIV : type; + + rcu_read_lock(); + drmcg_mem_set_pressure(root_drmcg, devIdx, type, false); + + css_for_each_descendant_pre(pos, &root_drmcg->css) { + node = css_to_drmcg(pos); + result |= drmcg_mem_check(node, devIdx, type); + } + rcu_read_unlock(); + + return result; +} +EXPORT_SYMBOL(drmcg_mem_pressure_scan); + +bool drmcg_mem_should_evict(struct ttm_buffer_object *tbo, unsigned int type) +{ + struct drm_device *dev = tbo->bdev->ddev; + int devIdx; + + //TODO replace with BUG_ON + if (dev == NULL) + return true; + + devIdx = dev->primary->index; + + return tbo->drmcg->dev_resources[devIdx]->mem_pressure[type]; +} +EXPORT_SYMBOL(drmcg_mem_should_evict);
Allow the DRM TTM memory manager to register a work_struct so that, when a drmcg is under memory pressure, memory reclaim can be triggered immediately.
Change-Id: I25ac04e2db9c19ff12652b88ebff18b44b2706d8 Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- drivers/gpu/drm/ttm/ttm_bo.c | 49 +++++++++++++++++++++++++++++++++ include/drm/drm_cgroup.h | 16 +++++++++++ include/drm/ttm/ttm_bo_driver.h | 2 ++ kernel/cgroup/drm.c | 30 ++++++++++++++++++++ 4 files changed, 97 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index d7e3d3128ebb..72efae694b7e 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1590,6 +1590,46 @@ int ttm_bo_evict_mm(struct ttm_bo_device *bdev, unsigned mem_type) } EXPORT_SYMBOL(ttm_bo_evict_mm);
+static void ttm_bo_reclaim_wq(struct work_struct *work) +{ + struct ttm_operation_ctx ctx = { + .interruptible = false, + .no_wait_gpu = false, + .flags = TTM_OPT_FLAG_FORCE_ALLOC + }; + struct ttm_mem_type_manager *man = + container_of(work, struct ttm_mem_type_manager, reclaim_wq); + struct ttm_bo_device *bdev = man->bdev; + struct dma_fence *fence; + int mem_type; + int ret; + + for (mem_type = 0; mem_type < TTM_NUM_MEM_TYPES; mem_type++) + if (&bdev->man[mem_type] == man) + break; + + WARN_ON(mem_type >= TTM_NUM_MEM_TYPES); + if (mem_type >= TTM_NUM_MEM_TYPES) + return; + + if (!drmcg_mem_pressure_scan(bdev, mem_type)) + return; + + ret = ttm_mem_evict_first(bdev, mem_type, NULL, &ctx, NULL); + if (ret) + return; + + spin_lock(&man->move_lock); + fence = dma_fence_get(man->move); + spin_unlock(&man->move_lock); + + if (fence) { + ret = dma_fence_wait(fence, false); + dma_fence_put(fence); + } + +} + int ttm_bo_init_mm(struct ttm_bo_device *bdev, unsigned type, unsigned long p_size) { @@ -1624,6 +1664,13 @@ int ttm_bo_init_mm(struct ttm_bo_device *bdev, unsigned type, INIT_LIST_HEAD(&man->lru[i]); man->move = NULL;
+ pr_err("drmcg %p type %d\n", bdev->ddev, type); + + if (type <= TTM_PL_VRAM) { + INIT_WORK(&man->reclaim_wq, ttm_bo_reclaim_wq); + drmcg_register_device_mm(bdev->ddev, type, &man->reclaim_wq); + } + return 0; } EXPORT_SYMBOL(ttm_bo_init_mm); @@ -1701,6 +1748,8 @@ int ttm_bo_device_release(struct ttm_bo_device *bdev) man = &bdev->man[i]; if (man->has_type) { man->use_type = false; + drmcg_unregister_device_mm(bdev->ddev, i); + cancel_work_sync(&man->reclaim_wq); if ((i != TTM_PL_SYSTEM) && ttm_bo_clean_mm(bdev, i)) { ret = -EBUSY; pr_err("DRM memory manager type %d is not clean\n", diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index c11df388fdf2..6d9707e1eb72 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -5,6 +5,7 @@ #define __DRM_CGROUP_H__
#include <linux/cgroup_drm.h> +#include <linux/workqueue.h> #include <drm/ttm/ttm_bo_api.h> #include <drm/ttm/ttm_bo_driver.h>
@@ -25,12 +26,17 @@ struct drmcg_props { s64 mem_bw_avg_bytes_per_us_default;
s64 mem_highs_default[TTM_PL_PRIV+1]; + + struct work_struct *mem_reclaim_wq[TTM_PL_PRIV]; };
#ifdef CONFIG_CGROUP_DRM
void drmcg_device_update(struct drm_device *device); void drmcg_device_early_init(struct drm_device *device); +void drmcg_register_device_mm(struct drm_device *dev, unsigned int type, + struct work_struct *wq); +void drmcg_unregister_device_mm(struct drm_device *dev, unsigned int type); bool drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size); void drmcg_unchg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, @@ -53,6 +59,16 @@ static inline void drmcg_device_early_init(struct drm_device *device) { }
+static inline void drmcg_register_device_mm(struct drm_device *dev, + unsigned int type, struct work_struct *wq) +{ +} + +static inline void drmcg_unregister_device_mm(struct drm_device *dev, + unsigned int type) +{ +} + static inline void drmcg_try_chg_bo_alloc(struct drmcg *drmcg, struct drm_device *dev, size_t size) { diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index e1a805d65b83..529cef92bcf6 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -205,6 +205,8 @@ struct ttm_mem_type_manager { * Protected by @move_lock. */ struct dma_fence *move; + + struct work_struct reclaim_wq; };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 04fb9a398740..0ea7f0619e25 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -804,6 +804,29 @@ void drmcg_device_early_init(struct drm_device *dev) } EXPORT_SYMBOL(drmcg_device_early_init);
+void drmcg_register_device_mm(struct drm_device *dev, unsigned int type, + struct work_struct *wq) +{ + if (dev == NULL || type >= TTM_PL_PRIV) + return; + + mutex_lock(&drmcg_mutex); + dev->drmcg_props.mem_reclaim_wq[type] = wq; + mutex_unlock(&drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_register_device_mm); + +void drmcg_unregister_device_mm(struct drm_device *dev, unsigned int type) +{ + if (dev == NULL || type >= TTM_PL_PRIV) + return; + + mutex_lock(&drmcg_mutex); + dev->drmcg_props.mem_reclaim_wq[type] = NULL; + mutex_unlock(&drmcg_mutex); +} +EXPORT_SYMBOL(drmcg_unregister_device_mm); + /** * drmcg_try_chg_bo_alloc - charge GEM buffer usage for a device and cgroup * @drmcg: the DRM cgroup to be charged to @@ -1013,6 +1036,13 @@ void drmcg_mem_track_move(struct ttm_buffer_object *old_bo, bool evict,
ddr->mem_bw_stats[DRMCG_MEM_BW_ATTR_BYTE_CREDIT] -= move_in_bytes; + + if (dev->drmcg_props.mem_reclaim_wq[new_mem_type] + != NULL && + ddr->mem_stats[new_mem_type] > + ddr->mem_highs[new_mem_type]) + schedule_work(dev-> + drmcg_props.mem_reclaim_wq[new_mem_type]); } mutex_unlock(&dev->drmcg_mutex); }
Am 29.08.19 um 08:05 schrieb Kenny Ho:
Allow DRM TTM memory manager to register a work_struct, such that, when a drmcgrp is under memory pressure, memory reclaiming can be triggered immediately.
+static void ttm_bo_reclaim_wq(struct work_struct *work)
[...]
+	if (fence) {
+		ret = dma_fence_wait(fence, false);
+		dma_fence_put(fence);
+	}
Why do you want to block for the fence here? That is a rather bad idea and would break pipe-lining.
Apart from that I don't think we should put that into TTM.
Instead drmcg_register_device_mm() should get a function pointer which is called from a work item when the group is under pressure.
TTM can then provide the function which can be called, but the actual registration is the job of the device and not TTM.
Regards, Christian.
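A minimal sketch of the interface shape Christian describes (hypothetical names, not part of the posted series): the device, rather than TTM, registers a reclaim callback, and the cgroup core invokes it from a work item it schedules itself when a group is under pressure.

/* Hypothetical sketch of the suggested registration -- the device owns
 * the registration; TTM merely supplies a suitable callback.
 */
typedef void (*drmcg_reclaim_fn)(struct drm_device *dev,
				 unsigned int mem_type);

void drmcg_register_device_mm(struct drm_device *dev, unsigned int type,
			      drmcg_reclaim_fn fn);

/* The cgroup core would then call fn(dev, type) from its own work item
 * when a drmcg crosses its high limit, instead of scheduling a
 * work_struct embedded in ttm_mem_type_manager.
 */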
Thanks for the feedback Christian. I am still digging into this one. Daniel suggested leveraging the Shrinker API for the functionality of this commit in RFC v3, but I am still trying to figure out how/if ttm fits with the shrinker (though the idea behind the shrinker API seems fairly straightforward as far as I understand it currently).
Regards, Kenny
Yeah, that's also a really good idea.
The problem with the shrinker API is that it only applies to system memory currently.
So you won't have a way to distinguish which domain you need to evict stuff from.
Regards, Christian.
Yes, and I think it has quite a lot of coupling with mm's page and pressure mechanisms. My current thought is to just copy the API but have a separate implementation of "ttm_shrinker" and "ttm_shrinker_control" or something like that. I am certainly happy to listen to additional feedback and suggestions.
Regards, Kenny
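For context, the hooks Kenny is referring to: the core shrinker API of this era pairs a count callback with a scan callback. A hypothetical "ttm_shrinker" mirroring that shape, but parameterised by TTM memory domain so VRAM and GTT can be reclaimed independently, might look like this (names and layout illustrative only):

/* Hypothetical sketch modelled on struct shrinker's
 * count_objects()/scan_objects() pair from <linux/shrinker.h>,
 * extended with a TTM placement type.
 */
struct shrink_control;

struct ttm_shrinker {
	unsigned int mem_type;	/* TTM placement to reclaim from */
	unsigned long (*count_objects)(struct ttm_shrinker *s,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct ttm_shrinker *s,
				      struct shrink_control *sc);
};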
drm.lgpu
A read-write nested-keyed file which exists on all cgroups.
Each entry is keyed by the DRM device's major:minor.
lgpu stands for logical GPU; it is an abstraction used to subdivide a physical DRM device for the purpose of resource management.
The lgpu is a discrete quantity that is device specific (e.g. some DRM devices may have 64 lgpus while others may have 100). The lgpu is a single quantity with two representations, denoted by the following nested keys.
===== ========================================
count Representing lgpu as anonymous resource
list  Representing lgpu as named resource
===== ========================================
For example:
226:0 count=256 list=0-255
226:1 count=4 list=0,2,4,6
226:2 count=32 list=32-63
lgpu is represented by a bitmap and uses the bitmap_parselist kernel function so the list key input format is a comma-separated list of decimal numbers and ranges.
Consecutively set bits are shown as two hyphen-separated decimal numbers, the smallest and largest bit numbers set in the range. Optionally each range can be postfixed to denote that only parts of it should be set. The range will be divided into groups of a specific size.
Syntax: range:used_size/group_size
Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769
The count key is the Hamming weight (hweight) of the bitmap.
Both count and list accept the max and default keywords.
Some DRM devices may only support lgpu as anonymous resources. In such cases, the significance of the position of the set bits in list will be ignored.
This lgpu resource supports the 'allocation' resource distribution model.
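A hypothetical usage sketch (device numbers illustrative), matching the interface described above:

# Restrict this cgroup to lgpus 0-31 of /dev/dri/card0
# (major:minor 226:0), named explicitly.
echo "226:0 list=0-31" > drm.lgpu

# Equivalent by count; the implementation sets the lowest bits first.
echo "226:0 count=32" > drm.lgpu

cat drm.lgpu
# 226:0 count=32 list=0-31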
Change-Id: I1afcacf356770930c7f925df043e51ad06ceb98e Signed-off-by: Kenny Ho Kenny.Ho@amd.com --- Documentation/admin-guide/cgroup-v2.rst | 46 ++++++++ include/drm/drm_cgroup.h | 4 + include/linux/cgroup_drm.h | 6 ++ kernel/cgroup/drm.c | 135 ++++++++++++++++++++++++ 4 files changed, 191 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 87a195133eaa..57f18469bd76 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1958,6 +1958,52 @@ DRM Interface Files Set largest allocation for /dev/dri/card1 to 4MB echo "226:1 4m" > drm.buffer.peak.max
+ drm.lgpu + A read-write nested-keyed file which exists on all cgroups. + Each entry is keyed by the DRM device's major:minor. + + lgpu stands for logical GPU, it is an abstraction used to + subdivide a physical DRM device for the purpose of resource + management. + + The lgpu is a discrete quantity that is device specific (i.e. + some DRM devices may have 64 lgpus while others may have 100 + lgpus.) The lgpu is a single quantity with two representations + denoted by the following nested keys. + + ===== ======================================== + count Representing lgpu as anonymous resource + list Representing lgpu as named resource + ===== ======================================== + + For example: + 226:0 count=256 list=0-255 + 226:1 count=4 list=0,2,4,6 + 226:2 count=32 list=32-63 + + lgpu is represented by a bitmap and uses the bitmap_parselist + kernel function so the list key input format is a + comma-separated list of decimal numbers and ranges. + + Consecutively set bits are shown as two hyphen-separated decimal + numbers, the smallest and largest bit numbers set in the range. + Optionally each range can be postfixed to denote that only parts + of it should be set. The range will divided to groups of + specific size. + Syntax: range:used_size/group_size + Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769 + + The count key is the hamming weight / hweight of the bitmap. + + Both count and list accept the max and default keywords. + + Some DRM devices may only support lgpu as anonymous resources. + In such case, the significance of the position of the set bits + in list will be ignored. + + This lgpu resource supports the 'allocation' resource + distribution model. + GEM Buffer Ownership ~~~~~~~~~~~~~~~~~~~~
diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index 6d9707e1eb72..a8d6be0b075b 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -6,6 +6,7 @@
#include <linux/cgroup_drm.h> #include <linux/workqueue.h> +#include <linux/types.h> #include <drm/ttm/ttm_bo_api.h> #include <drm/ttm/ttm_bo_driver.h>
@@ -28,6 +29,9 @@ struct drmcg_props { s64 mem_highs_default[TTM_PL_PRIV+1];
struct work_struct *mem_reclaim_wq[TTM_PL_PRIV]; + + int lgpu_capacity; + DECLARE_BITMAP(lgpu_slots, MAX_DRMCG_LGPU_CAPACITY); };
#ifdef CONFIG_CGROUP_DRM diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index c56cfe74d1a6..7b1cfc4ce4c3 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -14,6 +14,8 @@ /* limit defined per the way drm_minor_alloc operates */ #define MAX_DRM_DEV (64 * DRM_MINOR_RENDER)
+#define MAX_DRMCG_LGPU_CAPACITY 256 + enum drmcg_mem_bw_attr { DRMCG_MEM_BW_ATTR_BYTE_MOVED, /* for calulating 'instantaneous' bw */ DRMCG_MEM_BW_ATTR_ACCUM_US, /* for calulating 'instantaneous' bw */ @@ -32,6 +34,7 @@ enum drmcg_res_type { DRMCG_TYPE_MEM_PEAK, DRMCG_TYPE_BANDWIDTH, DRMCG_TYPE_BANDWIDTH_PERIOD_BURST, + DRMCG_TYPE_LGPU, __DRMCG_TYPE_LAST, };
@@ -58,6 +61,9 @@ struct drmcg_device_resource { s64 mem_bw_stats[__DRMCG_MEM_BW_ATTR_LAST]; s64 mem_bw_limits_bytes_in_period; s64 mem_bw_limits_avg_bytes_per_us; + + s64 lgpu_used; + DECLARE_BITMAP(lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY); };
/** diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c index 0ea7f0619e25..18c4368e2c29 100644 --- a/kernel/cgroup/drm.c +++ b/kernel/cgroup/drm.c @@ -9,6 +9,7 @@ #include <linux/cgroup_drm.h> #include <linux/ktime.h> #include <linux/kernel.h> +#include <linux/bitmap.h> #include <drm/drm_file.h> #include <drm/drm_drv.h> #include <drm/ttm/ttm_bo_api.h> @@ -52,6 +53,9 @@ static char const *mem_bw_attr_names[] = { #define MEM_BW_LIMITS_NAME_AVG "avg_bytes_per_us" #define MEM_BW_LIMITS_NAME_BURST "bytes_in_period"
+#define LGPU_LIMITS_NAME_LIST "list" +#define LGPU_LIMITS_NAME_COUNT "count" + static struct drmcg *root_drmcg __read_mostly;
static int drmcg_css_free_fn(int id, void *ptr, void *data) @@ -115,6 +119,10 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev) for (i = 0; i <= TTM_PL_PRIV; i++) ddr->mem_highs[i] = dev->drmcg_props.mem_highs_default[i];
+	bitmap_copy(ddr->lgpu_allocated, dev->drmcg_props.lgpu_slots,
+			MAX_DRMCG_LGPU_CAPACITY);
+	ddr->lgpu_used = bitmap_weight(ddr->lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
+	mutex_unlock(&dev->drmcg_mutex);

 	return 0;
 }

@@ -280,6 +288,14 @@ static void drmcg_print_limits(struct drmcg_device_resource *ddr,
 				MEM_BW_LIMITS_NAME_AVG,
 				ddr->mem_bw_limits_avg_bytes_per_us);
 		break;
+	case DRMCG_TYPE_LGPU:
+		seq_printf(sf, "%s=%lld %s=%*pbl\n",
+				LGPU_LIMITS_NAME_COUNT,
+				ddr->lgpu_used,
+				LGPU_LIMITS_NAME_LIST,
+				dev->drmcg_props.lgpu_capacity,
+				ddr->lgpu_allocated);
+		break;
 	default:
 		seq_puts(sf, "\n");
 		break;

@@ -314,6 +330,15 @@ static void drmcg_print_default(struct drmcg_props *props,
 				MEM_BW_LIMITS_NAME_AVG,
 				props->mem_bw_avg_bytes_per_us_default);
 		break;
+	case DRMCG_TYPE_LGPU:
+		seq_printf(sf, "%s=%d %s=%*pbl\n",
+				LGPU_LIMITS_NAME_COUNT,
+				bitmap_weight(props->lgpu_slots,
+					props->lgpu_capacity),
+				LGPU_LIMITS_NAME_LIST,
+				props->lgpu_capacity,
+				props->lgpu_slots);
+		break;
 	default:
 		seq_puts(sf, "\n");
 		break;

@@ -407,9 +432,21 @@ static void drmcg_value_apply(struct drm_device *dev, s64 *dst, s64 val)
 	mutex_unlock(&dev->drmcg_mutex);
 }

+static void drmcg_lgpu_values_apply(struct drm_device *dev,
+		struct drmcg_device_resource *ddr, unsigned long *val)
+{
+
+	mutex_lock(&dev->drmcg_mutex);
+	bitmap_copy(ddr->lgpu_allocated, val, MAX_DRMCG_LGPU_CAPACITY);
+	ddr->lgpu_used = bitmap_weight(ddr->lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
+	mutex_unlock(&dev->drmcg_mutex);
+}
+
 static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
 		struct drm_device *dev, char *attrs)
 {
+	DECLARE_BITMAP(tmp_bitmap, MAX_DRMCG_LGPU_CAPACITY);
+	DECLARE_BITMAP(chk_bitmap, MAX_DRMCG_LGPU_CAPACITY);
 	enum drmcg_res_type type =
 		DRMCG_CTF_PRIV2RESTYPE(of_cft(of)->private);
 	struct drmcg *drmcg = css_to_drmcg(of_css(of));

@@ -501,6 +538,83 @@ static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
 				continue;
 			}
 			break; /* DRMCG_TYPE_MEM */
+		case DRMCG_TYPE_LGPU:
+			if (strncmp(sname, LGPU_LIMITS_NAME_LIST, 256) &&
+				strncmp(sname, LGPU_LIMITS_NAME_COUNT, 256))
+				continue;
+
+			if (!strcmp("max", sval) ||
+					!strcmp("default", sval)) {
+				if (parent != NULL)
+					drmcg_lgpu_values_apply(dev, ddr,
+						parent->dev_resources[minor]->
+						lgpu_allocated);
+				else
+					drmcg_lgpu_values_apply(dev, ddr,
+						props->lgpu_slots);
+
+				continue;
+			}
+
+			if (strncmp(sname, LGPU_LIMITS_NAME_COUNT, 256) == 0) {
+				p_max = parent == NULL ? props->lgpu_capacity :
+					bitmap_weight(
+						parent->dev_resources[minor]->
+						lgpu_allocated,
+						props->lgpu_capacity);
+
+				rc = drmcg_process_limit_s64_val(sval,
+					false, p_max, p_max, &val);
+
+				if (rc || val < 0) {
+					drmcg_pr_cft_err(drmcg, rc, cft_name,
+							minor);
+					continue;
+				}
+
+				bitmap_zero(tmp_bitmap,
+						MAX_DRMCG_LGPU_CAPACITY);
+				bitmap_set(tmp_bitmap, 0, val);
+			}
+
+			if (strncmp(sname, LGPU_LIMITS_NAME_LIST, 256) == 0) {
+				rc = bitmap_parselist(sval, tmp_bitmap,
+						MAX_DRMCG_LGPU_CAPACITY);
+
+				if (rc) {
+					drmcg_pr_cft_err(drmcg, rc, cft_name,
+							minor);
+					continue;
+				}
+
+				bitmap_andnot(chk_bitmap, tmp_bitmap,
+						props->lgpu_slots,
+						MAX_DRMCG_LGPU_CAPACITY);
+
+				if (!bitmap_empty(chk_bitmap,
+						MAX_DRMCG_LGPU_CAPACITY)) {
+					drmcg_pr_cft_err(drmcg, 0, cft_name,
+							minor);
+					continue;
+				}
+			}
+
+			if (parent != NULL) {
+				bitmap_and(chk_bitmap, tmp_bitmap,
+					parent->dev_resources[minor]->lgpu_allocated,
+					props->lgpu_capacity);
+
+				if (bitmap_empty(chk_bitmap,
+						props->lgpu_capacity)) {
+					drmcg_pr_cft_err(drmcg, 0,
+							cft_name, minor);
+					continue;
+				}
+			}
+
+			drmcg_lgpu_values_apply(dev, ddr, tmp_bitmap);
+
+			break; /* DRMCG_TYPE_LGPU */
 		default:
 			break;
 		} /* switch (type) */

@@ -606,6 +720,7 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf,
 			break;
 		case DRMCG_TYPE_BANDWIDTH:
 		case DRMCG_TYPE_MEM:
+		case DRMCG_TYPE_LGPU:
 			drmcg_nested_limit_parse(of, dm->dev, sattr);
 			break;
 		default:

@@ -731,6 +846,20 @@ struct cftype files[] = {
 		.private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH,
 						DRMCG_FTYPE_DEFAULT),
 	},
+	{
+		.name = "lgpu",
+		.seq_show = drmcg_seq_show,
+		.write = drmcg_limit_write,
+		.private = DRMCG_CTF_PRIV(DRMCG_TYPE_LGPU,
+						DRMCG_FTYPE_LIMIT),
+	},
+	{
+		.name = "lgpu.default",
+		.seq_show = drmcg_seq_show,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.private = DRMCG_CTF_PRIV(DRMCG_TYPE_LGPU,
+						DRMCG_FTYPE_DEFAULT),
+	},
 	{ }	/* terminate */
 };

@@ -744,6 +873,10 @@ struct cgroup_subsys drm_cgrp_subsys = {

 static inline void drmcg_update_cg_tree(struct drm_device *dev)
 {
+	bitmap_zero(dev->drmcg_props.lgpu_slots, MAX_DRMCG_LGPU_CAPACITY);
+	bitmap_fill(dev->drmcg_props.lgpu_slots,
+			dev->drmcg_props.lgpu_capacity);
+
 	/* init cgroups created before registration (i.e. root cgroup) */
 	if (root_drmcg != NULL) {
 		struct cgroup_subsys_state *pos;

@@ -800,6 +933,8 @@ void drmcg_device_early_init(struct drm_device *dev)
 	for (i = 0; i <= TTM_PL_PRIV; i++)
 		dev->drmcg_props.mem_highs_default[i] = S64_MAX;

+	dev->drmcg_props.lgpu_capacity = MAX_DRMCG_LGPU_CAPACITY;
+
 	drmcg_update_cg_tree(dev);
 }
 EXPORT_SYMBOL(drmcg_device_early_init);
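To make the validation in drmcg_nested_limit_parse() above concrete: a written list is rejected if it names any slot the device does not expose (the bitmap_andnot() against props->lgpu_slots must come out empty) and, for a non-root cgroup, if it shares no slot at all with the parent's allocation (the bitmap_and() against the parent's lgpu_allocated must come out non-empty). Note the check as written requires only overlap with the parent, not a strict subset. A minimal user-space sketch of the same two checks, with plain uint64_t masks standing in for the kernel bitmap API and every name invented for illustration:

	#include <stdint.h>
	#include <stdio.h>

	/* Toy version of the two checks in drmcg_nested_limit_parse():
	 * 64 slots instead of MAX_DRMCG_LGPU_CAPACITY, uint64_t instead
	 * of the kernel bitmap API.
	 */
	static int lgpu_list_ok(uint64_t requested, uint64_t device_slots,
				uint64_t parent_allocated)
	{
		/* bitmap_andnot() + bitmap_empty(): no bit outside the device */
		if (requested & ~device_slots)
			return 0;
		/* bitmap_and() + !bitmap_empty(): some overlap with the parent */
		if (!(requested & parent_allocated))
			return 0;
		return 1;
	}

	int main(void)
	{
		uint64_t device = 0xFFull;	/* device exposes slots 0-7 */
		uint64_t parent = 0xF0ull;	/* parent holds slots 4-7 */

		printf("%d\n", lgpu_list_ok(0x30, device, parent));	/* 1: slots 4,5 */
		printf("%d\n", lgpu_list_ok(0x03, device, parent));	/* 0: no overlap */
		printf("%d\n", lgpu_list_ok(0x100, device, parent));	/* 0: not on device */
		return 0;
	}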
On 2019-08-29 2:05 a.m., Kenny Ho wrote:
drm.lgpu
	A read-write nested-keyed file which exists on all cgroups.
	Each entry is keyed by the DRM device's major:minor.

	lgpu stands for logical GPU; it is an abstraction used to
	subdivide a physical DRM device for the purpose of resource
	management.

	The lgpu is a discrete quantity that is device specific (i.e.
	some DRM devices may have 64 lgpus while others may have 100
	lgpus.)  The lgpu is a single quantity with two representations
	denoted by the following nested keys.

	  =====     ========================================
	  count     Representing lgpu as an anonymous resource
	  list      Representing lgpu as a named resource
	  =====     ========================================

	For example:
	226:0 count=256 list=0-255
	226:1 count=4 list=0,2,4,6
	226:2 count=32 list=32-63

	lgpu is represented by a bitmap and uses the bitmap_parselist
	kernel function, so the list key input format is a
	comma-separated list of decimal numbers and ranges.

	Consecutively set bits are shown as two hyphen-separated decimal
	numbers, the smallest and largest bit numbers set in the range.
	Optionally each range can be postfixed to denote that only parts
	of it should be set; the range will be divided into groups of a
	specific size.
	Syntax: range:used_size/group_size
	Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769

	The count key is the hamming weight / hweight of the bitmap.

	Both count and list accept the max and default keywords.

	Some DRM devices may only support lgpu as anonymous resources.
	In such a case, the significance of the position of the set bits
	in list will be ignored.

	This lgpu resource supports the 'allocation' resource
	distribution model.
Change-Id: I1afcacf356770930c7f925df043e51ad06ceb98e
Signed-off-by: Kenny Ho <Kenny.Ho@amd.com>
The description sounds reasonable to me and maps well to the CU masking feature in our GPUs.
It would also allow us to do more coarse-grained masking for example to guarantee balanced allocation of CUs across shader engines or partitioning of memory bandwidth or CP pipes (if that is supported by the hardware/firmware).
I can't comment on the code as I'm unfamiliar with the details of the cgroup code.
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
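A side note on the count/list duality described in the commit message: count is just the population count (hweight) of the bitmap that list spells out, and writing count=N is equivalent to writing a list of the first N permitted slots, which is what the bitmap_zero()/bitmap_set() pair in the patch implements. A toy user-space parser that handles only plain "a,b-c" lists (the kernel's bitmap_parselist() additionally understands the range:used_size/group_size postfix; all names here are invented):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Parse a cpuset-style "0,2,4-7" list into a 64-bit mask and
	 * report its hamming weight, which is what the count key shows.
	 */
	static uint64_t parse_list(const char *s)
	{
		uint64_t mask = 0;
		char buf[128];

		strncpy(buf, s, sizeof(buf) - 1);
		buf[sizeof(buf) - 1] = '\0';

		for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
			unsigned lo, hi;

			if (sscanf(tok, "%u-%u", &lo, &hi) != 2)
				hi = lo = (unsigned)strtoul(tok, NULL, 10);
			for (unsigned b = lo; b <= hi && b < 64; b++)
				mask |= 1ull << b;
		}
		return mask;
	}

	int main(void)
	{
		uint64_t m = parse_list("0,2,4-7");

		printf("count=%d\n", __builtin_popcountll(m));	/* count=6 */
		return 0;
	}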
On Tue, Oct 08, 2019 at 06:53:18PM +0000, Kuehling, Felix wrote:
The description sounds reasonable to me and maps well to the CU masking feature in our GPUs.
It would also allow us to do more coarse-grained masking for example to guarantee balanced allocation of CUs across shader engines or partitioning of memory bandwidth or CP pipes (if that is supported by the hardware/firmware).
Hm, so this sounds like the definition for how this cgroup is supposed to work is "amd CU masking" (whatever that exactly is). And the abstract description is just prettification on top, but not actually the real definition you guys want.
I think adding a cgroup that depends this much upon the hw implementation of the first driver supporting it is not a good idea. -Daniel
Hi Daniel,
Can you elaborate on what you mean in more detail? The goal of lgpu is to provide the ability to subdivide a GPU device and give those slices to different users as needed. I don't think there is anything controversial or vendor specific here, as requests for this are well documented. The underlying representation is just a bitmap, which is neither unprecedented nor vendor specific (a bitmap is used in cpuset, for instance.)
An implementation of this abstraction is not hardware specific either. For example, one can treat a virtual function in SR-IOV as an lgpu. Alternatively, a device can declare that it has 100 lgpus and treat the lgpu quantity as a percentage representation of GPU subdivision. The fact that an abstraction works well with a vendor implementation does not make it a "prettification" of a vendor feature (by this logic, I hope you are not implying that an abstraction is only valid if it does not work with amd CU masking, because that seems fairly partisan.)
Did I misread your characterization of this patch?
Regards, Kenny
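A sketch of the flexibility Kenny describes, assuming invented names throughout: the controller fixes only the bookkeeping, so a driver is free to decide what one slot means when it fills in lgpu_capacity.

	#include <stdio.h>

	/* Stand-in for the lgpu fields of struct drmcg_props. */
	struct toy_props {
		int lgpu_capacity;
	};

	/* Interpretation 1: one lgpu per SR-IOV virtual function. */
	static void toy_init_sriov(struct toy_props *p, int num_vfs)
	{
		p->lgpu_capacity = num_vfs;	/* e.g. 16 VFs -> 16 lgpus */
	}

	/* Interpretation 2: lgpus as percentage points of the whole GPU. */
	static void toy_init_percent(struct toy_props *p)
	{
		p->lgpu_capacity = 100;		/* count=25 then means 25% */
	}

	int main(void)
	{
		struct toy_props a, b;

		toy_init_sriov(&a, 16);
		toy_init_percent(&b);
		printf("sriov capacity=%d, percent capacity=%d\n",
		       a.lgpu_capacity, b.lgpu_capacity);
		return 0;
	}

In both interpretations the cgroup-facing files and the enforcement logic stay the same; only the driver's mapping from slots to hardware differs.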
On Wed, Oct 09, 2019 at 11:08:45AM -0400, Kenny Ho wrote:
Did I misread your characterization of this patch?
Scenario: I'm a gpgpu customer, and I write some gpgpu program (probably in cuda, transpiled for amd using rocm).
How does the stuff I'm seeing in cuda (or vk compute, or whatever) map to the bitmasks I can set in this cgroup controller?
That's the stuff which this spec needs to explain. Currently the answer is "amd CU masking", and that's not going to work on e.g. nvidia hw. We need to come up with end-user relevant resources/meanings for these bits which work across vendors.
On cpu a "cpu core" is rather well-defined, and customers actually know what it means on intel, amd, ibm powerpc or arm. Both on the program side (e.g. what do I need to stuff into relevant system calls to run on a specific "cpu core") and on the admin side.
We need to achieve the same for gpus. "it's a bitmask" is not even close enough imo. -Daniel
On 2019-10-09 6:31, Daniel Vetter wrote:
Hm, so this sounds like the definition for how this cgroup is supposed to work is "amd CU masking" (whatever that exactly is). And the abstract description is just prettification on top, but not actually the real definition you guys want.
I think you're reading this as the opposite of what I was trying to say. Using CU masking is one possible implementation of LGPUs on AMD hardware. It's the one that Kenny implemented at the end of this patch series, and I pointed out some problems with that approach. Other ways to partition the hardware into LGPUs are conceivable. For example we're considering splitting it along the lines of shader engines, which is more coarse-grain and would also affect memory bandwidth available to each partition.
We could also consider partitioning pipes in our command processor, although that is not supported by our current CP scheduler firmware.
The bottom line is, the LGPU model proposed by Kenny is quite abstract and allows drivers implementing it a lot of flexibility depending on the capability of their hardware and firmware. We haven't settled on a final implementation choice even for AMD.
Regards, Felix
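The coarse-grained option Felix describes fits the same bitmap by grouping slots, much as the range:used_size/group_size list syntax groups bits. A purely illustrative sketch (the group size, engine count, and all names are invented, not from any driver) of deriving a per-shader-engine enable mask from an lgpu allocation:

	#include <stdint.h>
	#include <stdio.h>

	/* Treat each contiguous group of lgpu bits as one shader engine
	 * (SE); an SE is enabled when any bit in its group is set.
	 */
	#define LGPU_BITS_PER_SE	16
	#define NUM_SE			4

	static uint64_t se_enable_mask(uint64_t lgpu_bitmap)
	{
		uint64_t se_mask = 0;

		for (int se = 0; se < NUM_SE; se++) {
			uint64_t group = (lgpu_bitmap >> (se * LGPU_BITS_PER_SE)) &
					 ((1ull << LGPU_BITS_PER_SE) - 1);

			if (group)
				se_mask |= 1ull << se;
		}
		return se_mask;
	}

	int main(void)
	{
		/* lgpu slots 0-15 and 32-47 allocated -> SEs 0 and 2, so 0x5 */
		printf("se mask = 0x%llx\n",
		       (unsigned long long)se_enable_mask(0x0000FFFF0000FFFFull));
		return 0;
	}

A driver choosing this scheme would program the hardware per shader engine rather than per slot, while the cgroup interface above stays unchanged.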
I think adding a cgroup which is that much depending upon the hw implementation of the first driver supporting it is not a good idea. -Daniel
I can't comment on the code as I'm unfamiliar with the details of the cgroup code.
Acked-by: Felix Kuehling Felix.Kuehling@amd.com
Documentation/admin-guide/cgroup-v2.rst | 46 ++++++++ include/drm/drm_cgroup.h | 4 + include/linux/cgroup_drm.h | 6 ++ kernel/cgroup/drm.c | 135 ++++++++++++++++++++++++ 4 files changed, 191 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 87a195133eaa..57f18469bd76 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1958,6 +1958,52 @@ DRM Interface Files Set largest allocation for /dev/dri/card1 to 4MB echo "226:1 4m" > drm.buffer.peak.max
- drm.lgpu
- A read-write nested-keyed file which exists on all cgroups.
- Each entry is keyed by the DRM device's major:minor.
- lgpu stands for logical GPU, it is an abstraction used to
- subdivide a physical DRM device for the purpose of resource
- management.
- The lgpu is a discrete quantity that is device specific (i.e.
- some DRM devices may have 64 lgpus while others may have 100
- lgpus.) The lgpu is a single quantity with two representations
- denoted by the following nested keys.
===== ========================================
count Representing lgpu as anonymous resource
list Representing lgpu as named resource
===== ========================================
- For example:
- 226:0 count=256 list=0-255
- 226:1 count=4 list=0,2,4,6
- 226:2 count=32 list=32-63
- lgpu is represented by a bitmap and uses the bitmap_parselist
- kernel function so the list key input format is a
- comma-separated list of decimal numbers and ranges.
- Consecutively set bits are shown as two hyphen-separated decimal
- numbers, the smallest and largest bit numbers set in the range.
- Optionally each range can be postfixed to denote that only parts
- of it should be set. The range will divided to groups of
- specific size.
- Syntax: range:used_size/group_size
- Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769
- The count key is the hamming weight / hweight of the bitmap.
- Both count and list accept the max and default keywords.
- Some DRM devices may only support lgpu as anonymous resources.
- In such case, the significance of the position of the set bits
- in list will be ignored.
- This lgpu resource supports the 'allocation' resource
- distribution model.
- GEM Buffer Ownership
diff --git a/include/drm/drm_cgroup.h b/include/drm/drm_cgroup.h index 6d9707e1eb72..a8d6be0b075b 100644 --- a/include/drm/drm_cgroup.h +++ b/include/drm/drm_cgroup.h @@ -6,6 +6,7 @@
#include <linux/cgroup_drm.h> #include <linux/workqueue.h> +#include <linux/types.h> #include <drm/ttm/ttm_bo_api.h> #include <drm/ttm/ttm_bo_driver.h>
@@ -28,6 +29,9 @@ struct drmcg_props { s64 mem_highs_default[TTM_PL_PRIV+1];
struct work_struct *mem_reclaim_wq[TTM_PL_PRIV];
int lgpu_capacity;
DECLARE_BITMAP(lgpu_slots, MAX_DRMCG_LGPU_CAPACITY);
};
#ifdef CONFIG_CGROUP_DRM
diff --git a/include/linux/cgroup_drm.h b/include/linux/cgroup_drm.h index c56cfe74d1a6..7b1cfc4ce4c3 100644 --- a/include/linux/cgroup_drm.h +++ b/include/linux/cgroup_drm.h @@ -14,6 +14,8 @@ /* limit defined per the way drm_minor_alloc operates */ #define MAX_DRM_DEV (64 * DRM_MINOR_RENDER)
+#define MAX_DRMCG_LGPU_CAPACITY 256
- enum drmcg_mem_bw_attr { DRMCG_MEM_BW_ATTR_BYTE_MOVED, /* for calulating 'instantaneous' bw */ DRMCG_MEM_BW_ATTR_ACCUM_US, /* for calulating 'instantaneous' bw */
@@ -32,6 +34,7 @@ enum drmcg_res_type {
 	DRMCG_TYPE_MEM_PEAK,
 	DRMCG_TYPE_BANDWIDTH,
 	DRMCG_TYPE_BANDWIDTH_PERIOD_BURST,
+	DRMCG_TYPE_LGPU,
 	__DRMCG_TYPE_LAST,
 };
@@ -58,6 +61,9 @@ struct drmcg_device_resource {
 	s64			mem_bw_stats[__DRMCG_MEM_BW_ATTR_LAST];
 	s64			mem_bw_limits_bytes_in_period;
 	s64			mem_bw_limits_avg_bytes_per_us;
+
+	s64			lgpu_used;
+	DECLARE_BITMAP(lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
 };
/**
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 0ea7f0619e25..18c4368e2c29 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -9,6 +9,7 @@
 #include <linux/cgroup_drm.h>
 #include <linux/ktime.h>
 #include <linux/kernel.h>
+#include <linux/bitmap.h>
 #include <drm/drm_file.h>
 #include <drm/drm_drv.h>
 #include <drm/ttm/ttm_bo_api.h>
@@ -52,6 +53,9 @@ static char const *mem_bw_attr_names[] = {
 #define MEM_BW_LIMITS_NAME_AVG		"avg_bytes_per_us"
 #define MEM_BW_LIMITS_NAME_BURST	"bytes_in_period"
+#define LGPU_LIMITS_NAME_LIST		"list"
+#define LGPU_LIMITS_NAME_COUNT		"count"
static struct drmcg *root_drmcg __read_mostly;
static int drmcg_css_free_fn(int id, void *ptr, void *data)
@@ -115,6 +119,10 @@ static inline int init_drmcg_single(struct drmcg *drmcg, struct drm_device *dev)
 	for (i = 0; i <= TTM_PL_PRIV; i++)
 		ddr->mem_highs[i] = dev->drmcg_props.mem_highs_default[i];
 
+	bitmap_copy(ddr->lgpu_allocated, dev->drmcg_props.lgpu_slots,
+			MAX_DRMCG_LGPU_CAPACITY);
+	ddr->lgpu_used = bitmap_weight(ddr->lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
+
 	mutex_unlock(&dev->drmcg_mutex);
 	return 0;
 }
@@ -280,6 +288,14 @@ static void drmcg_print_limits(struct drmcg_device_resource *ddr,
 				MEM_BW_LIMITS_NAME_AVG,
 				ddr->mem_bw_limits_avg_bytes_per_us);
 		break;
+	case DRMCG_TYPE_LGPU:
+		seq_printf(sf, "%s=%lld %s=%*pbl\n",
+				LGPU_LIMITS_NAME_COUNT,
+				ddr->lgpu_used,
+				LGPU_LIMITS_NAME_LIST,
+				dev->drmcg_props.lgpu_capacity,
+				ddr->lgpu_allocated);
+		break;
 	default:
 		seq_puts(sf, "\n");
 		break;
@@ -314,6 +330,15 @@ static void drmcg_print_default(struct drmcg_props *props,
 				MEM_BW_LIMITS_NAME_AVG,
 				props->mem_bw_avg_bytes_per_us_default);
 		break;
+	case DRMCG_TYPE_LGPU:
+		seq_printf(sf, "%s=%d %s=%*pbl\n",
+				LGPU_LIMITS_NAME_COUNT,
+				bitmap_weight(props->lgpu_slots,
+					props->lgpu_capacity),
+				LGPU_LIMITS_NAME_LIST,
+				props->lgpu_capacity,
+				props->lgpu_slots);
+		break;
 	default:
 		seq_puts(sf, "\n");
 		break;
@@ -407,9 +432,21 @@ static void drmcg_value_apply(struct drm_device *dev, s64 *dst, s64 val)
 	mutex_unlock(&dev->drmcg_mutex);
 }
 
+static void drmcg_lgpu_values_apply(struct drm_device *dev,
+		struct drmcg_device_resource *ddr, unsigned long *val)
+{
+	mutex_lock(&dev->drmcg_mutex);
+	bitmap_copy(ddr->lgpu_allocated, val, MAX_DRMCG_LGPU_CAPACITY);
+	ddr->lgpu_used = bitmap_weight(ddr->lgpu_allocated, MAX_DRMCG_LGPU_CAPACITY);
+	mutex_unlock(&dev->drmcg_mutex);
+}
+
 static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
 		struct drm_device *dev, char *attrs)
 {
+	DECLARE_BITMAP(tmp_bitmap, MAX_DRMCG_LGPU_CAPACITY);
+	DECLARE_BITMAP(chk_bitmap, MAX_DRMCG_LGPU_CAPACITY);
 	enum drmcg_res_type type =
 		DRMCG_CTF_PRIV2RESTYPE(of_cft(of)->private);
 	struct drmcg *drmcg = css_to_drmcg(of_css(of));
@@ -501,6 +538,83 @@ static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
 			continue;
 		}
 		break;	/* DRMCG_TYPE_MEM */
+	case DRMCG_TYPE_LGPU:
+		if (strncmp(sname, LGPU_LIMITS_NAME_LIST, 256) &&
+			strncmp(sname, LGPU_LIMITS_NAME_COUNT, 256))
+			continue;
+
+		if (!strcmp("max", sval) ||
+				!strcmp("default", sval)) {
+			if (parent != NULL)
+				drmcg_lgpu_values_apply(dev, ddr,
+					parent->dev_resources[minor]->
+					lgpu_allocated);
+			else
+				drmcg_lgpu_values_apply(dev, ddr,
+					props->lgpu_slots);
+
+			continue;
+		}
+
+		if (strncmp(sname, LGPU_LIMITS_NAME_COUNT, 256) == 0) {
+			p_max = parent == NULL ? props->lgpu_capacity :
+				bitmap_weight(
+					parent->dev_resources[minor]->
+					lgpu_allocated, props->lgpu_capacity);
+
+			rc = drmcg_process_limit_s64_val(sval,
+				false, p_max, p_max, &val);
+
+			if (rc || val < 0) {
+				drmcg_pr_cft_err(drmcg, rc, cft_name,
+						minor);
+				continue;
+			}
+
+			bitmap_zero(tmp_bitmap,
+					MAX_DRMCG_LGPU_CAPACITY);
+			bitmap_set(tmp_bitmap, 0, val);
+		}
+
+		if (strncmp(sname, LGPU_LIMITS_NAME_LIST, 256) == 0) {
+			rc = bitmap_parselist(sval, tmp_bitmap,
+					MAX_DRMCG_LGPU_CAPACITY);
+
+			if (rc) {
+				drmcg_pr_cft_err(drmcg, rc, cft_name,
+						minor);
+				continue;
+			}
+
+			bitmap_andnot(chk_bitmap, tmp_bitmap,
+				props->lgpu_slots,
+				MAX_DRMCG_LGPU_CAPACITY);
+
+			if (!bitmap_empty(chk_bitmap,
+					MAX_DRMCG_LGPU_CAPACITY)) {
+				drmcg_pr_cft_err(drmcg, 0, cft_name,
+						minor);
+				continue;
+			}
+		}
+
+		if (parent != NULL) {
+			bitmap_and(chk_bitmap, tmp_bitmap,
+				parent->dev_resources[minor]->lgpu_allocated,
+				props->lgpu_capacity);
+
+			if (bitmap_empty(chk_bitmap,
+					props->lgpu_capacity)) {
+				drmcg_pr_cft_err(drmcg, 0,
+						cft_name, minor);
+				continue;
+			}
+		}
+
+		drmcg_lgpu_values_apply(dev, ddr, tmp_bitmap);
+		break;	/* DRMCG_TYPE_LGPU */
 	default:
 		break;
 	}	/* switch (type) */
@@ -606,6 +720,7 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf,
 			break;
 		case DRMCG_TYPE_BANDWIDTH:
 		case DRMCG_TYPE_MEM:
+		case DRMCG_TYPE_LGPU:
 			drmcg_nested_limit_parse(of, dm->dev, sattr);
 			break;
 		default:
@@ -731,6 +846,20 @@ struct cftype files[] = {
 		.private = DRMCG_CTF_PRIV(DRMCG_TYPE_BANDWIDTH,
 						DRMCG_FTYPE_DEFAULT),
 	},
+	{
+		.name = "lgpu",
+		.seq_show = drmcg_seq_show,
+		.write = drmcg_limit_write,
+		.private = DRMCG_CTF_PRIV(DRMCG_TYPE_LGPU,
+						DRMCG_FTYPE_LIMIT),
+	},
+	{
+		.name = "lgpu.default",
+		.seq_show = drmcg_seq_show,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.private = DRMCG_CTF_PRIV(DRMCG_TYPE_LGPU,
+						DRMCG_FTYPE_DEFAULT),
+	},
 	{ }	/* terminate */
 };
@@ -744,6 +873,10 @@ struct cgroup_subsys drm_cgrp_subsys = {
 static inline void drmcg_update_cg_tree(struct drm_device *dev)
 {
+	bitmap_zero(dev->drmcg_props.lgpu_slots, MAX_DRMCG_LGPU_CAPACITY);
+	bitmap_fill(dev->drmcg_props.lgpu_slots,
+			dev->drmcg_props.lgpu_capacity);
+
 	/* init cgroups created before registration (i.e. root cgroup) */
 	if (root_drmcg != NULL) {
 		struct cgroup_subsys_state *pos;
@@ -800,6 +933,8 @@ void drmcg_device_early_init(struct drm_device *dev)
 	for (i = 0; i <= TTM_PL_PRIV; i++)
 		dev->drmcg_props.mem_highs_default[i] = S64_MAX;
 
+	dev->drmcg_props.lgpu_capacity = MAX_DRMCG_LGPU_CAPACITY;
+
 	drmcg_update_cg_tree(dev);
 }
 EXPORT_SYMBOL(drmcg_device_early_init);
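To make the drm.lgpu interface above concrete, here is a usage sketch in the style of the documentation's drm.buffer examples. The cgroup path and the 226:0 device key are illustrative assumptions, not values mandated by the patch:

  # Claim lgpus 0-7 on card0 for this cgroup; the bits must exist in
  # the device's lgpu_slots and overlap the parent's allocation, per
  # the checks in drmcg_nested_limit_parse().
  echo "226:0 list=0-7" > /sys/fs/cgroup/mygroup/drm.lgpu

  # Ask for 8 anonymous lgpus; per the count handling above, this
  # sets the lowest 8 bits (bitmap_set(tmp_bitmap, 0, val)).
  echo "226:0 count=8" > /sys/fs/cgroup/mygroup/drm.lgpu

  # Revert to the parent's allocation (or the full device for root).
  echo "226:0 default" > /sys/fs/cgroup/mygroup/drm.lgpu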
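The validation in the DRMCG_TYPE_LGPU case reduces to two bitmap tests: the requested bits must all exist in the device's lgpu_slots, and, for a non-root cgroup, must at least overlap the parent's lgpu_allocated (note the patch checks for overlap with bitmap_and/bitmap_empty, not for a strict subset). A minimal userspace sketch of that rule, using a plain unsigned long in place of the kernel bitmap API; the function and variable names here are mine, not the patch's:

  #include <stdbool.h>
  #include <stdio.h>

  /* Mirrors the two checks in drmcg_nested_limit_parse(). */
  static bool lgpu_request_ok(unsigned long requested,
                              unsigned long device_slots,
                              unsigned long parent_allocated)
  {
          /* bitmap_andnot(chk, req, slots): reject bits the device lacks. */
          if (requested & ~device_slots)
                  return false;
          /* bitmap_and(chk, req, parent): reject if there is no overlap. */
          if ((requested & parent_allocated) == 0)
                  return false;
          return true;
  }

  int main(void)
  {
          unsigned long slots = 0xffUL;   /* device exposes lgpus 0-7 */
          unsigned long parent = 0x0fUL;  /* parent holds lgpus 0-3 */

          printf("%d\n", lgpu_request_ok(0x03UL, slots, parent));  /* 1: within parent */
          printf("%d\n", lgpu_request_ok(0xf0UL, slots, parent));  /* 0: outside parent */
          printf("%d\n", lgpu_request_ok(0x100UL, slots, parent)); /* 0: no such lgpu */
          return 0;
  }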
On Wed, Oct 09, 2019 at 03:25:22PM +0000, Kuehling, Felix wrote:
On 2019-10-09 6:31, Daniel Vetter wrote:
On Tue, Oct 08, 2019 at 06:53:18PM +0000, Kuehling, Felix wrote:
The description sounds reasonable to me and maps well to the CU masking feature in our GPUs.
It would also allow us to do more coarse-grained masking for example to guarantee balanced allocation of CUs across shader engines or partitioning of memory bandwidth or CP pipes (if that is supported by the hardware/firmware).
Hm, so this sounds like the definition for how this cgroup is supposed to work is "amd CU masking" (whatever that exactly is). And the abstract description is just prettification on top, but not actually the real definition you guys want.
I think you're reading this as the opposite of what I was trying to say. Using CU masking is one possible implementation of LGPUs on AMD hardware. It's the one that Kenny implemented at the end of this patch series, and I pointed out some problems with that approach. Other ways to partition the hardware into LGPUs are conceivable. For example we're considering splitting it along the lines of shader engines, which is more coarse-grain and would also affect memory bandwidth available to each partition.
If this is supposed to be useful for admins then "other ways to partition the hw are conceivable" is the problem. This should be unique&clear for admins/end-users. Reading the implementation details and realizing that the actual meaning is "amd CU masking" isn't good enough by far, since that's meaningless on any other hw.
And if there's other ways to implement this cgroup for amd, it's also meaningless (to sysadmins/users) for amd hw.
We could also consider partitioning pipes in our command processor, although that is not supported by our current CP scheduler firmware.
The bottom line is, the LGPU model proposed by Kenny is quite abstract and allows drivers implementing it a lot of flexibility depending on the capability of their hardware and firmware. We haven't settled on a final implementation choice even for AMD.
That abstract model of essentially "anything goes" is the problem here imo. E.g. for cpu cgroups this would be similar to allowing the bitmasks to mean "cpu core" on one machine, "physical die" on the next, and maybe "hyperthread unit" on the 3rd. Useless for admins.
So if we have a gpu bitmask thing that might mean a command submission pipe on one hw (maybe matching what vk exposes, maybe not), some compute unit mask on the next, and something entirely different (e.g. intel has so-called GT slices with compute cores + more stuff around) on the 3rd vendor, then that's not useful for admins. -Daniel
On 2019-10-09 11:34, Daniel Vetter wrote:
<snip>
The goal is to partition GPU compute resources to eliminate as much resource contention as possible between different partitions. Different hardware will have different capabilities to implement this. No implementation will be perfect. For example, even with CPU cores that are supposedly well defined, you can still have different behaviours depending on CPU cache architectures, NUMA and thermal management across CPU cores. The admin will need some knowledge of their hardware architecture to understand those effects that are not described by the abstract model of cgroups.
The LGPU model is deliberately flexible, because GPU architectures are much less standardized than CPU architectures. Expecting a common model that is both very specific and applicable to all GPUs is unrealistic, in my opinion.
Regards, Felix
On Wed, Oct 09, 2019 at 03:53:42PM +0000, Kuehling, Felix wrote:
<snip>
That's not the point I was making. For cpu cgroups there's a very well defined connection between the cpu bitmasks/numbers in cgroups and the cpu bitmasks you use in various system calls (they match). And that stuff works across vendors.
We need the same for gpus.
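For reference, the cpu-side property being described is easy to demonstrate: the CPU numbers an admin writes into cpuset.cpus are the same numbers a process uses with the affinity syscalls. A small illustrative C program (the CPU number here is an arbitrary example):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
          cpu_set_t set;

          /* CPU 2 here is the same "2" an admin would write into
           * cpuset.cpus (e.g. "echo 0-3 > cpuset.cpus"); the cgroup
           * file and the syscall share one numbering across vendors. */
          CPU_ZERO(&set);
          CPU_SET(2, &set);
          if (sched_setaffinity(0, sizeof(set), &set) == 0)
                  printf("pinned to cpu 2\n");
          return 0;
  }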
So pure abstraction isn't useful, we need to know what these bits mean. Since if they e.g. mean vk pipes, then maybe I shouldn't be using those vk pipes in my application anymore. Or we need to define that the userspace driver needs to filter out any pipes that aren't accessible (if that's possible, no idea).
cgroups that essentially have pure hw dependent meaning aren't useful. Note: this is about the fundamental meaning, not about the more unclear isolation guarantees (which are indeed hw specific on different cpu platforms). We're not talking about "different gpus might have different amounts of shared caches between different bitmasks". We're talking "different gpus might assign completely different meaning to these bitmasks". -Daniel
From: Daniel Vetter <daniel.vetter@ffwll.ch> On Behalf Of Daniel Vetter
Sent: Wednesday, October 9, 2019 11:07 AM
On Wed, Oct 09, 2019 at 03:53:42PM +0000, Kuehling, Felix wrote:
<snip>
One thing that comes to mind is the OpenCL 1.2+ SubDevices mechanism: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateSubDe...
The concept of LGPU in cgroups seems to match up nicely with an OpenCL SubDevice, at least for compute tasks. We want to divide up the device and give some configurable subset of it to the user as a logical GPU or sub-device.
OpenCL defines Compute Units (CUs), and any GPU vendor that runs OpenCL has some mapping of their internal compute resources to this concept of CUs. Off the top of my head (I may be misremembering some of these):
- AMD: Compute Units (CUs)
- ARM: Shader Cores (SCs)
- Intel: Execution Units (EUs)
- Nvidia: Streaming Multiprocessors (SMs)
- Qualcomm: Shader Processors (SPs)
The clCreateSubDevices() API has a variety of ways to slice and dice these compute resources across sub-devices. PARTITION_EQUALLY and PARTITION_BY_COUNTS could possibly be handled by a simple high-level mechanism that just allows you to request some percentage of the available GPU compute resources.
PARTITION_BY_AFFINITY_DOMAIN, however, splits up the CUs based on lower-level information such as what cache levels are shared or what NUMA domain a collection of CUs is in. I would argue that a runtime that wants to do this needs to know a bit about the mapping of CUs to underlying hardware resources.
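For reference, a sketch of what these three partitioning modes look like as OpenCL 1.2 host calls (error handling omitted; the device handle is assumed to come from a normal clGetDeviceIDs() query):

  #include <CL/cl.h>

  void partition_examples(cl_device_id dev)
  {
          cl_device_id sub[4];
          cl_uint n;

          /* PARTITION_EQUALLY: sub-devices of 4 CUs each. */
          cl_device_partition_property equally[] = {
                  CL_DEVICE_PARTITION_EQUALLY, 4, 0 };
          clCreateSubDevices(dev, equally, 4, sub, &n);

          /* PARTITION_BY_COUNTS: one sub-device with 3 CUs, one with 1. */
          cl_device_partition_property by_counts[] = {
                  CL_DEVICE_PARTITION_BY_COUNTS, 3, 1,
                  CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0 };
          clCreateSubDevices(dev, by_counts, 2, sub, &n);

          /* PARTITION_BY_AFFINITY_DOMAIN: split along NUMA boundaries;
           * this is the mode that needs topology knowledge. */
          cl_device_partition_property by_numa[] = {
                  CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
                  CL_DEVICE_AFFINITY_DOMAIN_NUMA, 0 };
          clCreateSubDevices(dev, by_numa, 4, sub, &n);
  }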
A cgroup implementation that presented a CU bitmap could sit at the bottom of all three of these partitioning schemes, and more advanced ones if they come up. We might be getting side-tracked by the fact that AMD calls its resources CUs. The OpenCL (or Vulkan, etc.) concept of a Compute Unit is cross-vendor. The concept of targeting work to [Khronos-defined] Compute Units isn't AMD-specific. A bitmap of [Khronos-defined] CUs could map to any of these broad vendor compute resources.
There may be other parts of the GPU that we want to divide up -- command queue resources, pipes, render backends, etc. I'm not sure if any of those have been "standardized" between GPUs to such an extent that they make sense to put into cgroups yet -- I'm ignorant outside of the compute world. But at least the concept of CUs (or SMs, or EUs, etc.) seems to be standard across GPUs and (to me anyway) seems like a reasonable place to allow administrators, developers, users, etc. to divide up their GPUs.
And whatever mechanisms a GPU vendor may put in place to do clCreateSubDevices() could then be additionally used inside the kernel for their cgroups LGPU partitioning.
Thanks -Joe
On Wed, Oct 9, 2019 at 8:52 PM Greathouse, Joseph Joseph.Greathouse@amd.com wrote:
From: Daniel Vetter daniel.vetter@ffwll.ch On Behalf Of Daniel Vetter Sent: Wednesday, October 9, 2019 11:07 AM On Wed, Oct 09, 2019 at 03:53:42PM +0000, Kuehling, Felix wrote:
On 2019-10-09 11:34, Daniel Vetter wrote:
On Wed, Oct 09, 2019 at 03:25:22PM +0000, Kuehling, Felix wrote:
On 2019-10-09 6:31, Daniel Vetter wrote:
On Tue, Oct 08, 2019 at 06:53:18PM +0000, Kuehling, Felix wrote: > The description sounds reasonable to me and maps well to the CU masking > feature in our GPUs. > > It would also allow us to do more coarse-grained masking for example to > guarantee balanced allocation of CUs across shader engines or > partitioning of memory bandwidth or CP pipes (if that is supported by > the hardware/firmware). Hm, so this sounds like the definition for how this cgroup is supposed to work is "amd CU masking" (whatever that exactly is). And the abstract description is just prettification on top, but not actually the real definition you guys want.
I think you're reading this as the opposite of what I was trying to say. Using CU masking is one possible implementation of LGPUs on AMD hardware. It's the one that Kenny implemented at the end of this patch series, and I pointed out some problems with that approach. Other ways to partition the hardware into LGPUs are conceivable. For example we're considering splitting it along the lines of shader engines, which is more coarse-grain and would also affect memory bandwidth available to each partition.
If this is supposed to be useful for admins then "other ways to partition the hw are conceivable" is the problem. This should be unique&clear for admins/end-users. Reading the implementation details and realizing that the actual meaning is "amd CU masking" isn't good enough by far, since that's meaningless on any other hw.
And if there's other ways to implement this cgroup for amd, it's also meaningless (to sysadmins/users) for amd hw.
We could also consider partitioning pipes in our command processor, although that is not supported by our current CP scheduler firmware.
The bottom line is, the LGPU model proposed by Kenny is quite abstract and allows drivers implementing it a lot of flexibility depending on the capability of their hardware and firmware. We haven't settled on a final implementation choice even for AMD.
That abstract model of essentially "anything goes" is the problem here imo. E.g. for cpu cgroups this would be similar to allowing the bitmaks to mean "cpu core" on one machine "physical die" on the next and maybe "hyperthread unit" on the 3rd. Useless for admins.
So if we have a gpu bitmaks thing that might mean a command submissio pipe on one hw (maybe matching what vk exposed, maybe not), some compute unit mask on the next and something entirely different (e.g. intel has so called GT slices with compute cores + more stuff around) on the 3rd vendor then that's not useful for admins.
The goal is to partition GPU compute resources to eliminate as much resource contention as possible between different partitions. Different hardware will have different capabilities to implement this. No implementation will be perfect. For example, even with CPU cores that are supposedly well defined, you can still have different behaviours depending on CPU cache architectures, NUMA and thermal management across CPU cores. The admin will need some knowledge of their hardware architecture to understand those effects that are not described by the abstract model of cgroups.
That's not the point I was making. For cpu cgroups there's a very well defined connection between the cpu bitmasks/numbers in cgroups and the cpu bitmasks you use in various system calls (they match). And that stuff works across vendors.
We need the same for gpus.
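For reference, a minimal sketch of the correspondence being described (error handling kept minimal; the cgroup setting mentioned in the comment is illustrative): the CPU numbers an admin writes into cpuset.cpus are the same logical CPU indices that sched_setaffinity() takes, so the two interfaces compose predictably.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t mask;

        /* Suppose the admin wrote "2-3" to this task's cpuset.cpus.
         * The application can pin itself within that partition using
         * the very same CPU numbers. */
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);
        CPU_SET(3, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask))
                perror("sched_setaffinity");

        return 0;
}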
The LGPU model is deliberately flexible, because GPU architectures are much less standardized than CPU architectures. Expecting a common model that is both very specific and applicable to all GPUs is unrealistic, in my opinion.
So pure abstraction isn't useful, we need to know what these bits mean. Since if they e.g. mean vk pipes, then maybe I shouldn't be using those vk pipes in my application anymore. Or we need to define that the userspace driver needs to filter out any pipes that aren't accessible (if that's possible, no idea).
cgroups that essentially have purely hw-dependent meaning aren't useful. Note: this is about the fundamental meaning, not about the more unclear isolation guarantees (which are indeed hw specific on different cpu platforms). We're not talking about "different gpus might have different amounts of shared caches between different bitmasks". We're talking "different gpus might assign completely different meaning to these bitmasks". -Daniel
<snip>
One thing that comes to mind is the OpenCL 1.2+ SubDevices mechanism: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateSubDe...
The concept of LGPU in cgroups seems to match up nicely with an OpenCL SubDevice, at least for compute tasks. We want to divide up the device and give some configurable subset of it to the user as a logical GPU or sub-device.
OpenCL defines Compute Units (CUs), and any GPU vendor that runs OpenCL has some mapping of their internal compute resources to this concept of CUs. Off the top of my head (I may be misremembering some of these):
- AMD: Compute Units (CUs)
- ARM: Shader Cores (SCs)
- Intel: Execution Units (EUs)
- Nvidia: Streaming Multiprocessors (SMs)
- Qualcomm: Shader Processors (SPs)
The clCreateSubDevices() API has a variety of ways to slice and dice these compute resources across sub-devices. PARTITION_EQUALLY and PARTITION_BY_COUNTS could possibly be handled by a simple high-level mechanism that just allows you to request some percentage of the available GPU compute resources.
PARTITION_BY_AFFINITY_DOMAIN, however, splits up the CUs based on lower-level information such as what cache levels are shared or what NUMA domain a collection of CUs is in. I would argue that a runtime that wants to do this needs to know a bit about the mapping of CUs to underlying hardware resources.
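As a concrete sketch of the two flavours (assuming an OpenCL 1.2 device `dev` already obtained from clGetDeviceIDs; error handling omitted):

#include <CL/cl.h>

void partition_examples(cl_device_id dev)
{
        cl_device_id subdevs[4];
        cl_uint n_subdevs, n_cus;

        /* How many Khronos-defined CUs does this device expose? */
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(n_cus), &n_cus, NULL);

        /* PARTITION_EQUALLY: carve the device into sub-devices of 8 CUs each */
        const cl_device_partition_property equally[] = {
                CL_DEVICE_PARTITION_EQUALLY, 8, 0
        };
        clCreateSubDevices(dev, equally, 4, subdevs, &n_subdevs);

        /* PARTITION_BY_AFFINITY_DOMAIN: split along cache/NUMA boundaries */
        const cl_device_partition_property affinity[] = {
                CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
                CL_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE, 0
        };
        clCreateSubDevices(dev, affinity, 4, subdevs, &n_subdevs);
}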
A cgroup implementation that presented a CU bitmap could sit at the bottom of all three of these partitioning schemes, and more advanced ones if they come up. We might be getting side-tracked by the fact that AMD calls its resources CUs. The OpenCL (or Vulkan, etc.) concept of a Compute Unit is cross-vendor. The concept of targeting work to [Khronos-defined] Compute Units isn't AMD-specific. A bitmap of [Khronos-defined] CUs could map to any of these broad vendor compute resources.
There may be other parts of the GPU that we want to divide up -- command queue resources, pipes, render backends, etc. I'm not sure if any of those have been "standardized" between GPUs to such an extent that they make sense to put into cgroups yet -- I'm ignorant outside of the compute world. But at least the concept of CUs (or SMs, or EUs, etc.) seems to be standard across GPUs and (to me anyway) seems like a reasonable place to allow administrators, developers, users, etc. to divide up their GPUs.
And whatever mechanisms a GPU vendor may put in place to do clCreateSubDevices() could then be additionally used inside the kernel for their cgroups LGPU partitioning.
Yeah this is the stuff I meant. I quickly checked intel's CL driver, and from a quick look we don't support that. Adding Karol, who might know whether this works on nvidia hw and how. If OpenCL CUs don't really apply to more than amdgpu, then that's not really helping much with making this stuff more broadly useful. -Daniel
Hello, Daniel.
On Wed, Oct 09, 2019 at 06:06:52PM +0200, Daniel Vetter wrote:
That's not the point I was making. For cpu cgroups there's a very well defined connection between the cpu bitmasks/numbers in cgroups and the cpu bitmasks you use in various system calls (they match). And that stuff works across vendors.
Please note that there are a lot of limitations even to cpuset. Affinity is easy to implement and seems attractive in terms of absolute isolation but it's inherently cumbersome and limited in granularity and can lead to surprising failure modes where contention on one cpu can't be resolved by the load balancer and leads to system wide slowdowns / stalls caused by the dependency chain anchored at the affinity limited tasks.
Maybe this is less of a problem for gpu workloads but in general the more constraints are put on scheduling, the more likely the system is to develop twisted dependency chains while other parts of the system are sitting idle.
How does scheduling currently work when there are competing gpu workloads? There's got to be some fairness provision, whether that's unit-allocation based or time slicing, right? If that's the case, it might be best to implement proportional control on top of that. Work-conserving mechanisms are the most versatile, easiest to use and least likely to cause regressions.
Thanks.
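For comparison, the existing cgroup2 cpu controller expresses both models referred to here; a minimal sketch of driving the two knobs (error handling omitted, the cgroup path is illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd;

        /* Work-conserving, proportional: twice the default weight of
         * 100; only takes effect under contention. */
        fd = open("/sys/fs/cgroup/app/cpu.weight", O_WRONLY);
        write(fd, "200", 3);
        close(fd);

        /* Hard limit: at most 50ms of CPU per 100ms period, enforced
         * even when the machine is otherwise idle. */
        fd = open("/sys/fs/cgroup/app/cpu.max", O_WRONLY);
        write(fd, "50000 100000", 12);
        close(fd);

        return 0;
}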
On 2019-10-11 1:12 p.m., tj@kernel.org wrote:
Hello, Daniel.
On Wed, Oct 09, 2019 at 06:06:52PM +0200, Daniel Vetter wrote:
That's not the point I was making. For cpu cgroups there's a very well defined connection between the cpu bitmasks/numbers in cgroups and the cpu bitmasks you use in various system calls (they match). And that stuff works across vendors.
Please note that there are a lot of limitations even to cpuset. Affinity is easy to implement and seems attractive in terms of absolute isolation but it's inherently cumbersome and limited in granularity and can lead to surprising failure modes where contention on one cpu can't be resolved by the load balancer and leads to system wide slowdowns / stalls caused by the dependency chain anchored at the affinity limited tasks.
Maybe this is less of a problem for gpu workloads but in general the more constraints are put on scheduling, the more likely the system is to develop twisted dependency chains while other parts of the system are sitting idle.
How does scheduling currently work when there are competing gpu workloads? There's got to be some fairness provision, whether that's unit-allocation based or time slicing, right?
The scheduling of competing workloads on GPUs is handled in hardware and firmware. The Linux kernel and driver are not really involved. We have some knobs we can tweak in the driver (queue and pipe priorities, resource reservations for certain types of workloads), but they are pretty HW-specific and I wouldn't make any claims about fairness.
Regards, Felix
If that's the case, it might be best to implement proportional control on top of that. Work-conserving mechanisms are the most versatile, easiest to use and least likely to cause regressions.
Thanks.
Before this commit, drmcg limits are updated but enforcement is delayed until the next time the driver checks against the new limit. While this is sufficient for certain resources, more proactive enforcement may be needed for other resources.
This introduces an optional drmcg_limit_updated callback for DRM drivers. When defined, it is called in two scenarios: 1) when limits are updated for a particular cgroup, the callback is triggered for each task in the updated cgroup; 2) when a task is migrated from one cgroup to another, the callback is triggered for each resource type for the migrated task.
Change-Id: I68187a72818b855b5f295aefcb241cda8ab63b00
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 include/drm/drm_drv.h | 10 ++++++++
 kernel/cgroup/drm.c   | 57 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index c8a37a08d98d..7e588b874a27 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -669,6 +669,16 @@ struct drm_driver {
 	void (*drmcg_custom_init)(struct drm_device *dev,
 			struct drmcg_props *props);
 
+	/**
+	 * @drmcg_limit_updated
+	 *
+	 * Optional callback
+	 */
+	void (*drmcg_limit_updated)(struct drm_device *dev,
+			struct task_struct *task,
+			struct drmcg_device_resource *ddr,
+			enum drmcg_res_type res_type);
+
 	/**
 	 * @gem_vm_ops: Driver private ops for this object
 	 */
diff --git a/kernel/cgroup/drm.c b/kernel/cgroup/drm.c
index 18c4368e2c29..99772e5d9ccc 100644
--- a/kernel/cgroup/drm.c
+++ b/kernel/cgroup/drm.c
@@ -621,6 +621,23 @@ static void drmcg_nested_limit_parse(struct kernfs_open_file *of,
 	}
 }
 
+static void drmcg_limit_updated(struct drm_device *dev, struct drmcg *drmcg,
+		enum drmcg_res_type res_type)
+{
+	struct drmcg_device_resource *ddr =
+		drmcg->dev_resources[dev->primary->index];
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&drmcg->css.cgroup->self,
+			CSS_TASK_ITER_PROCS, &it);
+	while ((task = css_task_iter_next(&it))) {
+		dev->driver->drmcg_limit_updated(dev, task,
+				ddr, res_type);
+	}
+	css_task_iter_end(&it);
+}
+
 static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf,
 		size_t nbytes, loff_t off)
 {
@@ -726,6 +743,10 @@ static ssize_t drmcg_limit_write(struct kernfs_open_file *of, char *buf,
 		default:
 			break;
 		}
+
+		if (dm->dev->driver->drmcg_limit_updated)
+			drmcg_limit_updated(dm->dev, drmcg, type);
+
 		drm_dev_put(dm->dev); /* release from drm_minor_acquire */
 	}
@@ -863,9 +884,45 @@ struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+static int drmcg_attach_fn(int id, void *ptr, void *data)
+{
+	struct drm_minor *minor = ptr;
+	struct task_struct *task = data;
+	struct drm_device *dev;
+
+	if (minor->type != DRM_MINOR_PRIMARY)
+		return 0;
+
+	dev = minor->dev;
+
+	if (dev->driver->drmcg_limit_updated) {
+		struct drmcg *drmcg = drmcg_get(task);
+		struct drmcg_device_resource *ddr =
+			drmcg->dev_resources[minor->index];
+		enum drmcg_res_type type;
+
+		for (type = 0; type < __DRMCG_TYPE_LAST; type++)
+			dev->driver->drmcg_limit_updated(dev, task, ddr, type);
+
+		drmcg_put(drmcg);
+	}
+
+	return 0;
+}
+
+static void drmcg_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *css;
+
+	cgroup_taskset_for_each(task, css, tset)
+		drm_minor_for_each(&drmcg_attach_fn, task);
+}
+
 struct cgroup_subsys drm_cgrp_subsys = {
 	.css_alloc	= drmcg_css_alloc,
 	.css_free	= drmcg_css_free,
+	.attach		= drmcg_attach,
 	.early_init	= false,
 	.legacy_cftypes	= files,
 	.dfl_cftypes	= files,
The number of logical GPUs (lgpu) is defined to be the number of compute units (CUs) of a device. The lgpu allocation limit only applies to compute workloads for the moment (enforced via kfd queue creation). Any cu_mask update is validated against the availability of the compute units as defined by the drmcg that the kfd process belongs to.
Change-Id: I69a57452c549173a1cd623c30dc57195b3b6563e
Signed-off-by: Kenny Ho Kenny.Ho@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  21 +++
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |   6 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h      |   3 +
 .../amd/amdkfd/kfd_process_queue_manager.c | 139 ++++++++++++++++++
 5 files changed, 173 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 55cb1b2094fd..369915337213 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -198,6 +198,10 @@ uint8_t amdgpu_amdkfd_get_xgmi_hops_count(struct kgd_dev *dst, struct kgd_dev *s
 	valid;							\
 })
 
+int amdgpu_amdkfd_update_cu_mask_for_process(struct task_struct *task,
+		struct amdgpu_device *adev, unsigned long *lgpu_bitmap,
+		unsigned int nbits);
+
 /* GPUVM API */
 int amdgpu_amdkfd_gpuvm_create_process_vm(struct kgd_dev *kgd, unsigned int pasid,
 		void **vm, void **process_info,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 163a4fbf0611..8abeffdd2e5b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1398,9 +1398,29 @@ amdgpu_get_crtc_scanout_position(struct drm_device *dev, unsigned int pipe,
 static void amdgpu_drmcg_custom_init(struct drm_device *dev,
 	struct drmcg_props *props)
 {
+	struct amdgpu_device *adev = dev->dev_private;
+
+	props->lgpu_capacity = adev->gfx.cu_info.number;
+
 	props->limit_enforced = true;
 }
 
+static void amdgpu_drmcg_limit_updated(struct drm_device *dev,
+		struct task_struct *task, struct drmcg_device_resource *ddr,
+		enum drmcg_res_type res_type)
+{
+	struct amdgpu_device *adev = dev->dev_private;
+
+	switch (res_type) {
+	case DRMCG_TYPE_LGPU:
+		amdgpu_amdkfd_update_cu_mask_for_process(task, adev,
+			ddr->lgpu_allocated, dev->drmcg_props.lgpu_capacity);
+		break;
+	default:
+		break;
+	}
+}
+
 static struct drm_driver kms_driver = {
 	.driver_features =
 	    DRIVER_USE_AGP | DRIVER_ATOMIC |
@@ -1438,6 +1458,7 @@ static struct drm_driver kms_driver = {
 	.gem_prime_mmap = amdgpu_gem_prime_mmap,
 
 	.drmcg_custom_init = amdgpu_drmcg_custom_init,
+	.drmcg_limit_updated = amdgpu_drmcg_limit_updated,
 
 	.name = DRIVER_NAME,
 	.desc = DRIVER_DESC,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 138c70454e2b..fa765b803f97 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -450,6 +450,12 @@ static int kfd_ioctl_set_cu_mask(struct file *filp, struct kfd_process *p,
 		return -EFAULT;
 	}
 
+	if (!pqm_drmcg_lgpu_validate(p, args->queue_id, properties.cu_mask, cu_mask_size)) {
+		pr_debug("CU mask not permitted by DRM Cgroup");
+		kfree(properties.cu_mask);
+		return -EACCES;
+	}
+
 	mutex_lock(&p->mutex);
 
 	retval = pqm_set_cu_mask(&p->pqm, args->queue_id, &properties);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 8b0eee5b3521..88881bec7550 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1038,6 +1038,9 @@ int pqm_get_wave_state(struct process_queue_manager *pqm,
 		       u32 *ctl_stack_used_size,
 		       u32 *save_area_used_size);
 
+bool pqm_drmcg_lgpu_validate(struct kfd_process *p, int qid, u32 *cu_mask,
+		unsigned int cu_mask_size);
+
 int amdkfd_fence_wait_timeout(unsigned int *fence_addr,
 			      unsigned int fence_value,
 			      unsigned int timeout_ms);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 7e6c3ee82f5b..a896de290307 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -23,9 +23,11 @@
 
 #include <linux/slab.h>
 #include <linux/list.h>
+#include <linux/cgroup_drm.h>
 #include "kfd_device_queue_manager.h"
 #include "kfd_priv.h"
 #include "kfd_kernel_queue.h"
+#include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
 static inline struct process_queue_node *get_queue_by_qid(
@@ -167,6 +169,7 @@ static int create_cp_queue(struct process_queue_manager *pqm,
 				struct queue_properties *q_properties,
 				struct file *f, unsigned int qid)
 {
+	struct drmcg *drmcg;
 	int retval;
 
 	/* Doorbell initialized in user space*/
@@ -180,6 +183,36 @@ static int create_cp_queue(struct process_queue_manager *pqm,
 	if (retval != 0)
 		return retval;
 
+	drmcg = drmcg_get(pqm->process->lead_thread);
+	if (drmcg) {
+		struct amdgpu_device *adev;
+		struct drmcg_device_resource *ddr;
+		int mask_size;
+		u32 *mask;
+
+		adev = (struct amdgpu_device *) dev->kgd;
+
+		mask_size = adev->ddev->drmcg_props.lgpu_capacity;
+		mask = kzalloc(sizeof(u32) * round_up(mask_size, 32),
+				GFP_KERNEL);
+
+		if (!mask) {
+			drmcg_put(drmcg);
+			uninit_queue(*q);
+			return -ENOMEM;
+		}
+
+		ddr = drmcg->dev_resources[adev->ddev->primary->index];
+
+		bitmap_to_arr32(mask, ddr->lgpu_allocated, mask_size);
+
+		(*q)->properties.cu_mask_count = mask_size;
+		(*q)->properties.cu_mask = mask;
+
+		drmcg_put(drmcg);
+	}
+
 	(*q)->device = dev;
 	(*q)->process = pqm->process;
 
@@ -495,6 +528,112 @@ int pqm_get_wave_state(struct process_queue_manager *pqm,
 				       save_area_used_size);
 }
 
+bool pqm_drmcg_lgpu_validate(struct kfd_process *p, int qid, u32 *cu_mask,
+		unsigned int cu_mask_size)
+{
+	DECLARE_BITMAP(curr_mask, MAX_DRMCG_LGPU_CAPACITY);
+	struct drmcg_device_resource *ddr;
+	struct process_queue_node *pqn;
+	struct amdgpu_device *adev;
+	struct drmcg *drmcg;
+	bool result;
+
+	if (cu_mask_size > MAX_DRMCG_LGPU_CAPACITY)
+		return false;
+
+	bitmap_from_arr32(curr_mask, cu_mask, cu_mask_size);
+
+	pqn = get_queue_by_qid(&p->pqm, qid);
+	if (!pqn)
+		return false;
+
+	adev = (struct amdgpu_device *)pqn->q->device->kgd;
+
+	drmcg = drmcg_get(p->lead_thread);
+	ddr = drmcg->dev_resources[adev->ddev->primary->index];
+
+	if (bitmap_subset(curr_mask, ddr->lgpu_allocated,
+			MAX_DRMCG_LGPU_CAPACITY))
+		result = true;
+	else
+		result = false;
+
+	drmcg_put(drmcg);
+
+	return result;
+}
+
+int amdgpu_amdkfd_update_cu_mask_for_process(struct task_struct *task,
+		struct amdgpu_device *adev, unsigned long *lgpu_bm,
+		unsigned int lgpu_bm_size)
+{
+	struct kfd_dev *kdev = adev->kfd.dev;
+	struct process_queue_node *pqn;
+	struct kfd_process *kfdproc;
+	size_t size_in_bytes;
+	u32 *cu_mask;
+	int rc = 0;
+
+	if ((lgpu_bm_size % 32) != 0) {
+		pr_warn("lgpu_bm_size %d must be a multiple of 32",
+			lgpu_bm_size);
+		return -EINVAL;
+	}
+
+	kfdproc = kfd_get_process(task);
+
+	if (IS_ERR(kfdproc))
+		return -ESRCH;
+
+	size_in_bytes = sizeof(u32) * round_up(lgpu_bm_size, 32);
+
+	mutex_lock(&kfdproc->mutex);
+	list_for_each_entry(pqn, &kfdproc->pqm.queues, process_queue_list) {
+		if (pqn->q && pqn->q->device == kdev) {
+			/* update cu_mask accordingly */
+			cu_mask = kzalloc(size_in_bytes, GFP_KERNEL);
+			if (!cu_mask) {
+				rc = -ENOMEM;
+				break;
+			}
+
+			if (pqn->q->properties.cu_mask) {
+				/* curr_mask is an on-stack bitmap; it must
+				 * not be kfree'd */
+				DECLARE_BITMAP(curr_mask,
+						MAX_DRMCG_LGPU_CAPACITY);
+
+				if (pqn->q->properties.cu_mask_count >
+						lgpu_bm_size) {
+					rc = -EINVAL;
+					kfree(cu_mask);
+					break;
+				}
+
+				bitmap_from_arr32(curr_mask,
+						pqn->q->properties.cu_mask,
+						pqn->q->properties.cu_mask_count);
+
+				bitmap_and(curr_mask, curr_mask, lgpu_bm,
+						lgpu_bm_size);
+
+				bitmap_to_arr32(cu_mask, curr_mask,
+						lgpu_bm_size);
+			} else
+				bitmap_to_arr32(cu_mask, lgpu_bm,
+						lgpu_bm_size);
+
+			pqn->q->properties.cu_mask = cu_mask;
+			pqn->q->properties.cu_mask_count = lgpu_bm_size;
+
+			rc = pqn->q->device->dqm->ops.update_queue(
+					pqn->q->device->dqm, pqn->q);
+		}
+	}
+	mutex_unlock(&kfdproc->mutex);
+
+	return rc;
+}
+
 #if defined(CONFIG_DEBUG_FS)
int pqm_debugfs_mqds(struct seq_file *m, void *data)
On 2019-08-29 2:05 a.m., Kenny Ho wrote:
The number of logical GPUs (lgpu) is defined to be the number of compute units (CUs) of a device. The lgpu allocation limit only applies to compute workloads for the moment (enforced via kfd queue creation). Any cu_mask update is validated against the availability of the compute units as defined by the drmcg that the kfd process belongs to.
There is something missing here. There is an API for the application to specify a CU mask. Right now it looks like the application-specified and CGroup-specified CU masks would clobber each other. Instead the two should be merged.
The CGroup-specified mask should specify a subset of CUs available for application-specified CU masks. When the cgroup CU mask changes, you'd need to take any application-specified CU masks into account before updating the hardware.
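A minimal sketch of that merge, using the kernel bitmap helpers the patch already relies on (the function and parameter names other than bitmap_and are illustrative):

#include <linux/bitmap.h>

/* Constrain the application's requested CU mask by the cgroup's
 * allocation instead of letting one overwrite the other. */
static void merge_cu_masks(unsigned long *effective,
			   const unsigned long *app_requested,
			   const unsigned long *drmcg_allocated,
			   unsigned int nbits)
{
	bitmap_and(effective, app_requested, drmcg_allocated, nbits);
}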
The KFD topology APIs report the number of available CUs to the application. CGroups would change that number at runtime and applications would not expect that. I think the best way to deal with that would be to have multiple bits in the application-specified CU mask map to the same CU. How to do that in a fair way is not obvious. I guess a more coarse-grain division of the GPU into LGPUs would make this somewhat easier.
How is this problem handled for CPU cores and the interaction with CPU pthread_setaffinity_np?
Regards, Felix
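For reference, on the CPU side the two interfaces compose by restriction rather than remapping: an affinity request that falls entirely outside the task's cpuset is rejected, and changing cpuset.cpus overwrites previously set affinity masks. A small illustration (the CPU numbers and cpuset contents are assumptions):

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>

int main(void)
{
        cpu_set_t want;

        /* Suppose this task's cgroup has cpuset.cpus = "0-3". */
        CPU_ZERO(&want);
        CPU_SET(7, &want);

        /* No overlap with the cpuset: the request fails with EINVAL
         * instead of being silently merged. */
        int err = pthread_setaffinity_np(pthread_self(), sizeof(want), &want);

        return err == EINVAL ? 0 : 1;
}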
Hello,
I just glanced through the interface and don't have enough context to give any kind of detailed review yet. I'll try to read up and understand more and would greatly appreciate if you can give me some pointers to read up on the resources being controlled and how the actual use cases would look like. That said, I have some basic concerns.
* TTM vs. GEM distinction seems to be internal implementation detail rather than anything relating to underlying physical resources. Provided that's the case, I'm afraid these internal constructs being used as primary resource control objects likely isn't the right approach. Whether a given driver uses one or the other internal abstraction layer shouldn't determine how resources are represented at the userland interface layer.
* While breaking up and applying control to different types of internal objects may seem attractive to folks who work day in and day out with the subsystem, they aren't all that useful to users and the siloed controls are likely to make the whole mechanism a lot less useful. We had the same problem with cgroup1 memcg - putting control of different uses of memory under separate knobs. It made the whole thing pretty useless. e.g. if you constrain all knobs tight enough to control the overall usage, overall utilization suffers, but if you don't, you really don't have control over actual usage. For memcg, what has to be allocated and controlled is physical memory, no matter how they're used. It's not like you can go buy more "socket" memory. At least from the looks of it, I'm afraid gpu controller is repeating the same mistakes.
Thanks.
On Fri, Aug 30, 2019 at 09:28:57PM -0700, Tejun Heo wrote:
Hello,
I just glanced through the interface and don't have enough context to give any kind of detailed review yet. I'll try to read up and understand more and would greatly appreciate if you can give me some pointers to read up on the resources being controlled and how the actual use cases would look like. That said, I have some basic concerns.
- TTM vs. GEM distinction seems to be internal implementation detail rather than anything relating to underlying physical resources. Provided that's the case, I'm afraid these internal constructs being used as primary resource control objects likely isn't the right approach. Whether a given driver uses one or the other internal abstraction layer shouldn't determine how resources are represented at the userland interface layer.
Yeah there's another RFC series from Brian Welty to abstract this away as a memory region concept for gpus.
- While breaking up and applying control to different types of internal objects may seem attractive to folks who work day in and day out with the subsystem, they aren't all that useful to users and the siloed controls are likely to make the whole mechanism a lot less useful. We had the same problem with cgroup1 memcg - putting control of different uses of memory under separate knobs. It made the whole thing pretty useless. e.g. if you constrain all knobs tight enough to control the overall usage, overall utilization suffers, but if you don't, you really don't have control over actual usage. For memcg, what has to be allocated and controlled is physical memory, no matter how they're used. It's not like you can go buy more "socket" memory. At least from the looks of it, I'm afraid gpu controller is repeating the same mistakes.
We do have quite a pile of different memories and ranges, so I don't think we're doing the same mistake here. But it is maybe a bit too complicated, and exposes stuff that most users really don't care about. -Daniel
Hello, Daniel.
On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
- While breaking up and applying control to different types of internal objects may seem attractive to folks who work day in and day out with the subsystem, they aren't all that useful to users and the siloed controls are likely to make the whole mechanism a lot less useful. We had the same problem with cgroup1 memcg - putting control of different uses of memory under separate knobs. It made the whole thing pretty useless. e.g. if you constrain all knobs tight enough to control the overall usage, overall utilization suffers, but if you don't, you really don't have control over actual usage. For memcg, what has to be allocated and controlled is physical memory, no matter how they're used. It's not like you can go buy more "socket" memory. At least from the looks of it, I'm afraid gpu controller is repeating the same mistakes.
We do have quite a pile of different memories and ranges, so I don't think we're doing the same mistake here. But it is maybe a bit too
I see. One thing which caught my eye was the system memory control. Shouldn't that be controlled by memcg? Is there something special about system memory used by gpus?
complicated, and exposes stuff that most users really don't care about.
Could be from me not knowing much about gpus but definitely looks too complex to me. I don't see how users would be able to allocate vram, system memory and GART with reasonable accuracy. memcg on cgroup2 deals with just a single number and that's already plenty challenging.
Thanks.
Hi Tejun,
Thanks for looking into this. I can definitely help where I can and I am sure other experts will jump in if I start misrepresenting the reality :) (as Daniel has already done.)
Regarding your points, my understanding is that there isn't really a TTM vs GEM situation anymore (there is an lwn.net article about that, but it is more than a decade old.) I believe GEM is the common interface at this point and more and more features are being refactored into it. For example, AMD's driver uses TTM internally but things are exposed via the GEM interface.
This GEM resource is actually the single-number resource you just referred to. A GEM buffer (the drm.buffer.* resources) can be backed by VRAM, system memory, or another type of memory. The finer-grained control is the drm.memory.* resources, which still need more discussion. (Some of the functionality in TTM is being refactored into the GEM level; I have seen some patches that make TTM a subclass of GEM.)
This RFC can be grouped into 3 areas, and they are fairly independent so they can be reviewed separately: high-level device memory control (buffer.*), fine-grain memory control and bandwidth (memory.*), and compute resources (lgpu.*). I think the memory.* resources are the most controversial part, but I think they are still needed.
Perhaps an analogy may help. For a system, we have CPUs and memory. And within memory, it can be backed by RAM or swap. For GPU, each device can have LGPUs and buffers. And within the buffers, it can be backed by VRAM, or system RAM or even swap.
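To make the analogy concrete, setting a per-device buffer limit could look something like the sketch below, following the major:minor addressing and memparse-style size conventions from the changelog; the exact file name and cgroup path are illustrative, not the final interface:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* Limit GEM buffer usage on DRM device 226:0 to 256 MiB for
         * this cgroup. */
        int fd = open("/sys/fs/cgroup/gpujob/drm.buffer.total.max", O_WRONLY);

        if (fd >= 0) {
                write(fd, "226:0 256m", 10);
                close(fd);
        }
        return 0;
}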
As for setting the right amount, I think that's where the profiling aspect of the *.stats comes in. And while one can't necessarily buy more VRAM, it is still a useful knob to adjust if the intention is to pack more work into a GPU device with predictable performance. This research on various GPU workloads may be of interest:
A Taxonomy of GPGPU Performance Scaling http://www.computermachines.org/joe/posters/iiswc2015_taxonomy.pdf http://www.computermachines.org/joe/publications/pdfs/iiswc2015_taxonomy.pdf
(summary: GPU workloads can be memory bound or compute bound, so it's possible to pack different workloads together to improve utilization.)
Regards, Kenny
On Tue, Sep 3, 2019 at 8:50 PM Tejun Heo tj@kernel.org wrote:
Hello, Daniel.
On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
- While breaking up and applying control to different types of internal objects may seem attractive to folks who work day in and day out with the subsystem, they aren't all that useful to users and the siloed controls are likely to make the whole mechanism a lot less useful. We had the same problem with cgroup1 memcg - putting control of different uses of memory under separate knobs. It made the whole thing pretty useless. e.g. if you constrain all knobs tight enough to control the overall usage, overall utilization suffers, but if you don't, you really don't have control over actual usage. For memcg, what has to be allocated and controlled is physical memory, no matter how they're used. It's not like you can go buy more "socket" memory. At least from the looks of it, I'm afraid gpu controller is repeating the same mistakes.
We do have quite a pile of different memories and ranges, so I don't think we're doing the same mistake here. But it is maybe a bit too
I see. One thing which caught my eye was the system memory control. Shouldn't that be controlled by memcg? Is there something special about system memory used by gpus?
I think system memory separate from vram makes sense. For one, vram is like 10x+ faster than system memory, so we definitely want to have good control on that. But maybe we only want one vram bucket overall for the entire system?
The trouble with system memory is that gpu tasks pin that memory to prep execution. There's two solutions: - i915 has a shrinker. Lots (and I really mean lots) of pain with direct reclaim recursion, which often means we can't free memory, and we're angering the oom killer a lot. Plus it introduces real bad latency spikes everywhere (gpu workloads are occasionally really slow, think "worse than pageout to spinning rust" to get memory freed). - ttm just has a global limit, set to 50% of system memory.
I do think a global system memory limit to tame the shrinker, without the ttm approach of possibly just wasting half your memory, could be useful.
complicated, and exposes stuff that most users really don't care about.
Could be from me not knowing much about gpus but definitely looks too complex to me. I don't see how users would be able to allocate vram, system memory and GART with reasonable accuracy. memcg on cgroup2 deals with just a single number and that's already plenty challenging.
Yeah, especially wrt GART and some of the other more specialized things I don't think there's any modern gpu where you can actually run out of that stuff. At least not before you run out of every other kind of memory (GART is just a remapping table to make system memory visible to the gpu).
I'm also not sure of the bw limits, given all the fun we have on the block io cgroups side. Aside from that the current bw limit only controls the bw the kernel uses, userspace can submit unlimited amounts of copying commands that use the same pcie links directly to the gpu, bypassing this cg knob. Also, controlling execution time for gpus is very tricky, since they work a lot more like a block io device or maybe a network controller with packet scheduling, than a cpu. -Daniel
Hello, Daniel.
On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
I think system memory separate from vram makes sense. For one, vram is like 10x+ faster than system memory, so we definitely want to have good control on that. But maybe we only want one vram bucket overall for the entire system?
The trouble with system memory is that gpu tasks pin that memory to prep execution. There's two solutions:
- i915 has a shrinker. Lots (and I really mean lots) of pain with
direct reclaim recursion, which often means we can't free memory, and we're angering the oom killer a lot. Plus it introduces real bad latency spikes everywhere (gpu workloads are occasionally really slow, think "worse than pageout to spinning rust" to get memory freed).
- ttm just has a global limit, set to 50% of system memory.
I do think a global system memory limit to tame the shrinker, without the ttm approach of possibly just wasting half your memory, could be useful.
Hmm... what'd be the fundamental difference from slab or socket memory which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?
I'm also not sure of the bw limits, given all the fun we have on the block io cgroups side. Aside from that the current bw limit only controls the bw the kernel uses, userspace can submit unlimited amounts of copying commands that use the same pcie links directly to the gpu, bypassing this cg knob. Also, controlling execution time for gpus is very tricky, since they work a lot more like a block io device or maybe a network controller with packet scheduling, than a cpu.
At the system level, it just gets folded into cpu time, which isn't perfect but is usually a good enough approximation of compute related dynamic resources. Can gpu do something similar or at least start with that?
Thanks.
On Fri, Sep 6, 2019 at 5:23 PM Tejun Heo tj@kernel.org wrote:
Hello, Daniel.
On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
I think system memory separate from vram makes sense. For one, vram is like 10x+ faster than system memory, so we definitely want to have good control on that. But maybe we only want one vram bucket overall for the entire system?
The trouble with system memory is that gpu tasks pin that memory to prep execution. There's two solutions:
- i915 has a shrinker. Lots (and I really mean lots) of pain with
direct reclaim recursion, which often means we can't free memory, and we're angering the oom killer a lot. Plus it introduces real bad latency spikes everywhere (gpu workloads are occasionally really slow, think "worse than pageout to spinning rust" to get memory freed).
- ttm just has a global limit, set to 50% of system memory.
I do think a global system memory limit to tame the shrinker, without the ttm approach of possibly just wasting half your memory, could be useful.
Hmm... what'd be the fundamental difference from slab or socket memory which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?
Sometimes, but that would be specific resources (kinda like vram), e.g. CMA regions used by a gpu. But probably not something you'll run in a datacenter and want cgroups for ...
I guess we could try to integrate with the memcg group controller. One trouble is that aside from i915 most gpu drivers do not really have a full shrinker, so not sure how that would all integrate.
The overall gpu memory controller would still be outside of memcg I think, since that would include swapped-out gpu objects, and stuff in special memory regions like vram.
I'm also not sure of the bw limits, given all the fun we have on the block io cgroups side. Aside from that the current bw limit only controls the bw the kernel uses, userspace can submit unlimited amounts of copying commands that use the same pcie links directly to the gpu, bypassing this cg knob. Also, controlling execution time for gpus is very tricky, since they work a lot more like a block io device or maybe a network controller with packet scheduling, than a cpu.
At the system level, it just gets folded into cpu time, which isn't perfect but is usually a good enough approximation of compute related dynamic resources. Can gpu do something similar or at least start with that?
So generally there's a pile of engines, often of different type (e.g. amd hw has an entire pile of copy engines), with some ill-defined sharing characteristics for some (often compute/render engines use the same shader cores underneath), kinda like hyperthreading. So at that detail it's all extremely hw specific, and probably too hard to control in a useful way for users. And I'm not sure we can really do a reasonable knob for overall gpu usage, e.g. if we include all the copy engines, but the workloads are only running on compute engines, then you might only get 10% overall utilization by engine-time. While the shaders (which is most of the chip area/power consumption) are actually at 100%. On top, with many userspace apis those engines are an internal implementation detail of a more abstract gpu device (e.g. opengl), but with others, this is all fully exposed (like vulkan).
Plus the kernel needs to use at least copy engines for vram management itself, and you really can't take that away. Although Kenny here has some proposal for a separate cgroup resource just for that.
I just think it's all a bit too ill-defined, and we might be better off nailing the memory side first and get some real world experience on this stuff. For context, there's not even a cross-driver standard for how priorities are handled, that's all driver-specific interfaces. -Daniel
Hello, Daniel.
On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
Hmm... what'd be the fundamental difference from slab or socket memory which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?
Sometimes, but that would be specific resources (kinda like vram), e.g. CMA regions used by a gpu. But probably not something you'll run in a datacenter and want cgroups for ...
I guess we could try to integrate with the memcg group controller. One trouble is that aside from i915 most gpu drivers do not really have a full shrinker, so not sure how that would all integrate.
So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).
The overall gpu memory controller would still be outside of memcg I think, since that would include swapped-out gpu objects, and stuff in special memory regions like vram.
Yeah, for resources which are on the GPU itself or hard limitations arising from it. In general, we wanna make cgroup controllers control something real and concrete as in physical resources.
At the system level, it just gets folded into cpu time, which isn't perfect but is usually a good enough approximation of compute related dynamic resources. Can gpu do something similar or at least start with that?
So generally there's a pile of engines, often of different type (e.g. amd hw has an entire pile of copy engines), with some ill-defined sharing characteristics for some (often compute/render engines use the same shader cores underneath), kinda like hyperthreading. So at that detail it's all extremely hw specific, and probably too hard to control in a useful way for users. And I'm not sure we can really do a reasonable knob for overall gpu usage, e.g. if we include all the copy engines, but the workloads are only running on compute engines, then you might only get 10% overall utilization by engine-time. While the shaders (which is most of the chip area/power consumption) are actually at 100%. On top, with many userspace apis those engines are an internal implementation detail of a more abstract gpu device (e.g. opengl), but with others, this is all fully exposed (like vulkan).
Plus the kernel needs to use at least copy engines for vram management itself, and you really can't take that away. Although Kenny here has some proposal for a separate cgroup resource just for that.
I just think it's all a bit too ill-defined, and we might be better off nailing the memory side first and get some real world experience on this stuff. For context, there's not even a cross-driver standard for how priorities are handled, that's all driver-specific interfaces.
I see. Yeah, figuring it out as this develops makes sense to me. One thing I wanna raise is that in general we don't want to expose device or implementation details in the cgroup interface. What we want expressed there are the intentions of the user. The more internal details we expose, the more we end up getting tied down to the specific implementation, which we should avoid especially given the early stage of development.
Thanks.
On Fri 06-09-19 08:45:39, Tejun Heo wrote:
Hello, Daniel.
On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
Hmm... what'd be the fundamental difference from slab or socket memory which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?
Sometimes, but that would be specific resources (kinda like vram), e.g. CMA regions used by a gpu. But probably not something you'll run in a datacenter and want cgroups for ...
I guess we could try to integrate with the memcg group controller. One trouble is that aside from i915 most gpu drivers do not really have a full shrinker, so not sure how that would all integrate.
So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).
Yeah, having a shrinker is preferred but the memory should better be reclaimable in some form. If not by any other means then at least bound to a user process context so that it goes away with a task being killed by the OOM killer. If that is not the case then we cannot really charge it because then the memcg controller is of no use. We can tolerate it to some degree if the amount of memory charged like that is negligible to the overall size. But from the discussion it seems that these buffers are really large.
Hello, Michal.
On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).
Yeah, having a shrinker is preferred but the memory should better be reclaimable in some form. If not by any other means then at least bound to a user process context so that it goes away with a task being killed by the OOM killer. If that is not the case then we cannot really charge it because then the memcg controller is of no use. We can tolerate it to some degree if the amount of memory charged like that is negligible to the overall size. But from the discussion it seems that these buffers are really large.
Yeah, oom kills should be able to reduce the usage; however, please note that tmpfs, among other things, can already escape this restriction and we can have cgroups which are over max and empty. It's obviously not ideal but the system doesn't melt down from it either.
Thanks.
On Tue 10-09-19 09:03:29, Tejun Heo wrote:
Hello, Michal.
On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).
Yeah, having a shrinker is preferred but the memory should better be reclaimable in some form. If not by any other means then at least bound to a user process context so that it goes away with a task being killed by the OOM killer. If that is not the case then we cannot really charge it because then the memcg controller is of no use. We can tolerate it to some degree if the amount of memory charged like that is negligible to the overall size. But from the discussion it seems that these buffers are really large.
Yeah, oom kills should be able to reduce the usage; however, please note that tmpfs, among other things, can already escape this restriction and we can have cgroups which are over max and empty. It's obviously not ideal but the system doesn't melt down from it either.
Right, and that is a reason why access to tmpfs should be restricted when containing a workload by memcg. My understanding of this particular feature is that memcg should be the primary containment method and that's why I brought this up.
On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
On Fri 06-09-19 08:45:39, Tejun Heo wrote:
Hello, Daniel.
On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
Hmm... what'd be the fundamental difference from slab or socket memory which are handled through memcg? Does system memory used by GPUs have further global restrictions in addition to the amount of physical memory used?
Sometimes, but that would be specific resources (kinda like vram), e.g. CMA regions used by a gpu. But probably not something you'll run in a datacenter and want cgroups for ...
I guess we could try to integrate with the memcg group controller. One trouble is that aside from i915 most gpu drivers do not really have a full shrinker, so not sure how that would all integrate.
So, while it'd be great to have shrinkers in the longer term, it's not a strict requirement to be accounted in memcg. It already accounts a lot of memory which isn't reclaimable (a lot of slabs and socket buffer).
Yeah, having a shrinker is preferred but the memory should better be reclaimable in some form. If not by any other means then at least bound to a user process context so that it goes away with a task being killed by the OOM killer. If that is not the case then we cannot really charge it because then the memcg controller is of no use. We can tolerate it to some degree if the amount of memory charged like that is negligible to the overall size. But from the discussion it seems that these buffers are really large.
I think we can just make "must have a shrinker" a requirement for the system memory cgroup thing for gpu buffers. There's mild locking inversion fun to be had when typing one, but I think the problem is well-understood enough that this isn't a huge hurdle to climb over. And it should give admins an easier-to-manage system, since it works more like what they know already. -Daniel
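A minimal sketch of what such a requirement would look like with the shrinker API of this era; gpu_buffer_lru_count() and gpu_buffer_lru_evict() are hypothetical driver helpers:

#include <linux/shrinker.h>

static unsigned long gpu_buf_count(struct shrinker *s,
				   struct shrink_control *sc)
{
	/* how many buffer objects could be freed right now */
	return gpu_buffer_lru_count();
}

static unsigned long gpu_buf_scan(struct shrinker *s,
				  struct shrink_control *sc)
{
	/* unpin and free up to nr_to_scan objects; the "locking
	 * inversion fun" is keeping this path clear of the locks the
	 * allocation path holds when it enters direct reclaim */
	return gpu_buffer_lru_evict(sc->nr_to_scan);
}

static struct shrinker gpu_buf_shrinker = {
	.count_objects = gpu_buf_count,
	.scan_objects  = gpu_buf_scan,
	.seeks         = DEFAULT_SEEKS,
};

/* registered once at driver init: register_shrinker(&gpu_buf_shrinker); */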
On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
This is a follow up to the RFC I made previously to introduce a cgroup controller for the GPU/DRM subsystem [v1,v2,v3]. The goal is to be able to provide resource management to GPU resources using things like container.
With this RFC v4, I am hoping to have some consensus on a merge plan. I believe the GEM related resources (drm.buffer.*) introduced in previous RFC and, hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are uncontroversial and ready to move out of RFC and into a more formal review. I will continue to work on the memory backend resources (drm.memory.*).
The cover letter from v1 is copied below for reference.
So looking at all this doesn't seem to have changed much, and the old discussion didn't really conclude anywhere (aside from some details).
One more open thought that crossed my mind, having read a ton of ttm again recently: How does this all interact with ttm global limits? I'd say the ttm global limits are the ur-cgroup we have in drm, and not looking at that seems kinda bad. -Daniel
On 03.09.19 10:02, Daniel Vetter wrote:
One more open question that crossed my mind, having read a ton of ttm again recently: how does this all interact with the ttm global limits? I'd say the ttm global limits are the ur-cgroup we have in drm, and not looking at that seems kinda bad.
At least my hope was to completely replace ttm globals with those limitations here when it is ready.
Christian.
On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian <Christian.Koenig@amd.com> wrote:
On 03.09.19 10:02, Daniel Vetter wrote:
One more open question that crossed my mind, having read a ton of ttm again recently: how does this all interact with the ttm global limits? I'd say the ttm global limits are the ur-cgroup we have in drm, and not looking at that seems kinda bad.
At least my hope was to completely replace ttm globals with those limitations here when it is ready.
You need more than that: at least some kind of shrinker to cut down BOs placed in system memory when we're under memory pressure. Which drags in a pretty epic amount of locking fun (see i915's shrinker, where we attempt exactly that). Probably another good idea to share at least some concepts, maybe even code. -Daniel
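The locking fun mentioned here is mostly a lock-inversion problem: the shrinker runs in reclaim context, and reclaim can be entered from an allocation made while the driver's own buffer lock is already held. The usual workaround (which i915 does in a much hairier form) is to trylock and bail. A hypothetical sketch, with all foo_* names invented for illustration:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/shrinker.h>

struct foo_bo {
        struct list_head lru;
        /* backing pages, pin count, etc. */
};

struct foo_drv {
        struct shrinker shrinker;
        struct mutex lock;              /* protects @purgeable */
        struct list_head purgeable;     /* unpinned BOs, LRU order */
};

static unsigned long foo_bo_purge(struct foo_bo *bo);  /* drops pages */

static unsigned long foo_shrink_scan(struct shrinker *s,
                                     struct shrink_control *sc)
{
        struct foo_drv *drv = container_of(s, struct foo_drv, shrinker);
        struct foo_bo *bo, *tmp;
        unsigned long freed = 0;

        /* Reclaim may already be running under an allocation that holds
         * drv->lock, so blocking here can deadlock: trylock and give up. */
        if (!mutex_trylock(&drv->lock))
                return SHRINK_STOP;

        list_for_each_entry_safe(bo, tmp, &drv->purgeable, lru) {
                if (freed >= sc->nr_to_scan)
                        break;
                freed += foo_bo_purge(bo);
        }

        mutex_unlock(&drv->lock);
        return freed;
}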
On Tue, Sep 3, 2019 at 5:20 AM Daniel Vetter <daniel@ffwll.ch> wrote:
You need more than that: at least some kind of shrinker to cut down BOs placed in system memory when we're under memory pressure. Which drags in a pretty epic amount of locking fun (see i915's shrinker, where we attempt exactly that). Probably another good idea to share at least some concepts, maybe even code.
I am still looking into your shrinker suggestion, so the memory.* resources are untouched from RFC v3. The main change for the buffer.* resources is the removal of the buffer sharing restriction, as you suggested, and additional documentation of that behaviour. (I may have neglected to mention it in the cover.) The other key part of RFC v4 is the "logical GPU/lgpu" concept. I am hoping to get it out there early for feedback while I continue to work on the memory.* parts.
Kenny
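As background on what charging a drm.buffer.* allocation against the cgroup hierarchy involves, the usual cgroup pattern is a try_charge-style helper: check the limit at every level of the hierarchy, commit the charge as you go, and unwind on failure. A hypothetical sketch with invented names (drmcg_parent, dev_res, the limit fields), not the RFC's actual code:

/* Per-cgroup, per-device accounting state (illustrative). */
struct drmcg_dev_res {
        u64 buf_total;          /* bytes currently charged */
        u64 buf_limit;          /* e.g. a total-buffer max for the device */
};

struct drmcg {
        struct drmcg_dev_res *dev_res[64];      /* hypothetical bound */
};

static struct drmcg *drmcg_parent(struct drmcg *cg);    /* css walk */

/* Caller is assumed to hold the per-device accounting lock. */
static int drmcg_try_charge(struct drmcg *cg, int devidx, u64 size)
{
        struct drmcg *level, *failed = NULL;

        for (level = cg; level; level = drmcg_parent(level)) {
                struct drmcg_dev_res *res = level->dev_res[devidx];

                if (res->buf_total + size > res->buf_limit) {
                        failed = level;
                        break;
                }
                res->buf_total += size;
        }

        if (!failed)
                return 0;

        /* Unwind the partial charges below the level that failed. */
        for (level = cg; level != failed; level = drmcg_parent(level))
                level->dev_res[devidx]->buf_total -= size;

        return -ENOMEM;
}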