Adding gpu folks.
On Tue, Nov 03, 2020 at 03:28:05PM -0800, Alexei Starovoitov wrote:
On Tue, Nov 03, 2020 at 05:57:47PM -0500, Kenny Ho wrote:
On Tue, Nov 3, 2020 at 4:04 PM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Tue, Nov 03, 2020 at 02:19:22PM -0500, Kenny Ho wrote:
On Tue, Nov 3, 2020 at 12:43 AM Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
On Mon, Nov 2, 2020 at 9:39 PM Kenny Ho y2kenny@gmail.com wrote:
Sounds like either bpf_lsm needs to be made aware of cgv2 (which would be a great thing to have regardless) or cgroup-bpf needs a drm/gpu specific hook. I think a generic ioctl hook is too broad for this use case. I suspect drm/gpu internal state would be easier to access inside a bpf program if the hook is next to gpu/drm. At the ioctl level there is 'file'. It's probably too abstract for the things you want to do. Like how can VRAM/shader/etc be accessed through file? Probably possible through a bunch of lookups and dereferences, but if the hook is custom to the GPU that info is likely readily available. Then such a cgroup-bpf check would be suitable in execution paths where an ioctl-based hook would be too slow.
Just to clarify, when you say drm specific hook, did you mean just a unique attach_type or a unique prog_type+attach_type combination? (I am still a bit fuzzy on when a new prog type is needed vs a new attach type. I think prog type is associated with a unique type of context that the bpf prog will get but I could be missing some nuances.)
When I was thinking of doing an ioctl wide hook, the file would be the device file and the thinking was to have a helper function provided by device drivers to further disambiguate. For our (AMD's) driver, we have a bunch of ioctls for set/get/create/destroy (https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdkfd/kf...) so the bpf prog can make the decision after the disambiguation. For example, we have an ioctl called "kfd_ioctl_set_cu_mask." You can
Thanks for the pointer. That's one monster ioctl. So much copy_from_user. A BPF prog would need to be sleepable to be able to examine the args in such depth. After a quick glance at the code I would put a new hook into kfd_ioctl() right before retcode = func(filep, process, kdata); At this point kdata is already copied from user space and usize, which is cmd specific, is known. So the bpf prog wouldn't need to copy that data again. That will save one copy. To drill into details of kfd_ioctl_set_cu_mask() the prog would need to be sleepable to do a second copy_from_user of cu_mask. At least it's not that big. Yes, the attachment point will be amd driver specific, but the program doesn't need to be. It can be a generic tracing prog that is augmented to use BTF. Something like a writeable tracepoint with BTF support would do. So on the bpf side there will be a minimal amount of changes. And in the driver you'll add one or a few writeable tracepoints, and the result of the tracepoint will gate the retcode = func(filep, process, kdata); call in kfd_ioctl(). The writeable tracepoint would need to be cgroup-bpf based. So that's the only tricky part. BPF infra doesn't have a cgroup+tracepoint scheme. It's probably going to be useful in other cases like this. See trace_nbd_send_request.
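Concretely, the gating described here might look roughly like the sketch below. Everything in it is an assumption: the tracepoint name, the context struct and the cgroup-aware attachment don't exist; the pattern is modeled on the DEFINE_EVENT_WRITABLE / BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE machinery behind trace_nbd_send_request.

/* Hypothetical sketch, not actual kfd code. A writeable tracepoint is
 * fired after kdata has been copied from user space; an attached BPF
 * program may clear ->allow to deny the ioctl. */
struct kfd_ioctl_gate {
	unsigned int cmd;	/* ioctl nr, already decoded */
	void *kdata;		/* args, already copied from user space */
	int allow;		/* prog may clear this to deny */
};

/* inside kfd_ioctl(), replacing the plain dispatch: */
	struct kfd_ioctl_gate gate = {
		.cmd = cmd, .kdata = kdata, .allow = 1,
	};

	trace_kfd_ioctl_gate(&gate);	/* assumed writeable tracepoint */
	if (gate.allow)
		retcode = func(filep, process, kdata);
	else
		retcode = -EACCES;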
Yeah I think this proposal doesn't work:
- inspecting ioctl arguments that need copying outside of the driver/subsystem doing that copying is fundamentally racy
- there's been a pile of cgroups proposals to manage gpus at the drm subsystem level, some by Kenny, and frankly this at least looks a bit like a quick hack to sidestep the consensus process for that.
So once we push this into drivers it's not going to be a bpf hook anymore I think.
Cheers, Daniel
On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter daniel@ffwll.ch wrote:
- there's been a pile of cgroups proposals to manage gpus at the drm subsystem level, some by Kenny, and frankly this at least looks a bit like a quick hack to sidestep the consensus process for that.
No Daniel, this is a quick *draft* to get a conversation going. Bpf was actually a path suggested by Tejun back in 2018, so I think you are mischaracterizing this quite a bit.
"2018-11-20 Kenny Ho: To put the questions in more concrete terms, let's say a user wants to expose certain parts of a gpu to a particular cgroup similar to the way selective cpu cores are exposed to a cgroup via cpuset, how should we go about enabling such functionality?
2018-11-20 Tejun Heo: Do what the intel driver or bpf is doing? It's not difficult to hook into cgroup for identification purposes."
Kenny
On Mon, Feb 01, 2021 at 11:51:07AM -0500, Kenny Ho wrote:
On Mon, Feb 1, 2021 at 9:49 AM Daniel Vetter daniel@ffwll.ch wrote:
- there's been a pile of cgroups proposals to manage gpus at the drm subsystem level, some by Kenny, and frankly this at least looks a bit like a quick hack to sidestep the consensus process for that.
No Daniel, this is a quick *draft* to get a conversation going. Bpf was actually a path suggested by Tejun back in 2018, so I think you are mischaracterizing this quite a bit.
"2018-11-20 Kenny Ho: To put the questions in more concrete terms, let's say a user wants to expose certain parts of a gpu to a particular cgroup similar to the way selective cpu cores are exposed to a cgroup via cpuset, how should we go about enabling such functionality?
2018-11-20 Tejun Heo: Do what the intel driver or bpf is doing? It's not difficult to hook into cgroup for identification purposes."
Yeah, but if you go full amd specific for this, you might as well have a specific BPF hook which is called in amdgpu/kfd and returns you the CU mask for a given cgroup (and figures that out however it pleases).
Not a generic framework which lets you build pretty much any possible cgroups controller for anything else using BPF. Trying to filter anything at the generic ioctl just doesn't feel like a great idea that's long-term maintainable. E.g. what happens if there's new uapi for command submission/context creation and now your bpf filter isn't catching all access anymore? If it's an explicit hook that explicitly computes the CU mask, then we can add more checks as needed. With ioctl that's impossible.
Plus I'm also not sure whether that's really a good idea still, since if cloud companies have to build their own bespoke container stuff for every gpu vendor, that's quite a bad platform we're building. And "I'd like to make sure my gpu is used fairly among multiple tenants" really isn't a use-case that's specific to amd.
If this would be something very hw specific like cache assignment and quality of service stuff or things like that, then vendor specific imo makes sense. But for CU masks essentially we're cutting the compute resources up in some way, and I kinda expect everyone with a gpu who cares about isolating workloads with cgroups wants to do that. -Daniel
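For illustration, the explicit hook described above could be as small as the sketch below. All names are hypothetical, and it assumes some way to run a cgroup-attached BPF program from the driver, which is exactly the part that doesn't exist yet. Because the hook computes the mask explicitly at queue creation, any new submission uapi automatically goes through it.

/* Hypothetical sketch: when kfd creates a compute queue, ask the BPF
 * program attached to the caller's cgroup for the CU mask, instead of
 * trying to reverse-engineer it from ioctl filtering. */
struct kfd_cu_mask_ctx {
	u64 cgrp_id;	/* cgroup of the task creating the queue */
	u32 num_cu;	/* CUs physically present on this device */
	u32 cu_mask[4];	/* filled in by the BPF program */
};

static void kfd_query_cgroup_cu_mask(u32 num_cu, u32 *cu_mask)
{
	struct kfd_cu_mask_ctx ctx = {
		.cgrp_id = cgroup_id(task_dfl_cgroup(current)),
		.num_cu  = num_cu,
	};

	/* assumed dispatcher; default to "all CUs" if no prog attached */
	if (run_cu_mask_progs(&ctx) < 0)
		memset(ctx.cu_mask, 0xff, sizeof(ctx.cu_mask));

	memcpy(cu_mask, ctx.cu_mask, sizeof(ctx.cu_mask));
}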
Daniel,
I will have to get back to you later on the details of this because my head is currently context switched to some infrastructure and Kubernetes/golang work, so I am having a hard time digesting what you are saying. I am new to the bpf stuff so this is about my own learning as well as a conversation starter. The high level goal here is to have a path for flexibility via a bpf program. Not just GPU or DRM or CU mask, but devices making decisions via an operator-written bpf-prog attached to a cgroup. More inline.
On Wed, Feb 3, 2021 at 6:09 AM Daniel Vetter daniel@ffwll.ch wrote:
Yeah, but if you go full amd specific for this, you might as well have a specific BPF hook which is called in amdgpu/kfd and returns you the CU mask for a given cgroup (and figures that out however it pleases).
Not a generic framework which lets you build pretty much any possible cgroups controller for anything else using BPF. Trying to filter anything at the generic ioctl just doesn't feel like a great idea that's long-term maintainable. E.g. what happens if there's new uapi for command submission/context creation and now your bpf filter isn't catching all access anymore? If it's an explicit hook that explicitly computes the CU mask, then we can add more checks as needed. With ioctl that's impossible.
Plus I'm also not sure whether that's really a good idea still, since if cloud companies have to build their own bespoke container stuff for every gpu vendor, that's quite a bad platform we're building. And "I'd like to make sure my gpu is used fairly among multiple tenants" really isn't a use-case that's specific to amd.
I don't understand what you are saying about containers here, since bpf-progs are not the same as containers, nor are they deployed from inside a container (as far as I know; I am actually not sure how bpf-cgroup works with higher level cloud orchestration, since folks like Docker only migrated to cgroup v2 very recently... I don't think you can specify a bpf-prog to load as part of a k8s pod definition.) That said, the bit I understand ("not sure whether that's really a good idea....cloud companies have to build their own bespoke container stuff for every gpu vendor...") is in fact the current status quo. If you look into some of the popular ML/AI-oriented containers/apps, you will likely see things are mostly hardcoded to CUDA. Since I work for AMD, I wouldn't say that's a good thing, but this is just the reality. For Kubernetes at least (where my head is currently), the official mechanisms are Device Plugins (I am the author of the one for AMD, but there are a few from Intel too; you can confirm with your colleagues) and Node Features/Labels. Kubernetes schedules the pods/containers launched by users onto nodes/servers by matching the resources/labels requested in the pod specification against the resources/labels of the nodes.
If this would be something very hw specific like cache assignment and quality of service stuff or things like that, then vendor specific imo makes sense. But for CU masks essentially we're cutting the compute resources up in some way, and I kinda expect everyone with a gpu who cares about isolating workloads with cgroups wants to do that.
Right, but isolating workloads is quality of service stuff and *how* compute resources are cut up are vendor specific.
Anyway, as I said at the beginning of this reply, this is about flexibility in support of the diversity of devices and architectures. CU mask is simply a concrete example of hw diversity that a bpf-program can encapsulate. I can see this framework (a custom program making decisions in a specific cgroup and device context) used for other things as well. It may even be useful within a vendor to handle the diversity between SKUs.
Kenny
Hi Kenny
On Wed, Feb 3, 2021 at 8:01 PM Kenny Ho y2kenny@gmail.com wrote:
Daniel,
I will have to get back to you later on the details of this because my head is currently context switched to some infrastructure and Kubernetes/golang work, so I am having a hard time digesting what you are saying. I am new to the bpf stuff so this is about my own learning as well as a conversation starter. The high level goal here is to have a path for flexibility via a bpf program. Not just GPU or DRM or CU mask, but devices making decisions via an operator-written bpf-prog attached to a cgroup. More inline.
If you have some pointers on this, I'm happy to do some reading and learning too.
I don't understand what you are saying about containers here, since bpf-progs are not the same as containers, nor are they deployed from inside a container. [...] That said, the bit I understand ("not sure whether that's really a good idea....cloud companies have to build their own bespoke container stuff for every gpu vendor...") is in fact the current status quo. If you look into some of the popular ML/AI-oriented containers/apps, you will likely see things are mostly hardcoded to CUDA. Since I work for AMD, I wouldn't say that's a good thing, but this is just the reality. For Kubernetes at least (where my head is currently), the official mechanisms are Device Plugins and Node Features/Labels.
Sure the current gpu compute ecosystem is pretty badly fragmented, forcing higher levels (like containers, but also hpc runtimes, or anything else) to paper over that with more plugins and abstraction layers.
That's not really a good reason to continue the fragmentation when we upstream these features.
If this would be something very hw specific like cache assignment and quality of service stuff or things like that, then vendor specific imo makes sense. But for CU masks essentially we're cutting the compute resources up in some way, and I kinda expect everyone with a gpu who cares about isolating workloads with cgroups wants to do that.
Right, but isolating workloads is quality of service stuff and *how* compute resources are cut up are vendor specific.
Anyway, as I said at the beginning of this reply, this is about flexibility in support of the diversity of devices and architectures. CU mask is simply a concrete example of hw diversity that a bpf-program can encapsulate. I can see this framework (a custom program making decisions in a specific cgroup and device context) used for other things as well. It may even be useful within a vendor to handle the diversity between SKUs.
So I agree that on one side CU mask can be used for low-level quality of service guarantees (like the CLOS cache stuff on intel cpus as an example), and that's going to be rather hw specific no matter what.
But my understanding of AMD's plans here is that CU mask is the only thing you'll have to partition gpu usage in a multi-tenant environment - whether that's cloud, or containing apps to make sure the compositor can still draw the desktop (except for fullscreen ofc), doesn't really matter I think. And since there's clearly a need for more general (but necessarily less well-defined) gpu usage controlling and accounting, I don't think exposing just the CU mask is a good idea. That just perpetuates the current fragmented landscape, and I really don't see why it's not possible to have a generic "I want 50% of my gpu available for these 2 containers each" solution.
Of course on top of that, having a bpf hook in amd to do the fine-grained QOS assignment for e.g. embedded applications which are very carefully tuned should still be possible. But that's on top, not as the exclusive thing available.
Cheers, Daniel
Sorry for the late reply (I have been working on other stuff.)
On Fri, Feb 5, 2021 at 8:49 AM Daniel Vetter daniel@ffwll.ch wrote:
So I agree that on one side CU mask can be used for low-level quality of service guarantees (like the CLOS cache stuff on intel cpus as an example), and that's going to be rather hw specific no matter what.
But my understanding of AMD's plans here is that CU mask is the only thing you'll have to partition gpu usage in a multi-tenant environment - whether that's cloud, or containing apps to make sure the compositor can still draw the desktop (except for fullscreen ofc), doesn't really matter I think.
This is not correct. Even in the original cgroup proposal, it supported both mask and count as ways to define unit(s) of a sub-device. For AMD, we already have SRIOV, which supports GPU partitioning in a time-sliced-of-a-whole-GPU fashion.
Kenny
On Thu, May 06, 2021 at 10:06:32PM -0400, Kenny Ho wrote:
This is not correct. Even in the original cgroup proposal, it supported both mask and count as ways to define unit(s) of a sub-device. For AMD, we already have SRIOV, which supports GPU partitioning in a time-sliced-of-a-whole-GPU fashion.
Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu cgroups controller to get started with, since it's much closer to other cgroups that control bandwidth of some kind. Whether it's i/o bandwidth or compute bandwidth is kinda a wash.
CU mask feels a lot more like an isolation/guaranteed forward progress kind of thing, and I suspect that's always going to be a lot more gpu hw specific than anything we can reasonably put into a general cgroups controller.
Also for the time slice cgroups thing, can you pls give me pointers to these old patches that had it, and how it's done? I very obviously missed that part.
Thanks, Daniel
On Fri, May 7, 2021 at 4:59 AM Daniel Vetter daniel@ffwll.ch wrote:
Hm I missed that. I feel like time-sliced-of-a-whole gpu is the easier gpu cgroups controller to get started with, since it's much closer to other cgroups that control bandwidth of some kind. Whether it's i/o bandwidth or compute bandwidth is kinda a wash.
sriov/time-sliced-of-a-whole gpu does not really need a cgroup interface since each slice appears as a standalone device. This is already in production (not using cgroup) with users. The cgroup proposal has always been parallel to that in many senses: 1) spatial partitioning as an independent but equally valid use case as time sharing, 2) sub-device resource control as opposed to full device control motivated by the workload characterization paper. It was never about time vs space in terms of use cases but about having a new API for users to be able to do spatial subdevice partitioning.
CU mask feels a lot more like an isolation/guaranteed forward progress kind of thing, and I suspect that's always going to be a lot more gpu hw specific than anything we can reasonably put into a general cgroups controller.
The first half is correct but I disagree with the conclusion. The analogy I would use is multi-core CPU. The capability of individual CPU cores, core count and core arrangement may be hw specific but there are general interfaces to support selection of these cores. CU mask may be hw specific but spatial partitioning as an idea is not. Most gpu vendors have the concept of sub-device compute units (EU, SE, etc.); OpenCL has the concept of subdevice in the language. I don't see any obstacle for vendors to implement spatial partitioning just like many CPU vendors support the idea of multi-core.
Also for the time slice cgroups thing, can you pls give me pointers to these old patches that had it, and how it's done? I very obviously missed that part.
I think you misunderstood what I wrote earlier. The original proposal was about spatial partitioning of subdevice resources not time sharing using cgroup (since time sharing is already supported elsewhere.)
Kenny
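The generic CPU-side interface this analogy points at is plain sched_setaffinity(2) (or cpuset): a "pick a subset of compute units" API over hw-specific topology. Nothing below is hypothetical; a cgroup CU mask would be the spatial-partitioning analogue of this on a gpu.

/* Restrict the calling task to cores 0-1, regardless of what those
 * cores actually are on this particular chip. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(1, &set);

	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pinned to cores 0-1\n");
	return 0;
}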
On Fri, May 07, 2021 at 11:33:46AM -0400, Kenny Ho wrote:
I think you misunderstood what I wrote earlier. The original proposal was about spatial partitioning of subdevice resources not time sharing using cgroup (since time sharing is already supported elsewhere.)
Well SRIOV time-sharing is for virtualization. cgroups is for containerization, which is just virtualization but with less overhead and more security bugs.
More or less.
So either I get things still wrong, or we'll get time-sharing for virtualization, and partitioning of CU for containerization. That doesn't make that much sense to me.
Since time-sharing is the first thing that's done for virtualization I think it's probably also the most reasonable to start with for containers. -Daniel
On Fri, May 7, 2021 at 12:13 PM Daniel Vetter daniel@ffwll.ch wrote:
So either I get things still wrong, or we'll get time-sharing for virtualization, and partitioning of CU for containerization. That doesn't make that much sense to me.
You could still potentially do SR-IOV for containerization. You'd just pass one of the PCI VFs (virtual functions) to the container and you'd automatically get the time slice. I don't see why cgroups would be a factor there.
Alex
On Fri, May 07, 2021 at 12:19:13PM -0400, Alex Deucher wrote:
You could still potentially do SR-IOV for containerization. You'd just pass one of the PCI VFs (virtual functions) to the container and you'd automatically get the time slice. I don't see why cgroups would be a factor there.
Standard interface to manage that time-slicing. I guess for SRIOV it's all vendor sauce (intel as guilty as anyone else from what I can see), but for cgroups that feels like it's falling a bit short of what we should aim for.
But dunno, maybe I'm just dreaming too much :-) -Daniel
On Fri, May 7, 2021 at 12:26 PM Daniel Vetter daniel@ffwll.ch wrote:
Standard interface to manage that time-slicing. I guess for SRIOV it's all vendor sauce (intel as guilty as anyone else from what I can see), but for cgroups that feels like it's falling a bit short of what we should aim for.
But dunno, maybe I'm just dreaming too much :-)
I don't disagree, I'm just not sure how it would apply to SR-IOV. Once you've created the virtual functions, you've already created the partitioning (regardless of whether it's spatial or temporal) so where would cgroups come into play?
Alex
On Fri, May 7, 2021 at 12:31 PM Alex Deucher alexdeucher@gmail.com wrote:
I don't disagree, I'm just not sure how it would apply to SR-IOV. Once you've created the virtual functions, you've already created the partitioning (regardless of whether it's spatial or temporal) so where would cgroups come into play?
For some background, the SR-IOV virtual functions show up like actual PCI endpoints on the bus, so SR-IOV is sort of like cgroups implemented in hardware. When you enable SR-IOV, the endpoints that are created are the partitions.
Alex
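For reference, creating those partitions is a single sysfs write against the physical function, with no cgroup involvement at all. sriov_numvfs is the standard PCI sysfs attribute; the BDF below is made up.

/* Enable 4 VFs on a (made-up) SR-IOV capable device; each VF then
 * enumerates as its own PCI endpoint that can be handed to a VM or
 * container. */
#include <stdio.h>

int main(void)
{
	const char *p = "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";
	FILE *f = fopen(p, "w");

	if (!f) {
		perror(p);
		return 1;
	}
	fprintf(f, "4\n");
	fclose(f);
	return 0;
}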
On Fri, May 07, 2021 at 12:50:07PM -0400, Alex Deucher wrote:
For some background, the SR-IOV virtual functions show up like actual PCI endpoints on the bus, so SR-IOV is sort of like cgroups implemented in hardware. When you enable SR-IOV, the endpoints that are created are the partitions.
Yeah I think we're massively agreeing right now :-)
SRIOV is kinda by design vendor specific. You set up the VF endpoint, it shows up, it's all hw+fw magic. Nothing for cgroups to manage here at all.
All I meant is that for the container/cgroups world, starting out with time-sharing feels like the best fit, not least because your SRIOV designers also seem to think that's the best first cut for cloud-y computing. Whether it's virtualized or containerized is a distinction that's getting ever more blurry, with virtualization becoming a lot more dynamic and container runtimes also possibly using hw virtualization underneath. -Daniel
On Fri, May 7, 2021 at 12:54 PM Daniel Vetter daniel@ffwll.ch wrote:
SRIOV is kinda by design vendor specific. You set up the VF endpoint, it shows up, it's all hw+fw magic. Nothing for cgroups to manage here at all.
Right, so in theory you just use the device cgroup with the VF endpoints.
All I meant is that for the container/cgroups world, starting out with time-sharing feels like the best fit, not least because your SRIOV designers also seem to think that's the best first cut for cloud-y computing. Whether it's virtualized or containerized is a distinction that's getting ever more blurry, with virtualization becoming a lot more dynamic and container runtimes also possibly using hw virtualization underneath.
I disagree. By the same logic, the existence of the CU mask would imply that it is the preferred way for sub-device control per process.
Kenny
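On cgroup v2 the device-cgroup check mentioned above is itself a BPF program (BPF_PROG_TYPE_CGROUP_DEVICE), so the sketch below uses real machinery; only the major/minor numbers of the VF's render node are example values.

/* Admit only one VF's DRM render node (char major 226, minor 128 as an
 * example) to tasks in the cgroup this program is attached to. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/dev")
int allow_vf_only(struct bpf_cgroup_dev_ctx *ctx)
{
	/* the low 16 bits of access_type carry the device type */
	if ((ctx->access_type & 0xffff) == BPF_DEVCG_DEV_CHAR &&
	    ctx->major == 226 && ctx->minor == 128)
		return 1;	/* allow */
	return 0;		/* deny all other device nodes */
}

char _license[] SEC("license") = "GPL";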
Hello,
On Fri, May 07, 2021 at 06:54:13PM +0200, Daniel Vetter wrote:
All I meant is that for the container/cgroups world, starting out with time-sharing feels like the best fit, not least because your SRIOV designers also seem to think that's the best first cut for cloud-y computing. Whether it's virtualized or containerized is a distinction that's getting ever more blurry, with virtualization becoming a lot more dynamic and container runtimes also possibly using hw virtualization underneath.
FWIW, I'm completely in the same boat. There are two fundamental issues with hardware-mask-based control - control granularity and work conservation. Combined, they make it a significantly more difficult interface to use, one which requires hardware-specific tuning rather than simply being able to say "I wanna prioritize this job twice over that one".
My knowledge of gpus is really limited, but my understanding is also that gpu cores and threads aren't as homogeneous as their CPU counterparts across vendors, product generations and possibly even within a single chip, which makes the problem even worse.
Given that GPUs are time-shareable to begin with, the most universal solution seems pretty clear.
Thanks.
On Fri, May 7, 2021 at 3:33 PM Tejun Heo tj@kernel.org wrote:
Given that GPUs are time-shareable to begin with, the most universal solution seems pretty clear.
The problem is that temporal partitioning on GPUs is much harder to enforce unless you have a special case like SR-IOV. Spatial partitioning, on AMD GPUs at least, is widely available and easily enforced. What is the point of implementing temporal-style cgroups if no one can enforce it effectively?
Alex
Hello,
On Fri, May 07, 2021 at 03:55:39PM -0400, Alex Deucher wrote:
The problem is that temporal partitioning on GPUs is much harder to enforce unless you have a special case like SR-IOV. Spatial partitioning, on AMD GPUs at least, is widely available and easily enforced. What is the point of implementing temporal-style cgroups if no one can enforce it effectively?
So, if generic fine-grained partitioning can't be implemented, the right thing to do is to stop pushing for a full-blown cgroup interface for it. The hardware simply isn't capable of being managed in a way which allows generic fine-grained hierarchical scheduling, and there's no point in bloating the interface with half-baked hardware-dependent features.
This isn't to say that there's no way to support them, but what has been proposed is way too generic and ambitious in terms of interface while being poorly developed on the internal abstraction and mechanism front. If the hardware can't do generic, either implement the barest minimum interface (e.g. be a part of the misc controller) or go driver-specific - the feature is hardware specific anyway. I've repeated this multiple times in these discussions now, but it'd be really helpful to try to minimize the interface while concentrating more on internal abstractions and actual control mechanisms.
Thanks.
On Fri, May 7, 2021 at 4:59 PM Tejun Heo tj@kernel.org wrote:
So, if generic fine-grained partitioning can't be implemented, the right thing to do is to stop pushing for a full-blown cgroup interface for it. The hardware simply isn't capable of being managed in a way which allows generic fine-grained hierarchical scheduling, and there's no point in bloating the interface with half-baked hardware-dependent features.
This isn't to say that there's no way to support them, but what has been proposed is way too generic and ambitious in terms of interface while being poorly developed on the internal abstraction and mechanism front. If the hardware can't do generic, either implement the barest minimum interface (e.g. be a part of the misc controller) or go driver-specific - the feature is hardware specific anyway. I've repeated this multiple times in these discussions now, but it'd be really helpful to try to minimize the interface while concentrating more on internal abstractions and actual control mechanisms.
Maybe we are speaking past each other. I'm not following. We got here because a device specific cgroup didn't make sense. With my Linux user hat on, that makes sense. I don't want to write code to a bunch of device specific interfaces if I can avoid it. But as for temporal vs spatial partitioning of the GPU, the argument seems to be a sort of hand-wavy one that both spatial and temporal partitioning make sense on CPUs, but only temporal partitioning makes sense on GPUs. I'm trying to understand that assertion. There are some GPUs that can more easily be temporally partitioned and some that can be more easily spatially partitioned. It doesn't seem any different than CPUs.
Alex
Hello,
On Fri, May 07, 2021 at 06:30:56PM -0400, Alex Deucher wrote:
Maybe we are speaking past each other. I'm not following. We got here because a device specific cgroup didn't make sense. With my Linux user hat on, that makes sense. I don't want to write code to a bunch of device specific interfaces if I can avoid it. But as for temporal vs spatial partitioning of the GPU, the argument seems to be a sort of hand-wavy one that both spatial and temporal partitioning make sense on CPUs, but only temporal partitioning makes sense on GPUs. I'm trying to understand that assertion. There are some GPUs
Spatial partitioning as implemented in cpuset isn't a desirable model. It's there partly because it has historically been there. It doesn't really require dynamic hierarchical distribution of anything and is more of a way to batch-update per-task configuration, which is how it's actually implemented. It's broken too in that it interferes with per-task affinity settings. So, not exactly a good example to follow. In addition, this sort of partitioning requires more hardware knowledge, and GPUs are worse than CPUs in that their hardware differs more.
Features like this are trivial to implement from userland side by making per-process settings inheritable and restricting who can update the settings.
that can more easily be temporally partitioned and some that can be more easily spatially partitioned. It doesn't seem any different than CPUs.
Right, it doesn't really matter how the resource is distributed. What matters is how granular and generic the distribution can be. If gpus can implement work-conserving proportional distribution, that's something which is widely useful and inherently requires dynamic scheduling from the kernel side. If it's about setting per-vendor affinities, this is way too much cgroup interface for a feature which can be easily implemented outside cgroup. Just do it per-process (or per whatever handle gpus use) and confine the configurations from the cgroup side however you like.
While the specific theme changes a bit, we're basically having the same discussion with the same conclusion over the past however many months. Hopefully, the point is clear by now.
Thanks.
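Stated as an interface, the work-conserving model described above is the cpu.weight pattern. The "gpu.weight" file below is hypothetical - no such controller exists - but the existing weight-based cgroup v2 controllers (cpu.weight, io.weight) are driven exactly like this.

/* Give jobA twice jobB's share of the gpu. Either job may still burst
 * to 100% while the other is idle - that's the work conservation. */
#include <stdio.h>

static int set_weight(const char *cg, int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/gpu.weight", cg);
	f = fopen(path, "w");	/* hypothetical interface file */
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%d\n", weight);
	fclose(f);
	return 0;
}

int main(void)
{
	set_weight("jobA", 200);	/* 2x priority */
	set_weight("jobB", 100);
	return 0;
}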
On Fri, May 7, 2021 at 7:45 PM Tejun Heo tj@kernel.org wrote:
Thanks, that helps a lot.
Alex