On Wed, Apr 13, 2022 at 06:29:46PM +0200, Christoph Hellwig wrote:
On Wed, Apr 13, 2022 at 01:18:14PM -0300, Jason Gunthorpe wrote:
Yeah, I was thinking about that too, but on the other hand I think it is completely wrong that gvt requires kvm at all. A vfio_device is not supposed to be tightly linked to KVM - the only exception possibly being s390..
So i915/gvt uses it for:
- poking into the KVM GFN translations
- using the KVM page track notifier
No idea how these could be solved in a more generic way.
TBH I'm not sure how any of this works fully correctly..
I see this code getting something it calls a GFN and then passing it to vfio - which makes no sense. Either a value is a GFN - the physical memory address of the VM - or it is an IOVA. VFIO only takes in IOVA and kvm only takes in GFN. So these are probably IOVAs really..
But then, I see this code taking GFNs (which are probably IOVAs?) and passing them to the kvm page track notifier? That can't be right, VFIO needs to translate the IOVA to a GFN, not assume 1:1...
It seems the purpose is to shadow a page table, and it is capturing user space CPU writes to this page table memory I guess?
GFNs seem to come from gen8_gtt_get_pfn, which seems to be parsing some guest page table?
Jason
On 4/13/22 5:37 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 06:29:46PM +0200, Christoph Hellwig wrote:
On Wed, Apr 13, 2022 at 01:18:14PM -0300, Jason Gunthorpe wrote:
Yeah, I was thinking about that too, but on the other hand I think it is completely wrong that gvt requires kvm at all. A vfio_device is not supposed to be tightly linked to KVM - the only exception possibly being s390..
So i915/gvt uses it for:
- poking into the KVM GFN translations
- using the KVM page track notifier
No idea how these could be solved in a more generic way.
TBH I'm not sure how any of this works fully correctly..
I see this code getting something it calls a GFN and then passing it to vfio - which makes no sense. Either a value is a GFN - the physical memory address of the VM - or it is an IOVA. VFIO only takes in IOVA and kvm only takes in GFN. So these are probably IOVAs really..
Can you let me know the place, so that I can take a look?
But then, I see this code taking GFNs (which are probably IOVAs?) and passing them to the kvm page track notifier? That can't be right, VFIO needs to translate the IOVA to a GFN, not assume 1:1...
GFNs are from the guest page table. It takes the GFN from an entry that belongs to a guest page table and requests kvm_page_track to track it, so that the shadow page table can be updated accordingly.
It seems the purpose is to shadow a page table, and it is capturing user space CPU writes to this page table memory I guess?
Yes. The shadow page table will be built according to the guest GPU page table. When a guest workload is executed in the GPU, the root pointer of the shadow page table in the shadow GPU context is used. If the host enables the IOMMU, the pages used by the shadow page table need to be mapped as IOVA, and the PFNs in the shadow entries are IOVAs.
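To make that concrete, here is a rough sketch of what shadowing a single entry amounts to. The pin helper is hypothetical, standing in for the real pin-and-map path (vfio pinning followed by dma_map_page()); this is not the actual gvt code.

/* Hypothetical stand-in for the real pin + DMA-map path; not a kernel API. */
int pin_guest_page_to_dma(struct intel_vgpu *vgpu, unsigned long gfn,
			  unsigned long size, dma_addr_t *dma_addr);

static int shadow_one_pte(struct intel_vgpu *vgpu, u64 guest_pte,
			  u64 *shadow_pte)
{
	/* Frame number as written by the guest into its GPU page table
	 * (simplified: the real decode masks by the HW address width,
	 * see gen8_gtt_get_pfn()). */
	unsigned long gfn = guest_pte >> PAGE_SHIFT;
	dma_addr_t dma_addr;
	int ret;

	/* Pin the guest page and get an address the GPU (behind the host
	 * IOMMU) can actually use, i.e. an IOVA / DMA address. */
	ret = pin_guest_page_to_dma(vgpu, gfn, PAGE_SIZE, &dma_addr);
	if (ret)
		return ret;

	/* The shadow entry the HW walks carries the host DMA address plus
	 * the guest's low flag bits; the guest never sees this value. */
	*shadow_pte = dma_addr | (guest_pte & ~PAGE_MASK);
	return 0;
}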
GFNs seem to come from gen8_gtt_get_pfn, which seems to be parsing some guest page table?
Yes. It's to extract the PFNs from a page table entry.
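For reference, a simplified sketch of that decode (the real gen8_gtt_get_pfn() also handles 64K/2M/1G entry types and reads the address width from the HW; the 46-bit width here is an assumption):

#define GTT_HAW			46	/* assumed HW address width */
#define GTT_ADDR_4K_MASK	GENMASK_ULL(GTT_HAW - 1, 12)

/* Pull the frame number out of a 4K gen8-style PTE. What comes out is a
 * guest frame number - it still has to be pinned/translated before the
 * host GPU can use it. */
static unsigned long gtt_entry_to_pfn(u64 pte)
{
	return (pte & GTT_ADDR_4K_MASK) >> 12;
}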
Jason
On Wed, Apr 13, 2022 at 07:17:52PM +0000, Wang, Zhi A wrote:
On 4/13/22 5:37 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 06:29:46PM +0200, Christoph Hellwig wrote:
On Wed, Apr 13, 2022 at 01:18:14PM -0300, Jason Gunthorpe wrote:
Yeah, I was thinking about that too, but on the other hand I think it is completely wrong that gvt requires kvm at all. A vfio_device is not supposed to be tightly linked to KVM - the only exception possibly being s390..
So i915/gvt uses it for:
- poking into the KVM GFN translations
- using the KVM page track notifier
No idea how these could be solved in a more generic way.
TBH I'm not sure how any of this works fully correctly..
I see this code getting something it calls a GFN and then passing it to vfio - which makes no sense. Either a value is a GFN - the physical memory address of the VM - or it is an IOVA. VFIO only takes in IOVA and kvm only takes in GFN. So these are probably IOVAs really..
Can you let me know the place, so that I can take a look?
Well, for instance:
static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn, unsigned long size, struct page **page)
There is no way that is a GFN, it is an IOVA.
It seems the purpose is to shadow a page table, and it is capturing user space CPU writes to this page table memory I guess?
Yes. The shadow page table will be built according to the guest GPU page table. When a guest workload is executed in the GPU, the root pointer of the shadow page table in the shadow GPU context is used. If the host enables the IOMMU, the pages used by the shadow page table need to be mapped as IOVA, and the PFNs in the shadow entries are IOVAs.
So if the page table in the guest has IOVA addresses then why can you use them as GFNs?
Or is it that only the page table levels themselves are GFNs and the actual DMAs are IOVA? The unclear mixing of GFN as IOVA in the code makes it inscrutable.
Jason
On 4/13/22 8:04 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 07:17:52PM +0000, Wang, Zhi A wrote:
On 4/13/22 5:37 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 06:29:46PM +0200, Christoph Hellwig wrote:
On Wed, Apr 13, 2022 at 01:18:14PM -0300, Jason Gunthorpe wrote:
Yeah, I was thinking about that too, but on the other hand I think it is completely wrong that gvt requires kvm at all. A vfio_device is not supposed to be tightly linked to KVM - the only exception possibly being s390..
So i915/gvt uses it for:
- poking into the KVM GFN translations
- using the KVM page track notifier
No idea how these could be solved in a more generic way.
TBH I'm not sure how any of this works fully correctly..
I see this code getting something it calls a GFN and then passing it to vfio - which makes no sense. Either a value is a GFN - the physical memory address of the VM - or it is an IOVA. VFIO only takes in IOVA and kvm only takes in GFN. So these are probably IOVAs really..
Can you let me know the place, so that I can take a look?
Well, for instance:
static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn, unsigned long size, struct page **page)
There is no way that is a GFN, it is an IOVA.
I see. The name is vague. There is a promised 1:1 mapping between guest GFN and host IOVA when a PCI device is passed to a VM; I guess mdev is just leveraging it, as they share the same code path in QEMU. It's in a function called vfio_listener_region_add() in the QEMU source code. Are you planning to change the architecture? It would be nice to know your plan.
It seems the purpose is to shadow a page table, and it is capturing user space CPU writes to this page table memory I guess?
Yes. The shadow page table will be built according to the guest GPU page table. When a guest workload is executed in the GPU, the root pointer of the shadow page table in the shadow GPU context is used. If the host enables the IOMMU, the pages used by the shadow page table need to be mapped as IOVA, and the PFNs in the shadow entries are IOVAs.
So if the page table in the guest has IOVA addresses then why can you use them as GFNs?
That's another problem. We don't support a guest enabling the guest IOMMU (aka virtual IOMMU). The guest/virtual IOMMU is implemented in QEMU, and so is the translation between guest IOVA and GFN. For an mdev model implemented in the kernel, there is no mechanism so far to reach it.
People were discussing it before, but no agreement was reached. Is it possible to implement it in the kernel? I would like to discuss it more if there are any good ideas.
Or is it that only the page table levels themselves are GFNs and the actual DMAs are IOVA? The unclear mixing of GFN as IOVA in the code makes it inscrutable.
No. Even though the HW is capable of controlling the level of translation, it's not used like this in the existing driver. It's definitely an open architectural question.
Jason
On Wed, Apr 13, 2022 at 09:08:40PM +0000, Wang, Zhi A wrote:
On 4/13/22 8:04 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 07:17:52PM +0000, Wang, Zhi A wrote:
On 4/13/22 5:37 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 06:29:46PM +0200, Christoph Hellwig wrote:
On Wed, Apr 13, 2022 at 01:18:14PM -0300, Jason Gunthorpe wrote:
Yeah, I was thinking about that too, but on the other hand I think it is completely wrong that gvt requires kvm at all. A vfio_device is not supposed to be tightly linked to KVM - the only exception possibly being s390..
So i915/gvt uses it for:
- poking into the KVM GFN translations
- using the KVM page track notifier
No idea how these could be solved in a more generic way.
TBH I'm not sure how any of this works fully correctly..
I see this code getting something it calls a GFN and then passing it to vfio - which makes no sense. Either a value is a GFN - the physical memory address of the VM - or it is an IOVA. VFIO only takes in IOVA and kvm only takes in GFN. So these are probably IOVAs really..
Can you let me know the place, so that I can take a look?
Well, for instance:
static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn, unsigned long size, struct page **page)
There is no way that is a GFN, it is an IOVA.
I see. The name is vague. There is a promised 1:1 mapping between guest GFN and host IOVA when a PCI device is passed to a VM; I guess mdev is just leveraging it, as they share the same code path in QEMU.
That has never been true. It happens to be the case in some common scenarios.
So if the page table in the guest has IOVA addresses then why can you use them as GFNs?
That's another problem. We don't support a guest enabling the guest IOMMU (aka virtual IOMMU). The guest/virtual IOMMU is implemented in QEMU, and so is the translation between guest IOVA and GFN. For an mdev model implemented in the kernel, there is no mechanism so far to reach it.
And this is the uncommon scenario: there is no way for the mdev driver to know if the viommu is turned on, and AFAIK no way to block it from VFIO.
People were discussing it before, but no agreement was reached. Is it possible to implement it in the kernel? I would like to discuss it more if there are any good ideas.
I don't know of anything, VFIO and kvm are not intended to be tightly linked like this, they don't have the same view of the world.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, April 14, 2022 7:12 AM
On Wed, Apr 13, 2022 at 09:08:40PM +0000, Wang, Zhi A wrote:
On 4/13/22 8:04 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 07:17:52PM +0000, Wang, Zhi A wrote:
On 4/13/22 5:37 PM, Jason Gunthorpe wrote:
On Wed, Apr 13, 2022 at 06:29:46PM +0200, Christoph Hellwig wrote:
On Wed, Apr 13, 2022 at 01:18:14PM -0300, Jason Gunthorpe wrote:
Yeah, I was thinking about that too, but on the other hand I think it is completely wrong that gvt requires kvm at all. A vfio_device is not supposed to be tightly linked to KVM - the only exception possibly being s390..
So i915/gvt uses it for:
- poking into the KVM GFN translations
The only user of this is is_2MB_gtt_possible(), which I suppose should go through vfio instead of kvm, as it actually means IOVA here.
- using the KVM page track notifier
This is the real reason which causes the mess, as write-protecting CPU access to certain guest memory has to go through KVM.
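For context, this is roughly the shape of the GFN-based interface being consumed - a sketch against the kvm_page_track API of this era; exact names can differ between kernel versions, and memslot lookup, locking and error handling are omitted:

#include <asm/kvm_page_track.h>

/* Invoked by KVM when the guest CPU writes a write-tracked GPA; a
 * shadowing driver uses this to invalidate/rebuild the affected
 * shadow entries. */
static void shadow_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
			       const u8 *new, int bytes,
			       struct kvm_page_track_notifier_node *node)
{
	/* ...re-shadow the range [gpa, gpa + bytes)... */
}

static struct kvm_page_track_notifier_node shadow_tracker = {
	.track_write = shadow_track_write,
};

/* Both registration and per-page write protection are expressed in GFNs,
 * which is why this path has to go through KVM rather than VFIO.
 * (Registration would normally happen once, at device open.) */
static void start_tracking(struct kvm *kvm, struct kvm_memory_slot *slot,
			   gfn_t gfn)
{
	kvm_page_track_register_notifier(kvm, &shadow_tracker);
	kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
}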
No idea how these could be solved in a more generic way.
TBH I'm not sure how any of this works fully correctly..
I see this code getting something it calls a GFN and then passing it to vfio - which makes no sense. Either a value is a GFN - the physical memory address of the VM - or it is an IOVA. VFIO only takes in IOVA and kvm only takes in GFN. So these are probably IOVAs really..
Can you let me know the place, so that I can take a look?
Well, for instance:
static int gvt_pin_guest_page(struct intel_vgpu *vgpu, unsigned long gfn, unsigned long size, struct page **page)
There is no way that is a GFN, it is an IOVA.
I see. The name is vague. There is a promised 1:1 mapping between guest GFN and host IOVA when a PCI device is passed to a VM; I guess mdev is just leveraging it, as they share the same code path in QEMU.
That has never been true. It happens to be the case in some common scenarios.
So if the page table in the guest has IOVA addresses then why can you use them as GFNs?
That's another problem. We don't support a guest enabling the guest IOMMU (aka virtual IOMMU). The guest/virtual IOMMU is implemented in QEMU, and so is the translation between guest IOVA and GFN. For an mdev model implemented in the kernel, there is no mechanism so far to reach it.
And this is the uncommon scenario: there is no way for the mdev driver to know if the viommu is turned on, and AFAIK no way to block it from VFIO.
People were discussing it before, but no agreement was reached. Is it possible to implement it in the kernel? I would like to discuss it more if there are any good ideas.
I don't know of anything, VFIO and kvm are not intended to be tightly linked like this, they don't have the same view of the world.
Yes, this is the main problem. VFIO only cares about IOVA and KVM only cares about GPA. GVT as an mdev driver should follow VFIO in concept, but due to the requirement of GPU page table shadowing it needs to call into KVM for write-protecting CPU access to GPA.
What about extending the KVM page tracking interface to accept HVA? This is probably the only common denominator between VFIO and KVM that would allow dissolving this conceptual disconnect...
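Purely to illustrate the idea - entirely hypothetical, no such interface exists today:

/* Hypothetical HVA-based variants: KVM would resolve the HVA back to the
 * GFN(s) via its memslots, so a VFIO-based driver would not need to know
 * the guest physical layout at all. */
int kvm_page_track_add_hva(struct kvm *kvm, unsigned long hva,
			   enum kvm_page_track_mode mode);
int kvm_page_track_remove_hva(struct kvm *kvm, unsigned long hva,
			      enum kvm_page_track_mode mode);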
Thanks,
Kevin
From: Wang, Zhi A zhi.a.wang@intel.com Sent: Thursday, April 14, 2022 5:09 AM
Or is it that only the page table levels themselves are GFNs and the actual DMAs are IOVA? The unclear mixing of GFN as IOVA in the code makes it inscrutable.
No. Even though the HW is capable of controlling the level of translation, it's not used like this in the existing driver. It's definitely an open architectural question.
There is no open question on this. Any guest memory that the vGPU accesses must be IOVA, including the page table levels. There is only one address space per vRID.