This patch set implements a Heterogeneous System Architecture (HSA) driver for radeon-family GPUs.
HSA allows different processor types (CPUs, DSPs, GPUs, etc.) to share system resources more effectively via HW features including shared pageable memory, userspace-accessible work queues, and platform-level atomics. In addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea Islands family of GPUs also performs HW-level validation of commands passed in through the queues (aka rings).
The code in this patch set is intended to serve both as a sample driver for other HSA-compatible hardware devices and as a production driver for radeon-family processors. The code is architected to support multiple CPUs each with connected GPUs, although the current implementation focuses on a single Kaveri/Berlin APU, and works alongside the existing radeon kernel graphics driver (kgd).
AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware functionality between HSA compute and regular gfx/compute (memory, interrupts, registers), while other functionality has been added specifically for HSA compute (hw scheduler for virtualized compute rings). All shared hardware is owned by the radeon graphics driver, and an interface between kfd and kgd allows the kfd to make use of those shared resources, while HSA-specific functionality is managed directly by kfd by submitting packets into an HSA-specific command queue (the "HIQ").
During kfd module initialization, a char device node (/dev/kfd) is created (it survives until module exit), with ioctls for queue creation and management, and data structures are initialized for managing HSA device topology.
The rest of the initialization is driven by calls from the radeon kgd at the following points:
- radeon_init (kfd_init)
- radeon_exit (kfd_fini)
- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
- radeon_driver_unload_kms (kfd_device_fini)
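As a rough illustration of how these hook points could be wired up on the radeon (kgd) side, the fragment below calls into kfd from the module init/exit and KMS load/unload paths. This is a sketch only: the kfd function signatures, argument lists and error handling shown here are assumptions, not the actual interface defined in include/linux/radeon_kfd.h in this patch set.

#include <linux/module.h>
#include <linux/pci.h>

/* Prototypes assumed for illustration; the real ones live in radeon_kfd.h. */
int  kfd_init(void);
void kfd_fini(void);
int  kfd_device_probe(struct pci_dev *pdev);
int  kfd_device_init(struct pci_dev *pdev);
void kfd_device_fini(struct pci_dev *pdev);

static int __init radeon_init(void)
{
	kfd_init();                      /* one-time kfd core setup */
	/* ... existing DRM/PCI driver registration follows ... */
	return 0;
}

static void __exit radeon_exit(void)
{
	/* ... existing DRM/PCI driver unregistration ... */
	kfd_fini();                      /* one-time kfd core teardown */
}

int radeon_driver_load_kms(struct drm_device *dev, unsigned long flags)
{
	struct radeon_device *rdev = dev->dev_private;

	/* ... existing per-device initialization ... */
	kfd_device_probe(rdev->pdev);    /* is this GPU HSA-capable?  */
	kfd_device_init(rdev->pdev);     /* per-device kfd bring-up   */
	return 0;
}

int radeon_driver_unload_kms(struct drm_device *dev)
{
	struct radeon_device *rdev = dev->dev_private;

	kfd_device_fini(rdev->pdev);     /* per-device kfd teardown   */
	/* ... existing per-device cleanup ... */
	return 0;
}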
During probe and init processing, per-device data structures are established which connect to the associated graphics kernel driver. This information is exposed to userspace via sysfs, along with a version number that allows userspace to determine whether a topology change has occurred while it was reading from sysfs.
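To show how userspace could use that version number, here is a minimal sketch of a consistent topology read: sample the generation counter, walk the nodes, then re-read the counter and retry if it changed. The sysfs paths are hypothetical placeholders, not necessarily the layout created by kfd_topology.c.

#include <stdio.h>
#include <stdlib.h>

static unsigned long read_ulong(const char *path)
{
	unsigned long v = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		fscanf(f, "%lu", &v);
		fclose(f);
	}
	return v;
}

int main(void)
{
	unsigned long gen_before, gen_after;

	do {
		gen_before = read_ulong("/sys/devices/virtual/kfd/kfd/topology/generation_id");
		/* ... walk .../topology/nodes/<N>/properties here ... */
		gen_after = read_ulong("/sys/devices/virtual/kfd/kfd/topology/generation_id");
	} while (gen_before != gen_after);   /* topology changed underneath us: retry */

	printf("topology generation %lu\n", gen_after);
	return 0;
}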
The interface between kfd and kgd also allows the kfd to request buffer management services from kgd, and allows kgd to route interrupt requests to kfd code since the interrupt block is shared between regular graphics/compute and HSA compute subsystems in the GPU.
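To make the shape of that interface concrete, below is a hypothetical, stripped-down version of the kind of function tables the two drivers could exchange. The member names and signatures are illustrative assumptions, not the actual kfd2kgd/kgd2kfd definitions from include/linux/radeon_kfd.h.

#include <linux/types.h>

struct kgd_dev;                 /* opaque handle to the radeon device */

/* Sketch: services kgd provides to kfd. */
struct kfd2kgd_calls {
	/* buffer management services owned by the graphics driver */
	int  (*allocate_mem)(struct kgd_dev *kgd, size_t size, size_t align,
			     void **cpu_ptr, u64 *gpu_addr);
	void (*free_mem)(struct kgd_dev *kgd, void *cpu_ptr);

	/* shared-register helpers (e.g. VMID-to-PASID mapping) */
	void (*write_register)(struct kgd_dev *kgd, u32 reg, u32 value);
	u32  (*read_register)(struct kgd_dev *kgd, u32 reg);
};

/* Sketch: hooks kfd exposes so kgd can route shared events to it. */
struct kgd2kfd_calls {
	void (*interrupt)(struct kgd_dev *kgd, const void *ih_ring_entry);
	void (*suspend)(struct kgd_dev *kgd);
	int  (*resume)(struct kgd_dev *kgd);
};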
The kfd code works with an open source usermode library ("libhsakmt") which is in the final stages of IP review and should be published in a separate repo over the next few days.
The code operates in one of three modes, selectable via the sched_policy module parameter:
- sched_policy=0 uses a hardware scheduler running in the MEC block within CP, and allows oversubscription (more queues than HW slots)
- sched_policy=1 also uses HW scheduling but does not allow oversubscription, so create_queue requests fail when we run out of HW slots
- sched_policy=2 does not use HW scheduling, so the driver manually assigns queues to HW slots by programming registers
The "no HW scheduling" option is for debug & new hardware bringup only, so has less test coverage than the other options. Default in the current code is "HW scheduling without oversubscription" since that is where we have the most test coverage but we expect to change the default to "HW scheduling with oversubscription" after further testing. This effectively removes the HW limit on the number of work queues available to applications.
Programs running on the GPU are associated with an address space through the VMID field, which is translated to a unique PASID at access time via a set of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) are partitioned (under control of the radeon kgd) between current gfx/compute and HSA compute, with each getting 8 in the current code. The VMID-to-PASID mapping registers are updated by the HW scheduler when it is in use, and by driver code if HW scheduling is not being used.
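The fragment below illustrates the kind of 8/8 split and mapping update described above. The helper kgd_write_vmid_pasid_mapping() is a made-up stand-in for the register write that would go through the kfd<->kgd interface, and the constants are illustrative rather than taken from the hardware headers.

struct kgd_dev;

/* Made-up helper standing in for the real shared-register write. */
void kgd_write_vmid_pasid_mapping(struct kgd_dev *kgd, unsigned int vmid,
				  unsigned int pasid);

#define TOTAL_VMIDS     16
#define GFX_VMID_FIRST   0     /* VMIDs 0-7 remain with regular gfx/compute */
#define HSA_VMID_FIRST   8     /* VMIDs 8-15 are handed to HSA compute      */
#define HSA_VMID_LAST   (TOTAL_VMIDS - 1)

/*
 * Only relevant when sched_policy=2 (no HW scheduling); with the HW
 * scheduler enabled, the MEC firmware maintains the mapping registers.
 */
static void bind_pasid_to_hsa_vmid(struct kgd_dev *kgd, unsigned int vmid,
				   unsigned int pasid)
{
	if (vmid < HSA_VMID_FIRST || vmid > HSA_VMID_LAST)
		return;         /* that VMID belongs to the graphics side */

	kgd_write_vmid_pasid_mapping(kgd, vmid, pasid);
}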
The Sea Islands compute queues use a new "doorbell" mechanism instead of the earlier kernel-managed write pointer registers. Doorbells use a separate BAR dedicated for this purpose, and pages within the doorbell aperture are mapped to userspace (each page mapped to only one user address space). Writes to the doorbell aperture are intercepted by GPU hardware, allowing userspace code to safely manage work queues (rings) without requiring a kernel call for every ring update.
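From the userspace side, ringing a doorbell could look roughly like the sketch below: map the process's doorbell page once, then store the new write pointer to the right slot with no system call. The mmap offset, doorbell index and write-pointer value are placeholders; the real mapping is established through the kfd mmap handler and the queue-creation path.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define DOORBELL_PAGE_OFFSET 0           /* placeholder mmap offset for this process */

int main(void)
{
	int fd = open("/dev/kfd", O_RDWR);
	volatile uint32_t *doorbells;
	uint32_t queue_doorbell_index = 0;   /* index handed back at queue creation (assumed) */
	uint32_t new_write_ptr = 42;         /* ring position this process just filled        */

	if (fd < 0)
		return 1;

	/* Map this process's doorbell page; writes are intercepted by the GPU. */
	doorbells = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, DOORBELL_PAGE_OFFSET);
	if (doorbells == MAP_FAILED)
		return 1;

	/* After placing packets in the user-mode ring, ring the doorbell:
	 * no kernel call needed, the HW picks up the new write pointer. */
	doorbells[queue_doorbell_index] = new_write_ptr;

	munmap((void *)doorbells, getpagesize());
	close(fd);
	return 0;
}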
The first step for an application process is to open the kfd device. Calls to open create a kfd "process" structure only for the first thread of the process. Subsequent open calls are checked to see whether they come from processes using the same mm_struct and, if so, do nothing. The kfd per-process data lives as long as the mm_struct exists. Each mm_struct is associated with a unique PASID, allowing the IOMMUv2 to make userspace process memory accessible to the GPU.
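A simplified sketch of that open-path behaviour, keyed on current->mm, is shown below. The lookup structure and the kfd_pasid_alloc() helper are assumptions for illustration, not the actual kfd_process.c implementation.

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/slab.h>

unsigned int kfd_pasid_alloc(void);   /* assumed helper: hands out a fresh PASID */

static LIST_HEAD(kfd_processes);
static DEFINE_MUTEX(kfd_processes_mutex);

struct kfd_process {
	struct list_head list;
	struct mm_struct *mm;         /* identity of the owning address space */
	unsigned int pasid;           /* unique PASID bound to this mm        */
};

static int kfd_open(struct inode *inode, struct file *filp)
{
	struct kfd_process *p;

	mutex_lock(&kfd_processes_mutex);
	list_for_each_entry(p, &kfd_processes, list) {
		if (p->mm == current->mm)
			goto out;     /* same process reopened: reuse existing state */
	}

	p = kzalloc(sizeof(*p), GFP_KERNEL);
	if (!p) {
		mutex_unlock(&kfd_processes_mutex);
		return -ENOMEM;
	}
	p->mm = current->mm;
	p->pasid = kfd_pasid_alloc();
	list_add(&p->list, &kfd_processes);
out:
	filp->private_data = p;
	mutex_unlock(&kfd_processes_mutex);
	return 0;
}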
Next step is for the application to collect topology information via sysfs. This gives userspace enough information to be able to identify specific nodes (processors) in subsequent queue management calls. Application processes can create queues on multiple processors, and processors support queues from multiple processes.
At this point the application can create work queues in userspace memory and pass them through the usermode library to kfd to have them mapped onto HW queue slots so that commands written to the queues can be executed by the GPU. Queue operations specify a processor node, and so the bulk of this code is device-specific.
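As an illustration of that flow from the application side (normally hidden behind libhsakmt), the sketch below allocates a ring in ordinary user memory and asks kfd to map it to a HW queue slot. The ioctl number and argument struct are invented stand-ins, not the real ABI from include/uapi/linux/kfd_ioctl.h.

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented stand-ins -- the real definitions live in kfd_ioctl.h. */
struct kfd_create_queue_args {
	uint64_t ring_base_address;   /* userspace ring buffer                */
	uint64_t ring_size;           /* in bytes                             */
	uint32_t gpu_id;              /* node picked from the sysfs topology  */
	uint32_t queue_type;          /* e.g. compute                         */
	uint32_t queue_id;            /* out: handle for later ioctls         */
	uint32_t doorbell_offset;     /* out: where to ring the doorbell      */
};
#define KFD_IOC_CREATE_QUEUE _IOWR('K', 1, struct kfd_create_queue_args)

int main(void)
{
	struct kfd_create_queue_args args = {0};
	void *ring;
	int fd = open("/dev/kfd", O_RDWR);

	if (fd < 0)
		return 1;

	/* The ring lives in pageable process memory. */
	if (posix_memalign(&ring, 4096, 64 * 1024))
		return 1;

	args.ring_base_address = (uintptr_t)ring;
	args.ring_size = 64 * 1024;
	args.gpu_id = 1;              /* placeholder node id taken from sysfs */

	if (ioctl(fd, KFD_IOC_CREATE_QUEUE, &args) == 0) {
		/* write packets into 'ring', then ring the doorbell at
		 * args.doorbell_offset to have the GPU execute them */
	}

	close(fd);
	return 0;
}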
Written by John Bridgman <John.Bridgman@amd.com>
Alexey Skidanov (4):
  hsa/radeon: 32-bit processes support
  hsa/radeon: NULL pointer dereference bug workaround
  hsa/radeon: HSA64/HSA32 modes support
  hsa/radeon: Add local memory to topology

Andrew Lewycky (3):
  hsa/radeon: Make binding of process to device permanent
  hsa/radeon: Implement hsaKmtSetMemoryPolicy
  mm: Change timing of notification to IOMMUs about a page to be invalidated

Ben Goz (20):
  hsa/radeon: Add queue and hw_pointer_store modules
  hsa/radeon: Add support allocating kernel doorbells
  hsa/radeon: Add mqd_manager module
  hsa/radeon: Add kernel queue support for KFD
  hsa/radeon: Add module parameter of scheduling policy
  hsa/radeon: Add packet manager module
  hsa/radeon: Add process queue manager module
  hsa/radeon: Add device queue manager module
  hsa/radeon: Switch to new queue scheduler
  hsa/radeon: Add IOCTL for update queue
  hsa/radeon: Queue Management integration with Memory Management
  hsa/radeon: update queue fault handling
  hsa/radeon: fixing a bug to support 32b processes
  hsa/radeon: Fix number of pipes per ME
  hsa/radeon: Removing hw pointer store module
  hsa/radeon: Adding some error messages
  hsa/radeon: Fixing minor issues with kernel queues (DIQ)
  drm/radeon: Add register access functions to kfd2kgd interface
  hsa/radeon: Eliminating all direct register accesses
  drm/radeon: Remove lock functions from kfd2kgd interface

Evgeny Pinchuk (9):
  hsa/radeon: fix the OEMID assignment in kfd_topology
  drm/radeon: extending kfd-kgd interface
  hsa/radeon: implementing IOCTL for clock counters
  drm/radeon: adding synchronization for GRBM GFX
  hsa/radeon: fixing clock counters bug
  drm/radeon: Extending kfd interface
  hsa/radeon: Adding max clock speeds to topology
  hsa/radeon: Alternating the source of max clock
  hsa/radeon: Exclusive access for perf. counters

Michael Varga (1):
  hsa/radeon: debugging print statements

Oded Gabbay (45):
  mm: Add kfd_process pointer to mm_struct
  drm/radeon: reduce number of free VMIDs and pipes in KV
  drm/radeon: Report doorbell configuration to kfd
  drm/radeon: Add radeon <--> kfd interface
  drm/radeon: Add kfd-->kgd interface to get virtual ram size
  drm/radeon: Add kfd-->kgd interfaces of memory allocation/mapping
  drm/radeon: Add kfd-->kgd interface of locking srbm_gfx_cntl register
  drm/radeon: Add calls to initialize and finalize kfd from radeon
  hsa/radeon: Add code base of hsa driver for AMD's GPUs
  hsa/radeon: Add initialization and unmapping of doorbell aperture
  hsa/radeon: Add scheduler code
  hsa/radeon: Add kfd mmap handler
  hsa/radeon: Add 2 new IOCTL to kfd, CREATE_QUEUE and DESTROY_QUEUE
  hsa/radeon: Update MAINTAINERS and CREDITS files
  hsa/radeon: Add interrupt handling module
  hsa/radeon: Add the isr function of the KFD scehduler
  hsa/radeon: Handle deactivation of queues using interrupts
  hsa/radeon: Enable interrupts in KFD scheduler
  hsa/radeon: Enable/Disable KFD interrupt module
  hsa/radeon: Add interrupt callback function to kgd2kfd interface
  hsa/radeon: Add kgd-->kfd interfaces for suspend and resume
  drm/radeon: Add calls to suspend and resume of kfd driver
  drm/radeon/cik: Don't touch int of pipes 1-7
  drm/radeon/cik: Call kfd isr function
  hsa/radeon: Fix memory size allocated for HPD
  hsa/radeon: Fix list of supported devices
  hsa/radeon: Fix coding style in cik_int.h
  hsa/radeon: Print ioctl commnad only in debug mode
  hsa/radeon: Print ISR info only in debug mode
  hsa/radeon: Workaround for a bug in amd_iommu
  hsa/radeon: Eliminate warnings in compilation
  hsa/radeon: Various kernel styling fixes
  hsa/radeon: Rearrange structures in kfd_ioctl.h
  hsa/radeon: change another pr_info to pr_debug
  hsa/radeon: Fix timeout calculation in sync_with_hw
  hsa/radeon: Update module information and version
  hsa/radeon: Update module version to 0.6.0
  hsa/radeon: Fix initialization of sh_mem registers
  hsa/radeon: Fix compilation warnings
  hsa/radeon: Remove old scheduler code
  hsa/radeon: Static analysis (smatch) fixes
  hsa/radeon: Check oversubscription before destroying runlist
  hsa/radeon: Don't verify cksum when parsing CRAT table
  hsa/radeon: Update module version to 0.6.1
  hsa/radeon: Update module version to 0.6.2

Yair Shachar (1):
  hsa/radeon: Adding qcm fence return status
 CREDITS | 7 +
 MAINTAINERS | 8 +
 drivers/Kconfig | 2 +
 drivers/gpu/Makefile | 1 +
 drivers/gpu/drm/radeon/Makefile | 1 +
 drivers/gpu/drm/radeon/cik.c | 156 +--
 drivers/gpu/drm/radeon/cikd.h | 51 +-
 drivers/gpu/drm/radeon/radeon.h | 9 +
 drivers/gpu/drm/radeon/radeon_device.c | 32 +
 drivers/gpu/drm/radeon/radeon_drv.c | 6 +
 drivers/gpu/drm/radeon/radeon_kfd.c | 630 ++++++++++
 drivers/gpu/drm/radeon/radeon_kms.c | 9 +
 drivers/gpu/hsa/Kconfig | 20 +
 drivers/gpu/hsa/Makefile | 1 +
 drivers/gpu/hsa/radeon/Makefile | 12 +
 drivers/gpu/hsa/radeon/cik_int.h | 50 +
 drivers/gpu/hsa/radeon/cik_mqds.h | 250 ++++
 drivers/gpu/hsa/radeon/cik_regs.h | 220 ++++
 drivers/gpu/hsa/radeon/kfd_aperture.c | 123 ++
 drivers/gpu/hsa/radeon/kfd_chardev.c | 530 +++++++++
 drivers/gpu/hsa/radeon/kfd_crat.h | 294 +++++
 drivers/gpu/hsa/radeon/kfd_device.c | 244 ++++
 drivers/gpu/hsa/radeon/kfd_device_queue_manager.c | 981 ++++++++++++++++
 drivers/gpu/hsa/radeon/kfd_device_queue_manager.h | 101 ++
 drivers/gpu/hsa/radeon/kfd_doorbell.c | 242 ++++
 drivers/gpu/hsa/radeon/kfd_interrupt.c | 177 +++
 drivers/gpu/hsa/radeon/kfd_kernel_queue.c | 305 +++++
 drivers/gpu/hsa/radeon/kfd_kernel_queue.h | 66 ++
 drivers/gpu/hsa/radeon/kfd_module.c | 130 +++
 drivers/gpu/hsa/radeon/kfd_mqd_manager.c | 290 +++++
 drivers/gpu/hsa/radeon/kfd_mqd_manager.h | 54 +
 drivers/gpu/hsa/radeon/kfd_packet_manager.c | 488 ++++++++
 drivers/gpu/hsa/radeon/kfd_pasid.c | 97 ++
 drivers/gpu/hsa/radeon/kfd_pm4_headers.h | 682 +++++++++++
 drivers/gpu/hsa/radeon/kfd_pm4_opcodes.h | 107 ++
 drivers/gpu/hsa/radeon/kfd_priv.h | 475 ++++++++
 drivers/gpu/hsa/radeon/kfd_process.c | 391 +++++++
 drivers/gpu/hsa/radeon/kfd_process_queue_manager.c | 343 ++++++
 drivers/gpu/hsa/radeon/kfd_queue.c | 109 ++
 drivers/gpu/hsa/radeon/kfd_scheduler.h | 72 ++
 drivers/gpu/hsa/radeon/kfd_topology.c | 1207 ++++++++++++++++++++
 drivers/gpu/hsa/radeon/kfd_topology.h | 168 +++
 drivers/gpu/hsa/radeon/kfd_vidmem.c | 97 ++
 include/linux/mm_types.h | 14 +
 include/linux/radeon_kfd.h | 106 ++
 include/uapi/linux/kfd_ioctl.h | 133 +++
 mm/rmap.c | 8 +-
 47 files changed, 9402 insertions(+), 97 deletions(-)
 create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
 create mode 100644 drivers/gpu/hsa/Kconfig
 create mode 100644 drivers/gpu/hsa/Makefile
 create mode 100644 drivers/gpu/hsa/radeon/Makefile
 create mode 100644 drivers/gpu/hsa/radeon/cik_int.h
 create mode 100644 drivers/gpu/hsa/radeon/cik_mqds.h
 create mode 100644 drivers/gpu/hsa/radeon/cik_regs.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_aperture.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_chardev.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_crat.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_device.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_device_queue_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_device_queue_manager.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_doorbell.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_interrupt.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_kernel_queue.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_kernel_queue.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_module.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_mqd_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_mqd_manager.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_packet_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_pasid.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_pm4_headers.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_pm4_opcodes.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_priv.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_process.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_process_queue_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_queue.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_scheduler.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_topology.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_topology.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_vidmem.c
 create mode 100644 include/linux/radeon_kfd.h
 create mode 100644 include/uapi/linux/kfd_ioctl.h
On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> This patch set implements a Heterogeneous System Architecture (HSA) driver for radeon-family GPUs.
These are just quick comments on a few things. Given the size of this, people will need time to review it.
> Next step is for the application to collect topology information via sysfs. This gives userspace enough information to be able to identify specific nodes (processors) in subsequent queue management calls. Application processes can create queues on multiple processors, and processors support queues from multiple processes.
I am not a fan of using sysfs to discover topology.
So a general comment is that you need to collapse many patches. Things like patch 58, which fixes kernel coding style, should be squashed, i.e. fix all previous patches that introduced the broken style.
Even worse are things like patch 71, which removes code you added just a few patches earlier in the patchset. This means that if I start reviewing in patch order I might spend time on code that is later deleted/modified/fixed, i.e. the time I spend understanding and reading some code might just be wasted.
I will come back with individual patch comments and also general remarks.
Cheers, Jérôme
On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> So a general comment is that you need to collapse many patches. Things like patch 58, which fixes kernel coding style, should be squashed, i.e. fix all previous patches that introduced the broken style.
> Even worse are things like patch 71, which removes code you added just a few patches earlier in the patchset.
Not quite, the code was added in patch 11.
> This means that if I start reviewing in patch order I might spend time on code that is later deleted/modified/fixed, i.e. the time I spend understanding and reading some code might just be wasted.
Quick answer is that you are of course right, but having said that, I think it would not be simple at all to do that squashing. I squashed what I could, and I can probably do a little more (like patch 58), but the major problem is that one of the main modules of the driver - the scheduler (QCM) - was completely re-written (patches 46-53). Therefore, from patch 1 to 53 we use the old scheduler code, and from 54 onward we use the new QCM (the old scheduler code is completely removed in patch 71). So I could maybe squash 71 into 54, but that won't help you much, I think.
So, the current advice I can give is to completely ignore the following files, because they do not exist in the final commit:
- kfd_sched_cik_static.c (stopped being used in 54)
- kfd_registers.c (stopped being used in 81, because we moved all register writes to radeon)
- kfd_hw_pointer_store.c (alive from 46 to 67)
- kfd_hw_pointer_store.h (alive from 46 to 67)
Oded
On Thu, Jul 10, 2014 at 10:51:29PM +0000, Gabbay, Oded wrote:
OK, I'll try, but this is driving me crazy. I am juggling between the fully applied patchset and individual patches, going back and forth trying to find which patch is the one I want to comment on. Not to mention that, after that, people would need to ack a bad patch just because they know a later good patch fixes the badness.
This patchset can be reduced to fewer than 10 big patches. I would first do the core patches that touch core things like the IOMMU and are preparatory to the patchset, then the kfd patches that do not touch anything in radeon but implement the skeleton and core infrastructure, and then the radeon-specific patches.
This leads me to the design: all the core kfd helpers should be in the hsa directory and should be helpers, while all the radeon-specific bits should most likely be part of the radeon gfx kernel module. I am not against a radeon_kfd, but really I do not see the point. There should be a hsa.ko just like there is a drm.ko.
Also, I think it would be beneficial to rename kfd to hsa, because it is the more common name and the one that will bring valid hits in a web search.
Each of the patches should have a proper commit message that is neither too cryptic nor too chatty.
Cheers, Jérôme
I will come back with individual patch comments and also general remarks.
Cheers, Jérôme
Alexey Skidanov (4):
  hsa/radeon: 32-bit processes support
  hsa/radeon: NULL pointer dereference bug workaround
  hsa/radeon: HSA64/HSA32 modes support
  hsa/radeon: Add local memory to topology

Andrew Lewycky (3):
  hsa/radeon: Make binding of process to device permanent
  hsa/radeon: Implement hsaKmtSetMemoryPolicy
  mm: Change timing of notification to IOMMUs about a page to be invalidated

Ben Goz (20):
  hsa/radeon: Add queue and hw_pointer_store modules
  hsa/radeon: Add support allocating kernel doorbells
  hsa/radeon: Add mqd_manager module
  hsa/radeon: Add kernel queue support for KFD
  hsa/radeon: Add module parameter of scheduling policy
  hsa/radeon: Add packet manager module
  hsa/radeon: Add process queue manager module
  hsa/radeon: Add device queue manager module
  hsa/radeon: Switch to new queue scheduler
  hsa/radeon: Add IOCTL for update queue
  hsa/radeon: Queue Management integration with Memory Management
  hsa/radeon: update queue fault handling
  hsa/radeon: fixing a bug to support 32b processes
  hsa/radeon: Fix number of pipes per ME
  hsa/radeon: Removing hw pointer store module
  hsa/radeon: Adding some error messages
  hsa/radeon: Fixing minor issues with kernel queues (DIQ)
  drm/radeon: Add register access functions to kfd2kgd interface
  hsa/radeon: Eliminating all direct register accesses
  drm/radeon: Remove lock functions from kfd2kgd interface

Evgeny Pinchuk (9):
  hsa/radeon: fix the OEMID assignment in kfd_topology
  drm/radeon: extending kfd-kgd interface
  hsa/radeon: implementing IOCTL for clock counters
  drm/radeon: adding synchronization for GRBM GFX
  hsa/radeon: fixing clock counters bug
  drm/radeon: Extending kfd interface
  hsa/radeon: Adding max clock speeds to topology
  hsa/radeon: Alternating the source of max clock
  hsa/radeon: Exclusive access for perf. counters

Michael Varga (1):
  hsa/radeon: debugging print statements

Oded Gabbay (45):
  mm: Add kfd_process pointer to mm_struct
  drm/radeon: reduce number of free VMIDs and pipes in KV
  drm/radeon: Report doorbell configuration to kfd
  drm/radeon: Add radeon <--> kfd interface
  drm/radeon: Add kfd-->kgd interface to get virtual ram size
  drm/radeon: Add kfd-->kgd interfaces of memory allocation/mapping
  drm/radeon: Add kfd-->kgd interface of locking srbm_gfx_cntl register
  drm/radeon: Add calls to initialize and finalize kfd from radeon
  hsa/radeon: Add code base of hsa driver for AMD's GPUs
  hsa/radeon: Add initialization and unmapping of doorbell aperture
  hsa/radeon: Add scheduler code
  hsa/radeon: Add kfd mmap handler
  hsa/radeon: Add 2 new IOCTL to kfd, CREATE_QUEUE and DESTROY_QUEUE
  hsa/radeon: Update MAINTAINERS and CREDITS files
  hsa/radeon: Add interrupt handling module
  hsa/radeon: Add the isr function of the KFD scehduler
  hsa/radeon: Handle deactivation of queues using interrupts
  hsa/radeon: Enable interrupts in KFD scheduler
  hsa/radeon: Enable/Disable KFD interrupt module
  hsa/radeon: Add interrupt callback function to kgd2kfd interface
  hsa/radeon: Add kgd-->kfd interfaces for suspend and resume
  drm/radeon: Add calls to suspend and resume of kfd driver
  drm/radeon/cik: Don't touch int of pipes 1-7
  drm/radeon/cik: Call kfd isr function
  hsa/radeon: Fix memory size allocated for HPD
  hsa/radeon: Fix list of supported devices
  hsa/radeon: Fix coding style in cik_int.h
  hsa/radeon: Print ioctl commnad only in debug mode
  hsa/radeon: Print ISR info only in debug mode
  hsa/radeon: Workaround for a bug in amd_iommu
  hsa/radeon: Eliminate warnings in compilation
  hsa/radeon: Various kernel styling fixes
  hsa/radeon: Rearrange structures in kfd_ioctl.h
  hsa/radeon: change another pr_info to pr_debug
  hsa/radeon: Fix timeout calculation in sync_with_hw
  hsa/radeon: Update module information and version
  hsa/radeon: Update module version to 0.6.0
  hsa/radeon: Fix initialization of sh_mem registers
  hsa/radeon: Fix compilation warnings
  hsa/radeon: Remove old scheduler code
  hsa/radeon: Static analysis (smatch) fixes
  hsa/radeon: Check oversubscription before destroying runlist
  hsa/radeon: Don't verify cksum when parsing CRAT table
  hsa/radeon: Update module version to 0.6.1
  hsa/radeon: Update module version to 0.6.2

Yair Shachar (1):
  hsa/radeon: Adding qcm fence return status

 CREDITS | 7 +
 MAINTAINERS | 8 +
 drivers/Kconfig | 2 +
 drivers/gpu/Makefile | 1 +
 drivers/gpu/drm/radeon/Makefile | 1 +
 drivers/gpu/drm/radeon/cik.c | 156 +--
 drivers/gpu/drm/radeon/cikd.h | 51 +-
 drivers/gpu/drm/radeon/radeon.h | 9 +
 drivers/gpu/drm/radeon/radeon_device.c | 32 +
 drivers/gpu/drm/radeon/radeon_drv.c | 6 +
 drivers/gpu/drm/radeon/radeon_kfd.c | 630 ++++++++++
 drivers/gpu/drm/radeon/radeon_kms.c | 9 +
 drivers/gpu/hsa/Kconfig | 20 +
 drivers/gpu/hsa/Makefile | 1 +
 drivers/gpu/hsa/radeon/Makefile | 12 +
 drivers/gpu/hsa/radeon/cik_int.h | 50 +
 drivers/gpu/hsa/radeon/cik_mqds.h | 250 ++++
 drivers/gpu/hsa/radeon/cik_regs.h | 220 ++++
 drivers/gpu/hsa/radeon/kfd_aperture.c | 123 ++
 drivers/gpu/hsa/radeon/kfd_chardev.c | 530 +++++++++
 drivers/gpu/hsa/radeon/kfd_crat.h | 294 +++++
 drivers/gpu/hsa/radeon/kfd_device.c | 244 ++++
 drivers/gpu/hsa/radeon/kfd_device_queue_manager.c | 981 ++++++++++++++++
 drivers/gpu/hsa/radeon/kfd_device_queue_manager.h | 101 ++
 drivers/gpu/hsa/radeon/kfd_doorbell.c | 242 ++++
 drivers/gpu/hsa/radeon/kfd_interrupt.c | 177 +++
 drivers/gpu/hsa/radeon/kfd_kernel_queue.c | 305 +++++
 drivers/gpu/hsa/radeon/kfd_kernel_queue.h | 66 ++
 drivers/gpu/hsa/radeon/kfd_module.c | 130 +++
 drivers/gpu/hsa/radeon/kfd_mqd_manager.c | 290 +++++
 drivers/gpu/hsa/radeon/kfd_mqd_manager.h | 54 +
 drivers/gpu/hsa/radeon/kfd_packet_manager.c | 488 ++++++++
 drivers/gpu/hsa/radeon/kfd_pasid.c | 97 ++
 drivers/gpu/hsa/radeon/kfd_pm4_headers.h | 682 +++++++++++
 drivers/gpu/hsa/radeon/kfd_pm4_opcodes.h | 107 ++
 drivers/gpu/hsa/radeon/kfd_priv.h | 475 ++++++++
 drivers/gpu/hsa/radeon/kfd_process.c | 391 +++++++
 drivers/gpu/hsa/radeon/kfd_process_queue_manager.c | 343 ++++++
 drivers/gpu/hsa/radeon/kfd_queue.c | 109 ++
 drivers/gpu/hsa/radeon/kfd_scheduler.h | 72 ++
 drivers/gpu/hsa/radeon/kfd_topology.c | 1207 ++++++++++++++++++++
 drivers/gpu/hsa/radeon/kfd_topology.h | 168 +++
 drivers/gpu/hsa/radeon/kfd_vidmem.c | 97 ++
 include/linux/mm_types.h | 14 +
 include/linux/radeon_kfd.h | 106 ++
 include/uapi/linux/kfd_ioctl.h | 133 +++
 mm/rmap.c | 8 +-
 47 files changed, 9402 insertions(+), 97 deletions(-)
 create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
 create mode 100644 drivers/gpu/hsa/Kconfig
 create mode 100644 drivers/gpu/hsa/Makefile
 create mode 100644 drivers/gpu/hsa/radeon/Makefile
 create mode 100644 drivers/gpu/hsa/radeon/cik_int.h
 create mode 100644 drivers/gpu/hsa/radeon/cik_mqds.h
 create mode 100644 drivers/gpu/hsa/radeon/cik_regs.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_aperture.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_chardev.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_crat.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_device.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_device_queue_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_device_queue_manager.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_doorbell.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_interrupt.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_kernel_queue.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_kernel_queue.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_module.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_mqd_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_mqd_manager.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_packet_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_pasid.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_pm4_headers.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_pm4_opcodes.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_priv.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_process.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_process_queue_manager.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_queue.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_scheduler.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_topology.c
 create mode 100644 drivers/gpu/hsa/radeon/kfd_topology.h
 create mode 100644 drivers/gpu/hsa/radeon/kfd_vidmem.c
 create mode 100644 include/linux/radeon_kfd.h
 create mode 100644 include/uapi/linux/kfd_ioctl.h
--
1.9.1
On 11.07.2014 23:18, Jerome Glisse wrote:
On Thu, Jul 10, 2014 at 10:51:29PM +0000, Gabbay, Oded wrote:
On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
This patch set implements a Heterogeneous System Architecture (HSA) driver for radeon-family GPUs.
These are just quick comments on a few things. Given the size of this, people will need time to review it.
I am not a fan of using sysfs to discover the topology.
So a general comment is that you need to collapse many patches; things like 58, which fixes kernel style, should be squashed, i.e. fix all the previous patches that had broken style.
Even worse are things like 71, removing code you just added a few patches earlier in the patchset.
Not quite, the code was added in patch 11.
This means that if I start reviewing in patch order I might spend time on code that is later deleted/modified/fixed, i.e. time I spend understanding and reading some code might just be wasted.
The quick answer is that you are of course right, but having said that, I think it would not be simple at all to do that squashing. I squashed what I could, and I can probably do a little more (like 58), but the major problem is that one of the main modules of the driver - the scheduler (QCM) - was completely re-written (patches 46-53). Therefore, from patch 1 to 53 we use the old scheduler code, and from 54 we use the new QCM (and the old scheduler code was completely removed at 71). So I could maybe squash 71 into 54, but that won't help you much, I think.
So, the current advice I can give is to completely ignore the following files because they do not exist in the final commit:
- kfd_sched_cik_static.c (stopped using in 54)
- kfd_registers.c (stopped using in 81 because we moved all register writes to radeon)
- kfd_hw_pointer_store.c (alive from 46 to 67)
- kfd_hw_pointer_store.h (alive from 46 to 67)
Oded
OK, I'll try, but this is driving me crazy; I am juggling between the fully applied patchset and the individual patches, going back and forth trying to find which patch is the one I want to comment on. Not to mention that, after that, people would need to ack bad patches just because they know a later good patch fixes the badness.
This patchset can be reduced to fewer than 10 big patches. I would first do core patches that touch core things like the iommu and are preparatory to the patchset. Then kfd patches that do not touch anything in radeon but implement the skeleton and core infrastructure. Then radeon-specific patches.
This leads me to the design: all the core kfd code should be in the hsa directory, and should be helpers, while all the radeon-specific bits should most likely be part of the radeon gfx kernel module. I am not against a radeon_kfd, but really I do not see the point. There should be an hsa.ko like there is a drm.ko.
Yeah totally agree with Jerome here.
Maybe I can explain a bit further what the difference in the design is. First of all, for other good examples see not only the DRM infrastructure, but also V4L and other Linux driver subsystems.
In those subsystems it's not the helper functions that control the driver, but instead the driver that controls the helper functions. E.g. if HSA wants to be a new subsystem with a standardized IOCTL interface, it should provide functions that assist drivers with implementing that interface, instead of implementing it itself and then trying to talk to the drivers for the necessary resources/operations.
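Just to illustrate the direction (a rough, made-up sketch; none of these names exist in the patchset), the core module would only own the char device and ioctl plumbing, and the driver would plug its hardware callbacks into it and stay in control of its own resources:

/* hypothetical "core HSA" side (hsa.ko): owns /dev/hsa and the ioctl
 * dispatch, knows nothing about the underlying hardware */
struct hsa_device_ops {
	int (*create_queue)(void *drvdata, struct hsa_queue *q);
	int (*destroy_queue)(void *drvdata, unsigned int queue_id);
};

struct hsa_device *hsa_device_register(struct device *parent,
				       const struct hsa_device_ops *ops,
				       void *drvdata);
void hsa_device_unregister(struct hsa_device *dev);

/* hypothetical driver side (radeon.ko): supplies the hw-specific ops */
static const struct hsa_device_ops radeon_hsa_ops = {
	.create_queue	= radeon_hsa_create_queue,
	.destroy_queue	= radeon_hsa_destroy_queue,
};

/* e.g. from radeon_driver_load_kms(): */
rdev->hsa_dev = hsa_device_register(rdev->dev, &radeon_hsa_ops, rdev);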
Coming back to the structure of the patches, one of the very first patches should be the IOCTL definition a driver needs to implement to be HSA conformant.
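Something along these lines (a completely made-up skeleton just to show the shape; this is not the actual kfd_ioctl.h and the field names are invented):

/* hypothetical uapi header the drivers would have to implement */
#include <linux/types.h>
#include <linux/ioctl.h>

struct hsa_create_queue_args {
	__u64 ring_base_address;	/* userspace ring buffer */
	__u32 ring_size;		/* in bytes */
	__u32 gpu_id;			/* topology node to run on */
	__u32 queue_id;			/* out: assigned by the kernel */
	__u32 doorbell_offset;		/* out: mmap offset of the doorbell page */
};

#define HSA_IOCTL_MAGIC		'H'
#define HSA_IOC_CREATE_QUEUE	_IOWR(HSA_IOCTL_MAGIC, 0x01, struct hsa_create_queue_args)
#define HSA_IOC_DESTROY_QUEUE	_IOW(HSA_IOCTL_MAGIC, 0x02, __u32)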
I think it would be a good idea to split out changes to core structures like the IOMMU into their own separately submitted patches, with just a note that this functionality is needed by the following "core HSA" and "HSA for radeon" patchsets.
Regards, Christian.
Also, I think it would be beneficial to rename kfd to hsa, because it is the most common name and the name that will bring valid hits in a web search.
Each of the patches should have a proper commit message that is not too cryptic, without being too chatty either.
Cheers, Jérôme
I will come back with individual patch comments and also general remarks.
Cheers, Jérôme
On Sat, Jul 12, 2014 at 11:24:49AM +0200, Christian König wrote:
Hm, so the hsa part is a completely new driver/subsystem, not just an additional ioctl tacked onto radeon? The history of drm is littered with "generic" ioctls that turned out to be useful for exactly one driver. Which is why _all_ the command submission is now done with driver-private ioctls.
I'd be quite a bit surprised if that suddenly works differently, so before we bless a generic hsa interface I really want to see some implementation from a different vendor (e.g. nvidia or intel) using the same ioctls. Otherwise we just repeat history, and I'm not terribly inclined to keep on cleaning up cruft forever - one drm legacy is enough ;-)
Jesse is the guy from our side to talk to about this. -Daniel
On Sat, Jul 12, 2014 at 01:10:32PM +0200, Daniel Vetter wrote:
I am not worried about that side; the HSA foundation has pretty strict guidelines on what is HSA-compliant hardware, i.e. the hw needs to understand the pm4 packet format of radeon (well, a small subset of it). But of course this requires HSA-compliant hardware, and from the member list I am guessing ARM Mali, ImgTech, Qualcomm, ... so unless Intel and NVidia join HSA you will not see it for their hardware.
So yes, for once the same ioctls would apply to different hardware. The only thing that is different is the shader ISA. The hsafoundation site has some PDFs explaining all that, but someone thought that slideshare would be a good idea; personally I would not register on any website just to get the PDF.
So to sum up, I am OK with having a new device file that presents a uniform set of ioctls. It would actually be a lot easier for userspace: just open this fixed device file and ask for the list of compliant hardware.
Then the radeon kernel driver would register itself as a provider, so all the ioctl decoding/marshalling would be shared, which makes sense.
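From the userspace side that model could look something like this (again just a sketch; the node path, ioctl number and struct are invented for illustration and are not part of this patchset):

/* hypothetical userspace view of a single shared device node */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

struct hsa_node_info {		/* made up for illustration */
	unsigned int node_id;
	unsigned int vendor_id;
	unsigned int device_id;
};

#define HSA_IOC_GET_NODE_INFO _IOWR('H', 0x10, struct hsa_node_info)

int main(void)
{
	struct hsa_node_info info = { .node_id = 0 };
	int fd = open("/dev/hsa", O_RDWR);	/* one fixed node, not per-driver */

	if (fd < 0)
		return 1;
	if (ioctl(fd, HSA_IOC_GET_NODE_INFO, &info) == 0)
		printf("node 0: vendor 0x%x device 0x%x\n",
		       info.vendor_id, info.device_id);
	close(fd);
	return 0;
}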
Cheers, Jérôme Glisse
On Sat, Jul 12, 2014 at 6:49 PM, Jerome Glisse j.glisse@gmail.com wrote:
There's also the other side, namely that preparing the cp ring in userspace and submitting the entire pile through a doorbell to the hw scheduler isn't really HSA-exclusive. And for a solid platform with seamless gpu/cpu integration that means we need standard ways to set gpu context priorities and to get at useful stats like the gpu time used by a given context.
To get there I guess intel/nvidia would need to reuse the hsa subsystem with the command submission adjusted a bit. Kinda like drm, where kms and buffer sharing are common and cs is driver-specific. -Daniel
On Sun, Jul 13, 2014 at 11:42:58AM +0200, Daniel Vetter wrote:
The HSA module would be for HSA-compliant hardware, and thus hardware would need to follow the HSA specification, which again is pretty clear on what the hardware needs to provide. So if Intel and NVidia want to join HSA I am sure they would be welcome, the more the merrier :)
So I would not block the HSA kernel ioctl design in order to please non-HSA hardware, especially if at this point in time neither Intel nor NVidia can share anything concrete on the design and how these things should be set up for their hardware.
When Intel or NVidia present their own API they should provide their own set of ioctls through their own platform.
Cheers, Jérôme Glisse
On Sun, 13 Jul 2014 12:40:32 -0400 j.glisse at gmail.com (Jerome Glisse) wrote:
Yeah, things are different enough that a uniform ioctl doesn't make sense. If/when all the vendors decide on a single standard, we can use that, but until then I don't see a nice way to share our doorbell & submission scheme with HSA, and I assume nvidia is the same.
Using HSA as a basis for non-HSA systems seems like it would add a lot of complexity, since non-HSA hardware would have to intercept the queue writes and manage the submission requests, etc., as bytecodes in the kernel driver, or maybe in a shim layer library that wraps that stuff.
Probably not worth the effort given that the command sets themselves are all custom as well, driven by specific user-level drivers like GL, CL, and libva.
On Wed, Jul 23, 2014 at 04:57:48PM -0700, Jesse Barnes wrote:
Well, I know that - drm also has the split between shared management stuff like prime and driver-private cmd submission. I still think that some common interfaces would benefit us. I want things like gputop (and also perf counters and all that) to work the same way, with the same tooling, on all svm/gpgpu stuff. So a shared namespace/create for svm contexts, or something like that. -Daniel
On Fri, 2014-07-11 at 17:18 -0400, Jerome Glisse wrote:
On Thu, Jul 10, 2014 at 10:51:29PM +0000, Gabbay, Oded wrote:
On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
This patch set implements a Heterogeneous System Architecture (HSA) driver for radeon-family GPUs.
This is just quick comments on few things. Given size of this, people will need to have time to review things.
HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share system resources more effectively via HW features including shared pageable memory, userspace-accessible work queues, and platform-level atomics. In addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea Islands family of GPUs also performs HW-level validation of commands passed in through the queues (aka rings). The code in this patch set is intended to serve both as a sample driver for other HSA-compatible hardware devices and as a production driver for radeon-family processors. The code is architected to support multiple CPUs each with connected GPUs, although the current implementation focuses on a single Kaveri/Berlin APU, and works alongside the existing radeon kernel graphics driver (kgd). AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware functionality between HSA compute and regular gfx/compute (memory, interrupts, registers), while other functionality has been added specifically for HSA compute (hw scheduler for virtualized compute rings). All shared hardware is owned by the radeon graphics driver, and an interface between kfd and kgd allows the kfd to make use of those shared resources, while HSA-specific functionality is managed directly by kfd by submitting packets into an HSA-specific command queue (the "HIQ"). During kfd module initialization a char device node (/dev/kfd) is created (surviving until module exit), with ioctls for queue creation & management, and data structures are initialized for managing HSA device topology. The rest of the initialization is driven by calls from the radeon kgd at the following points :
- radeon_init (kfd_init)
- radeon_exit (kfd_fini)
- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
- radeon_driver_unload_kms (kfd_device_fini)
During the probe and init processing per-device data structures are established which connect to the associated graphics kernel driver. This information is exposed to userspace via sysfs, along with a version number allowing userspace to determine if a topology change has occurred while it was reading from sysfs. The interface between kfd and kgd also allows the kfd to request buffer management services from kgd, and allows kgd to route interrupt requests to kfd code since the interrupt block is shared between regular graphics/compute and HSA compute subsystems in the GPU. The kfd code works with an open source usermode library ("libhsakmt") which is in the final stages of IP review and should be published in a separate repo over the next few days. The code operates in one of three modes, selectable via the sched_policy module parameter :
- sched_policy=0 uses a hardware scheduler running in the MEC
block within CP, and allows oversubscription (more queues than HW slots)
- sched_policy=1 also uses HW scheduling but does not allow
oversubscription, so create_queue requests fail when we run out of HW slots
- sched_policy=2 does not use HW scheduling, so the driver
manually assigns queues to HW slots by programming registers The "no HW scheduling" option is for debug & new hardware bringup only, so has less test coverage than the other options. Default in the current code is "HW scheduling without oversubscription" since that is where we have the most test coverage but we expect to change the default to "HW scheduling with oversubscription" after further testing. This effectively removes the HW limit on the number of work queues available to applications. Programs running on the GPU are associated with an address space through the VMID field, which is translated to a unique PASID at access time via a set of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16) are partitioned (under control of the radeon kgd) between current gfx/compute and HSA compute, with each getting 8 in the current code. The VMID-to-PASID mapping registers are updated by the HW scheduler when used, and by driver code if HW scheduling is not being used. The Sea Islands compute queues use a new "doorbell" mechanism instead of the earlier kernel-managed write pointer registers. Doorbells use a separate BAR dedicated for this purpose, and pages within the doorbell aperture are mapped to userspace (each page mapped to only one user address space). Writes to the doorbell aperture are intercepted by GPU hardware, allowing userspace code to safely manage work queues (rings) without requiring a kernel call for every ring update. First step for an application process is to open the kfd device. Calls to open create a kfd "process" structure only for the first thread of the process. Subsequent open calls are checked to see if they are from processes using the same mm_struct and, if so, don't do anything. The kfd per-process data lives as long as the mm_struct exists. Each mm_struct is associated with a unique PASID, allowing the IOMMUv2 to make userspace process memory accessible to the GPU. Next step is for the application to collect topology information via sysfs. This gives userspace enough information to be able to identify specific nodes (processors) in subsequent queue management calls. Application processes can create queues on multiple processors, and processors support queues from multiple processes.
I am not a fan of using sysfs to discover topology.
At this point the application can create work queues in userspace memory and pass them through the usermode library to kfd to have them mapped onto HW queue slots so that commands written to the queues can be executed by the GPU. Queue operations specify a processor node, and so the bulk of this code is device-specific. Written by John Bridgman John.Bridgman@amd.com
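As an illustration of the userspace flow the cover letter describes (open /dev/kfd, pick a node from the sysfs topology, create a queue, then ring the doorbell), here is a rough C sketch. The ioctl number, the argument struct and the doorbell mapping details are placeholders rather than the actual kfd_ioctl.h definitions, and in practice this sequence is wrapped by libhsakmt rather than open-coded by applications.

/* Illustrative only: the ioctl number, struct layout and mapping details below are
 * placeholders, not the real definitions from include/uapi/linux/kfd_ioctl.h. */
#include <fcntl.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct fake_create_queue_args {		/* placeholder for the real create-queue args */
	uint64_t ring_base_address;	/* userspace ring buffer, allocated by the app */
	uint64_t ring_size;
	uint32_t gpu_id;		/* node chosen from the sysfs topology */
	uint32_t queue_id;		/* out */
	uint64_t doorbell_offset;	/* out: mmap offset for this queue's doorbell page */
};
#define FAKE_IOC_CREATE_QUEUE _IOWR('K', 1, struct fake_create_queue_args)

int main(void)
{
	int kfd = open("/dev/kfd", O_RDWR);		/* 1. open the kfd char device */
	if (kfd < 0)
		return 1;

	/* 2. topology discovery: read the sysfs nodes exported by the kfd topology
	 *    code (paths omitted here) to pick a gpu_id for the next step.         */

	void *ring = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);	/* 3. user-owned ring buffer */
	if (ring == MAP_FAILED)
		return 1;

	struct fake_create_queue_args args = {
		.ring_base_address = (uintptr_t)ring,
		.ring_size = 4096,
		.gpu_id = 1,
	};
	if (ioctl(kfd, FAKE_IOC_CREATE_QUEUE, &args) < 0)	/* 4. map ring onto a HW slot */
		return 1;

	/* 5. map this queue's doorbell page; from now on submitting work is a plain
	 *    store to the doorbell, with no system call per ring update.            */
	volatile uint32_t *doorbell = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED,
					   kfd, args.doorbell_offset);
	if (doorbell == MAP_FAILED)
		return 1;

	/* ... write packets into 'ring', then publish the new write pointer ... */
	doorbell[0] = 1;				/* illustrative write-pointer update */

	close(kfd);
	return 0;
}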
So the general comment is that you need to collapse many patches. Things like patch 58, which fixes kernel style, should be collapsed, i.e. fix all the previous patches that have broken style. Even worse are things like patch 71, removing code you just added a few patches earlier in the patchset.
Not quite, the code was added on patch 11.
This means that if I start reviewing in patch order I might spend time on code that is later deleted/modified/fixed, i.e. the time I spend understanding and reading some code might just be wasted.
Quick answer is that you are of course right, but having said that, I think it would not be simple at all to do that squashing. I squashed what I could, and probably I can do a little more (like 58), but the major problem is that one of the main modules of the driver - the scheduler (QCM) - was completely re-written (patches 46-53). Therefore, from patch 1 to 53 we use the old scheduler code, and from 54 we use the new QCM (and the old scheduler code was completely removed at 71). So I could maybe squash 71 into 54, but that won't help you much, I think. So, the current advice I can give is to completely ignore the following files because they do not exist in the final commit:
- kfd_sched_cik_static.c (stopped using in 54)
- kfd_registers.c (stopped using in 81 because we moved all
register writes to radeon)
- kfd_hw_pointer_store.c (alive from 46 to 67)
- kfd_hw_pointer_store.h (alive from 46 to 67)
Oded
OK, I'll try, but this is driving me crazy. I am juggling between the fully applied patchset and the individual patches, going back and forth trying to find which patch is the one I want to comment on. Not even mentioning that, after that, people would need to ack a bad patch just because they know a later good patch fixes the badness.
This patchset can be reduced to fewer than 10 big patches. I would first do the core patches that touch core things like iommu and are preparatory to the patchset, then kfd patches that do not touch anything in radeon but implement the skeleton and core infrastructure, and then the radeon-specific patches.
Hi Jerome, I thought about what you said and I would like to make a suggestion. I believe I can restructure the patchset into 10-20 patches, organized logically, which will make it easier to review.
The downside is that you will lose all references to the current patchset and hence, all the review work you (and other people) have done so far will be somewhat wasted. Also, the entire git history of the driver development will be lost (at least externally).
John suggested something a bit different in the email thread of PATCH 07/83. He said to squash everything from patch 2 to 54 (including 71) and leave the remaining patches as they were, with maybe some additional small squashing.
Please tell me which option you prefer:
A. Continue with the review of the current patchset.
B. Go with my suggestion.
C. Go with John's suggestion.
I'm going tomorrow (Sunday) to prepare options B & C, but I need your input before publishing anything. So, if you could reply by Monday morning my time, that would be great as it will allow me to publish (if chosen) the new set by Monday morning, EST.
And thanks again for the time you dedicate to this review. This is highly appreciated.
Oded
This leads me to the design: all the core kfd helpers should be in the hsa directory and should be helpers, while all the radeon-specific bits should most likely be part of the radeon gfx kernel module. I am not against a radeon_kfd, but really I do not see the point. There should be a hsa.ko like there is a drm.ko.
Also I think it would be beneficial to rename kfd to hsa, because it is the most common name and the one that will bring valid hits in a web search.
Each of the patches should have a proper commit message that is not too cryptic, without being too chatty either.
Cheers, Jérôme
I will come back with individual patch comments and also general remarks. Cheers, Jérôme
Alexey Skidanov (4): hsa/radeon: 32-bit processes support hsa/radeon: NULL pointer dereference bug workaround hsa/radeon: HSA64/HSA32 modes support hsa/radeon: Add local memory to topology Andrew Lewycky (3): hsa/radeon: Make binding of process to device permanent hsa/radeon: Implement hsaKmtSetMemoryPolicy mm: Change timing of notification to IOMMUs about a page to be invalidated Ben Goz (20): hsa/radeon: Add queue and hw_pointer_store modules hsa/radeon: Add support allocating kernel doorbells hsa/radeon: Add mqd_manager module hsa/radeon: Add kernel queue support for KFD hsa/radeon: Add module parameter of scheduling policy hsa/radeon: Add packet manager module hsa/radeon: Add process queue manager module hsa/radeon: Add device queue manager module hsa/radeon: Switch to new queue scheduler hsa/radeon: Add IOCTL for update queue hsa/radeon: Queue Management integration with Memory Management hsa/radeon: update queue fault handling hsa/radeon: fixing a bug to support 32b processes hsa/radeon: Fix number of pipes per ME hsa/radeon: Removing hw pointer store module hsa/radeon: Adding some error messages hsa/radeon: Fixing minor issues with kernel queues (DIQ) drm/radeon: Add register access functions to kfd2kgd interface hsa/radeon: Eliminating all direct register accesses drm/radeon: Remove lock functions from kfd2kgd interface Evgeny Pinchuk (9): hsa/radeon: fix the OEMID assignment in kfd_topology drm/radeon: extending kfd-kgd interface hsa/radeon: implementing IOCTL for clock counters drm/radeon: adding synchronization for GRBM GFX hsa/radeon: fixing clock counters bug drm/radeon: Extending kfd interface hsa/radeon: Adding max clock speeds to topology hsa/radeon: Alternating the source of max clock hsa/radeon: Exclusive access for perf. counters Michael Varga (1): hsa/radeon: debugging print statements Oded Gabbay (45): mm: Add kfd_process pointer to mm_struct drm/radeon: reduce number of free VMIDs and pipes in KV drm/radeon: Report doorbell configuration to kfd drm/radeon: Add radeon <--> kfd interface drm/radeon: Add kfd-->kgd interface to get virtual ram size drm/radeon: Add kfd-->kgd interfaces of memory allocation/mapping drm/radeon: Add kfd-->kgd interface of locking srbm_gfx_cntl register drm/radeon: Add calls to initialize and finalize kfd from radeon hsa/radeon: Add code base of hsa driver for AMD's GPUs hsa/radeon: Add initialization and unmapping of doorbell aperture hsa/radeon: Add scheduler code hsa/radeon: Add kfd mmap handler hsa/radeon: Add 2 new IOCTL to kfd, CREATE_QUEUE and DESTROY_QUEUE hsa/radeon: Update MAINTAINERS and CREDITS files hsa/radeon: Add interrupt handling module hsa/radeon: Add the isr function of the KFD scehduler hsa/radeon: Handle deactivation of queues using interrupts hsa/radeon: Enable interrupts in KFD scheduler hsa/radeon: Enable/Disable KFD interrupt module hsa/radeon: Add interrupt callback function to kgd2kfd interface hsa/radeon: Add kgd-->kfd interfaces for suspend and resume drm/radeon: Add calls to suspend and resume of kfd driver drm/radeon/cik: Don't touch int of pipes 1-7 drm/radeon/cik: Call kfd isr function hsa/radeon: Fix memory size allocated for HPD hsa/radeon: Fix list of supported devices hsa/radeon: Fix coding style in cik_int.h hsa/radeon: Print ioctl commnad only in debug mode hsa/radeon: Print ISR info only in debug mode hsa/radeon: Workaround for a bug in amd_iommu hsa/radeon: Eliminate warnings in compilation hsa/radeon: Various kernel styling fixes hsa/radeon: Rearrange structures in kfd_ioctl.h hsa/radeon: change 
another pr_info to pr_debug hsa/radeon: Fix timeout calculation in sync_with_hw hsa/radeon: Update module information and version hsa/radeon: Update module version to 0.6.0 hsa/radeon: Fix initialization of sh_mem registers hsa/radeon: Fix compilation warnings hsa/radeon: Remove old scheduler code hsa/radeon: Static analysis (smatch) fixes hsa/radeon: Check oversubscription before destroying runlist hsa/radeon: Don't verify cksum when parsing CRAT table hsa/radeon: Update module version to 0.6.1 hsa/radeon: Update module version to 0.6.2 Yair Shachar (1): hsa/radeon: Adding qcm fence return status CREDITS | 7 + MAINTAINERS | 8 + drivers/Kconfig | 2 + drivers/gpu/Makefile | 1 + drivers/gpu/drm/radeon/Makefile | 1 + drivers/gpu/drm/radeon/cik.c | 156 +-- drivers/gpu/drm/radeon/cikd.h | 51 +- drivers/gpu/drm/radeon/radeon.h | 9 + drivers/gpu/drm/radeon/radeon_device.c | 32 + drivers/gpu/drm/radeon/radeon_drv.c | 6 + drivers/gpu/drm/radeon/radeon_kfd.c | 630 ++++++++++ drivers/gpu/drm/radeon/radeon_kms.c | 9 + drivers/gpu/hsa/Kconfig | 20 + drivers/gpu/hsa/Makefile | 1 + drivers/gpu/hsa/radeon/Makefile | 12 + drivers/gpu/hsa/radeon/cik_int.h | 50 + drivers/gpu/hsa/radeon/cik_mqds.h | 250 ++++ drivers/gpu/hsa/radeon/cik_regs.h | 220 ++++ drivers/gpu/hsa/radeon/kfd_aperture.c | 123 ++ drivers/gpu/hsa/radeon/kfd_chardev.c | 530 +++++++++ drivers/gpu/hsa/radeon/kfd_crat.h | 294 +++++ drivers/gpu/hsa/radeon/kfd_device.c | 244 ++++ drivers/gpu/hsa/radeon/kfd_device_queue_manager.c | 981 ++++++++++++++++ drivers/gpu/hsa/radeon/kfd_device_queue_manager.h | 101 ++ drivers/gpu/hsa/radeon/kfd_doorbell.c | 242 ++++ drivers/gpu/hsa/radeon/kfd_interrupt.c | 177 +++ drivers/gpu/hsa/radeon/kfd_kernel_queue.c | 305 +++++ drivers/gpu/hsa/radeon/kfd_kernel_queue.h | 66 ++ drivers/gpu/hsa/radeon/kfd_module.c | 130 +++ drivers/gpu/hsa/radeon/kfd_mqd_manager.c | 290 +++++ drivers/gpu/hsa/radeon/kfd_mqd_manager.h | 54 + drivers/gpu/hsa/radeon/kfd_packet_manager.c | 488 ++++++++ drivers/gpu/hsa/radeon/kfd_pasid.c | 97 ++ drivers/gpu/hsa/radeon/kfd_pm4_headers.h | 682 +++++++++++ drivers/gpu/hsa/radeon/kfd_pm4_opcodes.h | 107 ++ drivers/gpu/hsa/radeon/kfd_priv.h | 475 ++++++++ drivers/gpu/hsa/radeon/kfd_process.c | 391 +++++++ drivers/gpu/hsa/radeon/kfd_process_queue_manager.c | 343 ++++++ drivers/gpu/hsa/radeon/kfd_queue.c | 109 ++ drivers/gpu/hsa/radeon/kfd_scheduler.h | 72 ++ drivers/gpu/hsa/radeon/kfd_topology.c | 1207 ++++++++++++++++++++ drivers/gpu/hsa/radeon/kfd_topology.h | 168 +++ drivers/gpu/hsa/radeon/kfd_vidmem.c | 97 ++ include/linux/mm_types.h | 14 + include/linux/radeon_kfd.h | 106 ++ include/uapi/linux/kfd_ioctl.h | 133 +++ mm/rmap.c | 8 +- 47 files changed, 9402 insertions(+), 97 deletions(-) create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c create mode 100644 drivers/gpu/hsa/Kconfig create mode 100644 drivers/gpu/hsa/Makefile create mode 100644 drivers/gpu/hsa/radeon/Makefile create mode 100644 drivers/gpu/hsa/radeon/cik_int.h create mode 100644 drivers/gpu/hsa/radeon/cik_mqds.h create mode 100644 drivers/gpu/hsa/radeon/cik_regs.h create mode 100644 drivers/gpu/hsa/radeon/kfd_aperture.c create mode 100644 drivers/gpu/hsa/radeon/kfd_chardev.c create mode 100644 drivers/gpu/hsa/radeon/kfd_crat.h create mode 100644 drivers/gpu/hsa/radeon/kfd_device.c create mode 100644 drivers/gpu/hsa/radeon/kfd_device_queue_manager.c create mode 100644 drivers/gpu/hsa/radeon/kfd_device_queue_manager.h create mode 100644 drivers/gpu/hsa/radeon/kfd_doorbell.c create mode 100644 
drivers/gpu/hsa/radeon/kfd_interrupt.c create mode 100644 drivers/gpu/hsa/radeon/kfd_kernel_queue.c create mode 100644 drivers/gpu/hsa/radeon/kfd_kernel_queue.h create mode 100644 drivers/gpu/hsa/radeon/kfd_module.c create mode 100644 drivers/gpu/hsa/radeon/kfd_mqd_manager.c create mode 100644 drivers/gpu/hsa/radeon/kfd_mqd_manager.h create mode 100644 drivers/gpu/hsa/radeon/kfd_packet_manager.c create mode 100644 drivers/gpu/hsa/radeon/kfd_pasid.c create mode 100644 drivers/gpu/hsa/radeon/kfd_pm4_headers.h create mode 100644 drivers/gpu/hsa/radeon/kfd_pm4_opcodes.h create mode 100644 drivers/gpu/hsa/radeon/kfd_priv.h create mode 100644 drivers/gpu/hsa/radeon/kfd_process.c create mode 100644 drivers/gpu/hsa/radeon/kfd_process_queue_manager.c create mode 100644 drivers/gpu/hsa/radeon/kfd_queue.c create mode 100644 drivers/gpu/hsa/radeon/kfd_scheduler.h create mode 100644 drivers/gpu/hsa/radeon/kfd_topology.c create mode 100644 drivers/gpu/hsa/radeon/kfd_topology.h create mode 100644 drivers/gpu/hsa/radeon/kfd_vidmem.c create mode 100644 include/linux/radeon_kfd.h create mode 100644 include/uapi/linux/kfd_ioctl.h -- 1.9.1
On Sat, Jul 12, 2014 at 09:55:49PM +0000, Gabbay, Oded wrote:
Hi Jerome, I thought about what you said and I would like to make a suggestion. I believe I can restructure the patchset into 10-20 patches, organized logically, which will make it easier to review.
The downside is that you will lose all references to the current patchset and hence, all the review work you (and other people) have done so far will be somewhat wasted. Also, the entire git history of the driver development will be lost (at least externally).
This history does not matter much.
Squashing patches together will not be enough; what really needs to happen is an hsa module, like the drm module, with the radeon module registering itself into this hsa module.
First patch: the whole ioctl set of hsa. Second patch: core hsa code and an ioctl skeleton that returns ENODEV or the like. Third patch: ioctl marshaling that parses ioctl parameters and handles the common ioctls. Fourth patch: add the driver hookup to core hsa and define the driver API. Fifth patch: implement said API for radeon.
Yes, most of those 5 patches will be massive additions of single files, but really that is what makes sense here.
But before those 5 patches you most likely want to do some groundwork on the iommu front.
Cheers, Jérôme
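To make the proposed layering concrete, here is a minimal sketch of such a core hsa module: it owns the device node and the common ioctl entry point (returning -ENODEV until a driver registers), and a GPU driver such as radeon registers an ops table into it. All identifiers are invented for illustration; none of this is code from the patch set.

/* Illustrative skeleton only; all names are invented and not taken from this patch set. */
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/uaccess.h>

#define HSA_IOC_CREATE_QUEUE 0x01	/* placeholder; a real uapi header would use _IOWR() */

struct hsa_device_ops {			/* what a HW driver (e.g. radeon) must implement */
	int (*create_queue)(void *drvdata, void __user *uargs);
};

static const struct hsa_device_ops *hsa_ops;	/* one device only, for brevity */
static void *hsa_drvdata;
static DEFINE_MUTEX(hsa_lock);

/* Called by the GPU driver when it finds an HSA-capable device. */
int hsa_register_device(const struct hsa_device_ops *ops, void *drvdata)
{
	mutex_lock(&hsa_lock);
	hsa_ops = ops;
	hsa_drvdata = drvdata;
	mutex_unlock(&hsa_lock);
	return 0;
}
EXPORT_SYMBOL_GPL(hsa_register_device);

/* Common ioctl marshaling lives in the core; with no driver registered it is just -ENODEV. */
static long hsa_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	long ret = -ENODEV;

	mutex_lock(&hsa_lock);
	if (hsa_ops && cmd == HSA_IOC_CREATE_QUEUE)
		ret = hsa_ops->create_queue(hsa_drvdata, (void __user *)arg);
	mutex_unlock(&hsa_lock);
	return ret;
}

static const struct file_operations hsa_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= hsa_ioctl,
};

static struct miscdevice hsa_dev = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "hsa",		/* creates /dev/hsa */
	.fops	= &hsa_fops,
};

static int __init hsa_init(void) { return misc_register(&hsa_dev); }
static void __exit hsa_exit(void) { misc_deregister(&hsa_dev); }
module_init(hsa_init);
module_exit(hsa_exit);
MODULE_LICENSE("GPL");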
-----Original Message-----
From: Jerome Glisse [mailto:j.glisse@gmail.com]
Sent: Saturday, July 12, 2014 11:56 PM
To: Gabbay, Oded
Cc: linux-kernel@vger.kernel.org; Bridgman, John; Deucher, Alexander; Lewycky, Andrew; joro@8bytes.org; akpm@linux-foundation.org; dri-devel@lists.freedesktop.org; airlied@linux.ie; oded.gabbay@gmail.com
Subject: Re: [PATCH 00/83] AMD HSA kernel driver
Squashing patches together will not be enough; what really needs to happen is an hsa module, like the drm module, with the radeon module registering itself into this hsa module.
Hi Jerome;
Agree completely that this needs to end up as a cross-vendor core framework, but our HSAF partners are not ready to focus on standardizing the low-level implementation at this point. That left us with two options -- (a) push out a framework now in the hope that other vendors would use it in the future (not a trivial thing given the wide range of hardware that can be used), or (b) push out a minimal driver with only the IOCTLs we felt could be maintained going forward, while (and this is the important part) making sure we avoided code and/or interfaces which would get in the way of making a standard "core HSA" framework in the future. Our feeling was that option (b) had a much greater chance of success and a much lower risk of codifying things in the kernel that ended up not being used in the future.
Sorry for not including this in the initial writeup -- it was getting too long already but maybe this wasn't the right thing to leave out.
The current implementation (specifically the interface between kfd and kgd) is designed to support multiple gfx drivers for AMD hardware, and we tried to avoid any IOCTLs which would need to be replaced with multi-vendor-under-a-single-framework IOCTLs in the future, so I think we are meeting the abstract goal of not pushing something upstream that will leave litter in the kernel we all get stuck with later... in fact I really do believe this approach has the lowest risk of doing that.
That said, at minimum we should probably identify points in the code where a "core vs HW-specific" break would live, and comment that in the code so we have an RFC of sorts for that but without trying to implement and upstream a framework in advance of sufficient cross-vendor participation.
Would something like that be sufficient for now?
Thanks, JB
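One lightweight way to mark those break points without building the framework yet is to keep every HW-specific service behind a single ops table and annotate that table as the future boundary. The sketch below is only an illustration of that idea, with invented names; it is not the kfd2kgd interface from the patches.

/* Sketch only -- not the kfd2kgd interface from these patches. The idea is simply to
 * keep every HW-specific service behind one ops table and mark that table as the
 * future "core HSA vs vendor driver" break point. */
#include <linux/types.h>

/*
 * CORE-vs-HW BOUNDARY: code above this table is meant to become vendor-neutral
 * "core HSA"; everything reachable only through it stays in the GPU driver.
 */
struct hsa_hw_ops {
	/* buffer management services provided by the graphics driver */
	int  (*alloc_mem)(void *gpu_priv, size_t size, void **mem, u64 *gpu_addr);
	void (*free_mem)(void *gpu_priv, void *mem);

	/* queue/slot programming, whether via the HW scheduler or direct registers */
	int  (*load_hqd)(void *gpu_priv, void *mqd, u32 pipe, u32 queue);
	int  (*destroy_hqd)(void *gpu_priv, u32 pipe, u32 queue);

	/* interrupt routing from the shared interrupt block */
	void (*set_interrupt_handler)(void *gpu_priv, void (*isr)(void *ctx), void *ctx);
};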
On Sun, Jul 13, 2014 at 03:34:12PM +0000, Bridgman, John wrote:
Not quite, the code was added in patch 11.
This means that if I start reviewing following the patch order, I might spend time on code that is later deleted/modified/fixed, i.e. time I spend understanding and reading some code might just be wasted.
Quick answer is that you are of course right, but having said that, I think it would not be simple at all to do that squashing. I squashed what I could, and probably I can do a little more (like 58), but the major problem is that one of the main modules of the driver - the scheduler (QCM) - was completely re-written (patches 46-53). Therefore, from patch 1 to 53 we use the old scheduler code, and from 54 we use the new QCM (and the old scheduler code was completely removed in 71). So I could maybe squash 71 into 54, but that won't help you much, I think. So, the current advice I can give is to completely ignore the following files because they do not exist in the final commit:
- kfd_sched_cik_static.c (stopped using in 54)
- kfd_registers.c (stopped using in 81 because we moved all
register writes to radeon)
- kfd_hw_pointer_store.c (alive from 46 to 67)
- kfd_hw_pointer_store.h (alive from 46 to 67)
Oded
OK, I'll try, but this is driving me crazy; I am juggling between the fully applied patchset and individual patches, going back and forth trying to find which patch is the one I want to comment on. Not to mention that after that, people would need to ack a bad patch just because they know a later good patch fixes the badness.
This patchset can be reduced to fewer than 10 big patches. I would first do core patches that touch core things like iommu and are preparatory to the patchset. Then kfd patches that do not touch anything in radeon but implement the skeleton and core infrastructure. Then radeon-specific patches.
Hi Jerome, I thought about what you said and I would like to make a suggestion. I believe I can restructure the patchset into 10-20 patches, organized logically, which will be easier to review.
The downside is that you will lose all references to the current patchset and hence, all the review work you (and other people) have done so far will be somewhat wasted. Also, the entire git history of the driver development will be lost (at least externally).
This history does not matter much.
John suggested something a bit different in the email thread of PATCH 07/83. He said to squash everything from patch 2 to 54 (including 71) and leave the remaining patches as they were, with maybe some additional small squashing.
Please tell me which option you prefer:
A. Continue with the review of the current patchset.
B. Go with my suggestion.
C. Go with John's suggestion.
I'm going tomorrow (Sunday) to prepare options B & C, but I need your input before publishing anything. So, if you could reply by Monday morning my time, that would be great as it will allow me to publish (if chosen) the new set by Monday morning, EST.
And thanks again for the time you dedicate to this review. This is highly appreciated.
Oded
Squashing patches together will not be enough; what really needs to happen is an hsa module, like the drm module, with the radeon module registering itself into this hsa module.
Hi Jerome;
Agree completely that this needs to end up as a cross-vendor core framework, but our HSAF partners are not ready to focus on standardizing the low-level implementation at this point. That left us with two options -- (a) push out a framework now in the hope that other vendors would use it in the future (not a trivial thing given the wide range of hardware that can be used), or (b) push out a minimal driver with only the IOCTLs we felt could be maintained going forward, while (and this is the important part) making sure we avoided code and/or interfaces which would get in the way of making a standard "core HSA" framework in the future. Our feeling was that option (b) had a much greater chance of success and a much lower risk of codifying things in the kernel which would end up not being used.
Sorry for not including this in the initial writeup -- it was getting too long already but maybe this wasn't the right thing to leave out.
The current implementation (specifically the interface between kfd and kgd) is designed to support multiple gfx drivers for AMD hardware, and we tried to avoid any IOCTLs which would need to be replaced with multi-vendor-under-a-single-framework IOCTLs in the future, so I think we are meeting the abstract goal of not pushing something upstream that will leave litter in the kernel we all get stuck with later... in fact I really do believe this approach has the lowest risk of doing that.
That said, at a minimum we should probably identify points in the code where a "core vs HW-specific" break would live, and comment that in the code so we have an RFC of sorts for it, without trying to implement and upstream a framework in advance of sufficient cross-vendor participation.
Would something like that be sufficient for now?
Thanks, JB
So I need to go over the HSA specifics once more, but from memory the doorbell and the ring buffer with pm4 packet format are mandatory, and thus the same ioctl can be used to set this up no matter what hardware is behind it.
I would rather avoid having temporary ioctls. Common ioctls can be designed with enough flexibility for future changes. The first field of each ioctl can be a version field describing the structure version.
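To make the versioning idea concrete, here is a minimal sketch of an argument structure with the version field first; all names are hypothetical and not the actual kfd ABI:

/*
 * Sketch only: one way to make a version field the first member of every
 * ioctl argument struct so the layout can grow later. Hypothetical names.
 */
#include <linux/types.h>

struct hsa_ioctl_header {
        __u32 struct_version;   /* layout revision of the structure that follows */
        __u32 struct_size;      /* sizeof() as seen by userspace, for forward compat */
};

struct hsa_get_device_info_args {
        struct hsa_ioctl_header hdr;    /* always first */
        __u32 gpu_id;                   /* in: which device */
        __u32 max_queues;               /* out: filled by the kernel */
        /*
         * New fields can only be appended; the kernel checks
         * hdr.struct_version and hdr.struct_size before touching them.
         */
};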
I can understand the reluctance to enforce some framework so as not to bother your partners, but this is the Linux kernel, and if a hardware manufacturer wants to have their hardware supported they will have to abide by whatever rules the kernel community decides.
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
Cheers, Jérôme
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
While this obviously makes device enumeration much easier, it's still quite a hard break with Unix traditions. Essentially we now expose all devices of one kind through a single device node instead of creating independent nodes for each physical or logical device.
What makes it even worse is that we want to expose different drivers through the same device node.
Because of this, any effort of a system administrator to limit access to HSA is reduced to an on/off decision. It's simply not possible any more to apply simple file system access semantics to individual hardware devices.
Just imagine you are an administrator with a bunch of different compute cards in a system and you want to restrict access to one of them because it's faulty or has a security problem or something like this. Or you have several hardware devices and want to assign each of them to a distinct container.
Just some thoughts, Christian.
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
Do we have a userspace to exercise this interface so we can see how such a thing would look?
Dave.
On Tue, Jul 15, 2014 at 02:35:19PM +1000, Dave Airlie wrote:
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
I think at this time this is unlikely to get into 3.17. But Christian has a point about having multiple device files. So something like /dev/hsa/*
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
My understanding is that you create a queue, and each queue is associated with a device. You can create several queues for the same device and have different priorities between queues.
By the way, queue here means a ring buffer that understands a common set of pm4 packets.
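A hypothetical userspace sketch of that model -- two queues on the same device with different priorities. The device path, ioctl number and structure layout are assumptions made up for this illustration, not the driver's actual interface:

/*
 * Illustration only: create two queues on one device, each with its own
 * priority. Everything below (path, ioctl, struct) is hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct hsa_create_queue_args {
        unsigned int struct_version;    /* version field first, as suggested above */
        unsigned int gpu_id;            /* same device for both queues */
        unsigned long long ring_base;   /* userspace ring buffer (mmap'ed elsewhere) */
        unsigned int ring_size;
        unsigned int priority;          /* per-queue priority */
        unsigned int queue_id;          /* out: filled by the kernel */
};

#define HSA_IOC_CREATE_QUEUE _IOWR('H', 0x01, struct hsa_create_queue_args)

int main(void)
{
        struct hsa_create_queue_args high = { .struct_version = 1, .gpu_id = 1, .priority = 15 };
        struct hsa_create_queue_args low  = { .struct_version = 1, .gpu_id = 1, .priority = 0 };
        int fd = open("/dev/kfd", O_RDWR);

        if (fd < 0)
                return 1;
        /* two queues on the same device, different priorities */
        if (ioctl(fd, HSA_IOC_CREATE_QUEUE, &high) < 0 ||
            ioctl(fd, HSA_IOC_CREATE_QUEUE, &low) < 0)
                perror("create_queue");
        close(fd);
        return 0;
}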
Do we have a userspace to exercise this interface so we can see how such a thing would look?
I think we need to wait a bit before freezing and accepting the kernel api, and see enough userspace bits to be comfortable. Moreover, if AMD wants a common API for HSA, I also think that they at the very least need their HSA partners to make public comments on the kernel API.
Cheers, Jérôme
-----Original Message----- From: Dave Airlie [mailto:airlied@gmail.com] Sent: Tuesday, July 15, 2014 12:35 AM To: Christian König Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
Understood and agree. In this case though this isn't a cross-vendor interface designed by a secret committee, it's supposed to be more of an inoffensive little single-vendor interface designed *for* a secret committee. I'm hoping that's better ;)
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
Three classes of "multiple":

1. Single CPU with IOMMUv2 and multiple GPUs:
   - all devices accessible via /dev/kfd
   - topology information identifies CPU + GPUs; each has a "node ID" at the top of the userspace API and a "global ID" at the user/kernel interface (don't think we've implemented the CPU part yet though)
   - userspace builds a snapshot from sysfs info & exposes it to the HSAIL runtime, which in turn exposes the "standard" API
   - kfd sets up the ATC aperture so GPUs can access system RAM via IOMMUv2 (fast for APU, relatively less so for dGPU over PCIE)
   - to-be-added memory operations allow allocation & residency control (within existing gfx driver limits) of buffers in VRAM & carved-out system RAM
   - queue operations specify a node ID to the userspace library, which translates it to a "global ID" before calling kfd (a sketch of this translation follows the list)

2. Multiple CPUs connected via fabric (eg HyperTransport), each with 0 or more GPUs:
   - topology information exposes CPUs & GPUs, along with affinity info showing what is connected to what
   - everything else works as in (1) above

3. Multiple CPUs not connected via fabric (eg a blade server), each with 0 or more GPUs:
   - no attempt to cover this with HSA topology; each CPU and its associated GPUs are accessed independently via separate /dev/kfd instances
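As a minimal illustration of the node-ID-to-global-ID step referenced in the list above, here is a sketch of what the userspace library might do; the structure, array and function names are invented for the example:

/*
 * Hypothetical sketch of the node-ID to "global ID" translation done by
 * the userspace library before calling kfd.
 */
#include <stddef.h>
#include <stdint.h>

struct topo_node {
        uint32_t node_id;       /* index exposed to the HSAIL runtime */
        uint32_t global_id;     /* kernel-level ID read from the sysfs snapshot */
};

/* snapshot built earlier by the userspace library from the sysfs topology */
static struct topo_node snapshot[16];
static size_t snapshot_len;

static int node_to_global_id(uint32_t node_id, uint32_t *global_id)
{
        for (size_t i = 0; i < snapshot_len; i++) {
                if (snapshot[i].node_id == node_id) {
                        *global_id = snapshot[i].global_id;
                        return 0;
                }
        }
        return -1;      /* unknown node: topology may have changed, re-read sysfs */
}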
Do we have a userspace to exercise this interface so we can see how such a thing would look?
Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for final approval this week
There's a separate test harness to exercise the userspace lib calls, haven't started IP review or sanitizing for that but legal stuff is done
Dave.
-----Original Message----- From: dri-devel [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Bridgman, John Sent: Tuesday, July 15, 2014 1:07 PM To: Dave Airlie; Christian König Cc: Lewycky, Andrew; linux-kernel@vger.kernel.org; dri- devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux- foundation.org Subject: RE: [PATCH 00/83] AMD HSA kernel driver
-----Original Message----- From: Dave Airlie [mailto:airlied@gmail.com] Sent: Tuesday, July 15, 2014 12:35 AM To: Christian König Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
Understood and agree. In this case though this isn't a cross-vendor interface designed by a secret committee, it's supposed to be more of an inoffensive little single-vendor interface designed *for* a secret committee. I'm hoping that's better ;)
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
Three classes of "multiple" :
- Single CPU with IOMMUv2 and multiple GPUs:
  - all devices accessible via /dev/kfd
  - topology information identifies CPU + GPUs; each has a "node ID" at the top of the userspace API and a "global ID" at the user/kernel interface (don't think we've implemented the CPU part yet though)
  - userspace builds a snapshot from sysfs info & exposes it to the HSAIL runtime, which in turn exposes the "standard" API
  - kfd sets up the ATC aperture so GPUs can access system RAM via IOMMUv2 (fast for APU, relatively less so for dGPU over PCIE)
  - to-be-added memory operations allow allocation & residency control (within existing gfx driver limits) of buffers in VRAM & carved-out system RAM
  - queue operations specify a node ID to the userspace library, which translates it to a "global ID" before calling kfd
- Multiple CPUs connected via fabric (eg HyperTransport), each with 0 or more GPUs:
  - topology information exposes CPUs & GPUs, along with affinity info showing what is connected to what
  - everything else works as in (1) above

This is probably a good point to stress that HSA topology is only intended as an OS-independent way of communicating system info up to higher levels of the HSA stack, not as a new and competing way to *manage* system properties inside Linux or any other OS.

- Multiple CPUs not connected via fabric (eg a blade server), each with 0 or more GPUs:
  - no attempt to cover this with HSA topology; each CPU and its associated GPUs are accessed independently via separate /dev/kfd instances
Do we have a userspace to exercise this interface so we can see how such a thing would look?
Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for final approval this week
There's a separate test harness to exercise the userspace lib calls, haven't started IP review or sanitizing for that but legal stuff is done
Dave.
On Tue, Jul 15, 2014 at 05:06:56PM +0000, Bridgman, John wrote:
From: Dave Airlie [mailto:airlied@gmail.com] Sent: Tuesday, July 15, 2014 12:35 AM To: Christian König Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
Understood and agree. In this case though this isn't a cross-vendor interface designed by a secret committee, it's supposed to be more of an inoffensive little single-vendor interface designed *for* a secret committee. I'm hoping that's better ;)
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
Three classes of "multiple" :
- Single CPU with IOMMUv2 and multiple GPUs:
- all devices accessible via /dev/kfd
- topology information identifies CPU + GPUs, each has "node ID" at top of userspace API, "global ID" at user/kernel interface
(don't think we've implemented CPU part yet though)
- userspace builds snapshot from sysfs info & exposes to HSAIL runtime, which in turn exposes the "standard" API
This is why I do not like the sysfs approach; it would be a lot nicer to have a device file per provider, so that hsail can listen for device file events and discover if hardware is vanishing or appearing. Periodically going over sysfs files is not the right way to do that.
- kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2 (fast for APU, relatively less so for dGPU over PCIE)
- to-be-added memory operations allow allocation & residency control (within existing gfx driver limits) of buffers in VRAM & carved-out system RAM
- queue operations specify a node ID to userspace library, which translates to "global ID" before calling kfd
- Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or more GPUs:
- topology information exposes CPUs & GPUs, along with affinity info showing what is connected to what
- everything else works as in (1) above
This is supposed to be part of HSA? This is a lot broader than I thought.
- Multiple CPUs not connected via fabric (eg a blade server) each with 0 or more GPUs
- no attempt to cover this with HSA topology, each CPU and associated GPUs is accessed independently via separate /dev/kfd instances
Do we have a userspace to exercise this interface so we can see how such a thing would look?
Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for final approval this week
There's a separate test harness to exercise the userspace lib calls, haven't started IP review or sanitizing for that but legal stuff is done
Dave.
-----Original Message----- From: Jerome Glisse [mailto:j.glisse@gmail.com] Sent: Tuesday, July 15, 2014 1:37 PM To: Bridgman, John Cc: Dave Airlie; Christian König; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On Tue, Jul 15, 2014 at 05:06:56PM +0000, Bridgman, John wrote:
From: Dave Airlie [mailto:airlied@gmail.com] Sent: Tuesday, July 15, 2014 12:35 AM To: Christian König Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
Understood and agree. In this case though this isn't a cross-vendor interface designed by a secret committee, it's supposed to be more of an inoffensive little single-vendor interface designed *for* a secret committee. I'm hoping that's better ;)
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
Three classes of "multiple" :
- Single CPU with IOMMUv2 and multiple GPUs:
- all devices accessible via /dev/kfd
- topology information identifies CPU + GPUs, each has "node ID" at
top of userspace API, "global ID" at user/kernel interface (don't think we've implemented CPU part yet though)
- userspace builds snapshot from sysfs info & exposes to HSAIL
runtime, which in turn exposes the "standard" API
This is why I do not like the sysfs approach; it would be a lot nicer to have a device file per provider, so that hsail can listen for device file events and discover if hardware is vanishing or appearing. Periodically going over sysfs files is not the right way to do that.
Agree that wouldn't be good. There's an event mechanism still to come - mostly for communicating fences and shader interrupts back to userspace, but also used for "device change" notifications, so no polling of sysfs.
- kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2
(fast for APU, relatively less so for dGPU over PCIE)
- to-be-added memory operations allow allocation & residency control
(within existing gfx driver limits) of buffers in VRAM & carved-out system RAM
- queue operations specify a node ID to userspace library, which
translates to "global ID" before calling kfd
- Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or
more GPUs:
- topology information exposes CPUs & GPUs, along with affinity info
showing what is connected to what
- everything else works as in (1) above
This is supposed to be part of HSA? This is a lot broader than I thought.
Yes although it can be skipped on most systems. We figured that topology needed to cover everything that would be handled by a single OS image, so in a NUMA system it would need to cover all the CPUs. I think that is still the right scope, do you agree ?
- Multiple CPUs not connected via fabric (eg a blade server) each
with 0 or more GPUs
- no attempt to cover this with HSA topology, each CPU and associated
GPUs is accessed independently via separate /dev/kfd instances
Do we have a userspace to exercise this interface so we can see how such a thing would look?
Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for final approval this week
There's a separate test harness to exercise the userspace lib calls, haven't started IP review or sanitizing for that but legal stuff is done
Dave.
On Tue, Jul 15, 2014 at 05:53:32PM +0000, Bridgman, John wrote:
From: Jerome Glisse [mailto:j.glisse@gmail.com] Sent: Tuesday, July 15, 2014 1:37 PM To: Bridgman, John Cc: Dave Airlie; Christian König; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On Tue, Jul 15, 2014 at 05:06:56PM +0000, Bridgman, John wrote:
From: Dave Airlie [mailto:airlied@gmail.com] Sent: Tuesday, July 15, 2014 12:35 AM To: Christian König Cc: Jerome Glisse; Bridgman, John; Lewycky, Andrew; linux- kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; Deucher, Alexander; akpm@linux-foundation.org Subject: Re: [PATCH 00/83] AMD HSA kernel driver
On 14 July 2014 18:37, Christian König deathsimple@vodafone.de wrote:
I vote for an HSA module that exposes the ioctls and is an intermediary with the kernel driver that handles the hardware. This gives a single point for HSA hardware, and yes, this enforces things on any hardware manufacturer. I am more than happy to tell them that this is it and nothing else if they want to get upstream.
I think we should still discuss this single point of entry a bit more.
Just to make it clear the plan is to expose all physical HSA capable devices through a single /dev/hsa device node to userspace.
This is why we don't design kernel interfaces in secret foundations, and expect anyone to like them.
Understood and agree. In this case though this isn't a cross-vendor interface designed by a secret committee, it's supposed to be more of an inoffensive little single-vendor interface designed *for* a secret committee. I'm hoping that's better ;)
So before we go any further, how is this stuff planned to work for multiple GPUs/accelerators?
Three classes of "multiple" :
- Single CPU with IOMMUv2 and multiple GPUs:
- all devices accessible via /dev/kfd
- topology information identifies CPU + GPUs, each has "node ID" at
top of userspace API, "global ID" at user/kernel interface (don't think we've implemented CPU part yet though)
- userspace builds snapshot from sysfs info & exposes to HSAIL
runtime, which in turn exposes the "standard" API
This is why I do not like the sysfs approach; it would be a lot nicer to have a device file per provider, so that hsail can listen for device file events and discover if hardware is vanishing or appearing. Periodically going over sysfs files is not the right way to do that.
Agree that wouldn't be good. There's an event mechanism still to come - mostly for communicating fences and shader interrupts back to userspace, but also used for "device change" notifications, so no polling of sysfs.
My point being: do not use sysfs, use /dev/hsa/device* and have hsail listen for file events on the /dev/hsa/ directory. hsail would be informed of new devices and of devices that are unloaded. It would do a first pass to open each device file and get their capabilities through standardized ioctls.
Though maybe sysfs is OK, given that cpu numa is exposed through sysfs.
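A minimal sketch of that directory-watching idea, assuming the proposed /dev/hsa/ layout; inotify is a standard interface, but its use here is only an illustration of the suggestion, not existing HSA code:

/*
 * Watch the (proposed) /dev/hsa directory for device files appearing or
 * vanishing. /dev/hsa is only the layout being discussed in this thread.
 */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
        char buf[4096];
        int fd = inotify_init1(0);

        if (fd < 0 || inotify_add_watch(fd, "/dev/hsa", IN_CREATE | IN_DELETE) < 0) {
                perror("inotify");
                return 1;
        }
        for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));

                if (len <= 0)
                        break;
                /* each event names a device file that appeared or vanished;
                 * a runtime would open new ones and query capabilities via ioctl */
                for (char *p = buf; p < buf + len; ) {
                        struct inotify_event *ev = (struct inotify_event *)p;

                        printf("%s %s\n", (ev->mask & IN_CREATE) ? "new:" : "gone:", ev->name);
                        p += sizeof(*ev) + ev->len;
                }
        }
        return 0;
}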
- kfd sets up ATC aperture so GPUs can access system RAM via IOMMUv2
(fast for APU, relatively less so for dGPU over PCIE)
- to-be-added memory operations allow allocation & residency control
(within existing gfx driver limits) of buffers in VRAM & carved-out system RAM
- queue operations specify a node ID to userspace library, which
translates to "global ID" before calling kfd
- Multiple CPUs connected via fabric (eg HyperTransport) each with 0 or
more GPUs:
- topology information exposes CPUs & GPUs, along with affinity info
showing what is connected to what
- everything else works as in (1) above
This is supposed to be part of HSA? This is a lot broader than I thought.
Yes although it can be skipped on most systems. We figured that topology needed to cover everything that would be handled by a single OS image, so in a NUMA system it would need to cover all the CPUs. I think that is still the right scope, do you agree ?
I think it is a bad idea to duplicate the cpu topology. I would rather have each device give its affinity against each cpu, and for the cpu just keep the existing kernel api that exposes this through sysfs, iirc.
- Multiple CPUs not connected via fabric (eg a blade server) each
with 0 or more GPUs
- no attempt to cover this with HSA topology, each CPU and associated
GPUs is accessed independently via separate /dev/kfd instances
Do we have a userspace to exercise this interface so we can see how such a thing would look?
Yes -- initial IP review done, legal stuff done, sanitizing WIP, hoping for final approval this week
There's a separate test harness to exercise the userspace lib calls, haven't started IP review or sanitizing for that but legal stuff is done
Dave.
On Tue, Jul 15, 2014 at 8:04 PM, Jerome Glisse j.glisse@gmail.com wrote:
Yes although it can be skipped on most systems. We figured that topology needed to cover everything that would be handled by a single OS image, so in a NUMA system it would need to cover all the CPUs. I think that is still the right scope, do you agree ?
I think it is a bad idea to duplicate the cpu topology. I would rather have each device give its affinity against each cpu, and for the cpu just keep the existing kernel api that exposes this through sysfs, iirc.
It's all there already if we fix up the hsa dev-node model to expose one dev node per underlying device instead of one for everything:
- cpus already expose the full numa topology in sysfs
- pci devices have a numa_node file in sysfs to display the link
- we can easily add similar stuff for platform devices on arm socs without pci devices
Then the only thing userspace needs to do is follow the device link in the hsa instance node in sysfs and we have all the information exposed. Iff we expose one hsa driver instance to userspace per physical device (which is the normal linux device driver model anyway).
I don't see a need to add anything hsa specific here at all (well maybe some description of the cache architecture on the hsa device itself, the spec seems to have provisions for that). -Daniel
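A tiny sketch of the existing sysfs interface Daniel refers to; the PCI address below is just an example device:

/*
 * Read the NUMA affinity of a PCI device straight from sysfs -- no
 * HSA-specific topology interface needed for this part.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/bus/pci/devices/0000:01:00.0/numa_node", "r");
        int node = -1;

        if (f) {
                if (fscanf(f, "%d", &node) != 1)
                        node = -1;
                fclose(f);
        }
        /* -1 means no NUMA affinity reported (e.g. a single-node system) */
        printf("numa_node: %d\n", node);
        return 0;
}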
On Wed, Jul 16, 2014 at 10:27:42AM +0200, Daniel Vetter wrote:
On Tue, Jul 15, 2014 at 8:04 PM, Jerome Glisse j.glisse@gmail.com wrote:
Yes although it can be skipped on most systems. We figured that topology needed to cover everything that would be handled by a single OS image, so in a NUMA system it would need to cover all the CPUs. I think that is still the right scope, do you agree ?
I think it is a bad idea to duplicate the cpu topology. I would rather have each device give its affinity against each cpu, and for the cpu just keep the existing kernel api that exposes this through sysfs, iirc.
It's all there already if we fix up the hsa dev-node model to expose one dev node per underlying device instead of one for everything:
- cpus already expose the full numa topology in sysfs
- pci devices have a numa_node file in sysfs to display the link
- we can easily add similar stuff for platform devices on arm socs
without pci devices.
Then the only thing userspace needs to do is follow the device link in the hsa instance node in sysfs and we have all the information exposed. Iff we expose one hsa driver instance to userspace per physical device (which is the normal linux device driver model anyway).
I don't see a need to add anything hsa specific here at all (well maybe some description of the cache architecture on the hsa device itself, the spec seems to have provisions for that). -Daniel
What is HSA-specific is the userspace command queue, in the form of a common ring buffer execution queue, all sharing a common packet format. So yes, I see a reason for an HSA class that provides common ioctls through one dev file per device. Note that I am not a fan of userspace command queues, given that linux ioctl overhead is small and having the kernel do the work would allow for a really "infinite" number of userspace contexts, while right now the limit is DOORBELL_APERTURE_SIZE/PAGE_SIZE.
No, the CPU should not be included, and neither should the numa topology of the device. And yes, all numa topology should use the existing kernel interface. I understand, however, that a second GPU-specific topology might make sense, i.e. if you have specialized links between some discrete GPUs.
So if Intel wants to join the HSA foundation fine, but unless you are ready to implement what is needed i do not see the value of forcing your wish on another group that is trying to standardize something.
Cheers, Jérôme
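For a rough sense of the DOORBELL_APERTURE_SIZE/PAGE_SIZE limit Jerome mentions above, here is a back-of-envelope sketch; the aperture size is an assumed example value, not the actual hardware figure:

/*
 * With one doorbell page per process, the aperture size caps the number of
 * userspace contexts. 8 MiB is an assumed example aperture, not real data.
 */
#include <stdio.h>

#define EXAMPLE_DOORBELL_APERTURE_SIZE  (8u * 1024u * 1024u)    /* assumed 8 MiB */
#define EXAMPLE_PAGE_SIZE               4096u

int main(void)
{
        unsigned int max_contexts =
                EXAMPLE_DOORBELL_APERTURE_SIZE / EXAMPLE_PAGE_SIZE;

        /* with these assumptions: 2048 processes can each map a doorbell page */
        printf("max userspace contexts: %u\n", max_contexts);
        return 0;
}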
On Wed, Jul 16, 2014 at 10:52:56AM -0400, Jerome Glisse wrote:
On Wed, Jul 16, 2014 at 10:27:42AM +0200, Daniel Vetter wrote:
On Tue, Jul 15, 2014 at 8:04 PM, Jerome Glisse j.glisse@gmail.com wrote:
Yes although it can be skipped on most systems. We figured that topology needed to cover everything that would be handled by a single OS image, so in a NUMA system it would need to cover all the CPUs. I think that is still the right scope, do you agree ?
I think it is a bad idea to duplicate the cpu topology. I would rather have each device give its affinity against each cpu, and for the cpu just keep the existing kernel api that exposes this through sysfs, iirc.
It's all there already if we fix up the hsa dev-node model to expose one dev node per underlying device instead of one for everything:
- cpus already expose the full numa topology in sysfs
- pci devices have a numa_node file in sysfs to display the link
- we can easily add similar stuff for platform devices on arm socs
without pci devices.
Then the only thing userspace needs to do is follow the device link in the hsa instance node in sysfs and we have all the information exposed. Iff we expose one hsa driver instance to userspace per physical device (which is the normal linux device driver model anyway).
I don't see a need to add anything hsa specific here at all (well maybe some description of the cache architecture on the hsa device itself, the spec seems to have provisions for that). -Daniel
What is HSA-specific is the userspace command queue, in the form of a common ring buffer execution queue, all sharing a common packet format. So yes, I see a reason for an HSA class that provides common ioctls through one dev file per device. Note that I am not a fan of userspace command queues, given that linux ioctl overhead is small and having the kernel do the work would allow for a really "infinite" number of userspace contexts, while right now the limit is DOORBELL_APERTURE_SIZE/PAGE_SIZE.
No, the CPU should not be included, and neither should the numa topology of the device. And yes, all numa topology should use the existing kernel interface. I understand, however, that a second GPU-specific topology might make sense, i.e. if you have specialized links between some discrete GPUs.
So if Intel wants to join the HSA foundation fine, but unless you are ready to implement what is needed i do not see the value of forcing your wish on another group that is trying to standardize something.
You're mixing up my replies ;-) This was really just a comment on the proposed hsa interfaces for exposing the topology - we already have all the stuff exposed in sysfs for cpus and pci devices, so exposing this again through a hsa specific interface doesn't make much sense imo.
What intel does or does not do is completely irrelevant to my comment. I.e. I've written the above with my drm hacker hat on, not with my intel hat on.
Cheers, Daniel
On Tue, Jul 15, 2014 at 7:53 PM, Bridgman, John John.Bridgman@amd.com wrote: [snip away the discussion about hsa device discover, I'm hijacking this thread just for the event/fence stuff here.]
... There's an event mechanism still to come - mostly for communicating fences and shader interrupts back to userspace, but also used for "device change" notifications, so no polling of sysfs.
That would be interesting. On i915 my plan is to internally use the recently added struct fence from Maarten. For the external interface for userspace that wants explicit control over fences I'm leaning towards polishing the android syncpt stuff (currently in staging). But in any case I _really_ want to avoid ending up with multiple different and incompatible explicit fencing interfaces on linux.
Adding relevant people. -Daniel
On Wed, Jul 16, 2014 at 10:21:14AM +0200, Daniel Vetter wrote:
On Tue, Jul 15, 2014 at 7:53 PM, Bridgman, John John.Bridgman@amd.com wrote: [snip away the discussion about hsa device discover, I'm hijacking this thread just for the event/fence stuff here.]
... There's an event mechanism still to come - mostly for communicating fences and shader interrupts back to userspace, but also used for "device change" notifications, so no polling of sysfs.
That would be interesting. On i915 my plan is to internally use the recently added struct fence from Maarten. For the external interface for userspace that wants explicit control over fences I'm leaning towards polishing the android syncpt stuff (currently in staging). But in any case I _really_ want to avoid ending up with multiple different and incompatible explicit fencing interfaces on linux.
I agree, and I'll say it stronger than that, we WILL NOT have different and incompatible fencing interfaces in the kernel. That way lies madness.
John, take a look at what is now in linux-next, it should provide what you need here, right?
thanks,
greg k-h