Instead of always falling back to memcpy_fromio() for any size, prefer using read{b,w,l}(). When reading struct members it's common to read integer fields one at a time. Going through memcpy_fromio() for each of them incurs a high penalty.
Employ a similar trick as __seqprop() by using _Generic() to generate only the specific call based on a type-compatible variable.
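As a rough userspace illustration of the _Generic() dispatch described above (not the kernel code itself): the sketch_* names are hypothetical, and the stand-in accessors use a fixed-size memcpy() only so the example builds outside the kernel, where readb()/readw()/readl()/readq() are not available.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace stand-ins for readb()/readw()/readl()/readq().
 * A fixed-size memcpy() keeps the sketch portable; the kernel helpers are
 * architecture-specific MMIO accessors.
 */
static inline uint8_t sketch_readb(const void *addr)
{
	uint8_t v;

	memcpy(&v, addr, sizeof(v));
	return v;
}

static inline uint16_t sketch_readw(const void *addr)
{
	uint16_t v;

	memcpy(&v, addr, sizeof(v));
	return v;
}

static inline uint32_t sketch_readl(const void *addr)
{
	uint32_t v;

	memcpy(&v, addr, sizeof(v));
	return v;
}

static inline uint64_t sketch_readq(const void *addr)
{
	uint64_t v;

	memcpy(&v, addr, sizeof(v));
	return v;
}

/*
 * Pick the accessor from the type of val. Only the matching association is
 * evaluated; there is no default, so any other type is a compile error.
 */
#define sketch_rd_io(val, addr) _Generic((val),		\
	uint8_t:  (val) = sketch_readb(addr),		\
	uint16_t: (val) = sketch_readw(addr),		\
	uint32_t: (val) = sketch_readl(addr),		\
	uint64_t: (val) = sketch_readq(addr))

/* Typed wrappers standing in for iosys_map_rd(map, offset, type). */
static inline uint16_t sketch_rd_u16(const unsigned char *base, size_t offset)
{
	uint16_t val;

	sketch_rd_io(val, base + offset);
	return val;
}

static inline uint32_t sketch_rd_u32(const unsigned char *base, size_t offset)
{
	uint32_t val;

	sketch_rd_io(val, base + offset);
	return val;
}
```

The type of the destination variable alone decides which accessor is emitted, so a u32 field read compiles down to a single 32-bit access instead of a generic copy loop.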
For a particular i915 workload producing GPU context switches, __get_engine_usage_record() is particularly hot since the engine usage is read from device local memory with dgfx, possibly multiple times since it's racy. Test execution time for this test shows a ~12.5% improvement with DG2:
Before:
	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07; median = 9.52548e+06; var = 526149;
After:
	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06; median = 8.33955e+06; var = 333113;
Other things attempted that didn't prove very useful:

1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop, comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it was before the conversion to iosys-map

(3) did give a small improvement (~3%), but doesn't seem to scale well to other similar cases in the driver.
Additional test by Chris Wilson using gem_create from igt with some changes to track object creation time. This happens to accidentally stress this code path:
Pre iosys_map conversion of engine busyness: lmem0: Creating 262144 4KiB objects took 59274.2ms
Unpatched: lmem0: Creating 262144 4KiB objects took 108830.2ms
With readl (this patch): lmem0: Creating 262144 4KiB objects took 61348.6ms
s/readl/READ_ONCE/ lmem0: Creating 262144 4KiB objects took 61333.2ms
So we do take a little bit more time than before the conversion, but that is due to other factors: bringing the READ_ONCE back would be as good as just doing this conversion.
v2:
  - Remove default from _Generic() - callers wanting to read more than u64 should use iosys_map_memcpy_from()
  - Add READ_ONCE() cases dereferencing the pointer when using system memory
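The system-memory path mentioned in the v2 notes can be approximated in userspace as below. This is only a sketch under stated assumptions: sketch_rd_sys() and sketch_sys_rd_u32() are hypothetical names, _Static_assert stands in for compiletime_assert(), and the volatile cast approximates what READ_ONCE() does for scalar types.

```c
#include <stdint.h>

/*
 * Hypothetical analogue of __iosys_map_rd_sys(): reject types wider than
 * 64 bits at compile time, then do one volatile load of exactly that type.
 * The volatile cast approximates READ_ONCE() for scalar types.
 */
#define sketch_rd_sys(val, vaddr, type) do {				\
	_Static_assert(sizeof(type) <= sizeof(uint64_t),		\
		       "Unsupported access size for sketch_rd_sys()");	\
	(val) = *(const volatile type *)(vaddr);			\
} while (0)

/*
 * Read a u32 that lives in ordinary (system) memory. The caller must pass
 * a pointer to a suitably aligned uint32_t object.
 */
static inline uint32_t sketch_sys_rd_u32(const void *vaddr)
{
	uint32_t val;

	sketch_rd_sys(val, vaddr, uint32_t);
	return val;
}
```

Because the access is a plain typed load rather than a byte copy, the pointer has to be aligned as the architecture requires, which is why the updated kernel-doc points unaligned users at iosys_map_memcpy_from().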
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
---
 include/linux/iosys-map.h | 45 +++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index 4b8406ee8bc4..f59dd00ed202 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -6,6 +6,7 @@
 #ifndef __IOSYS_MAP_H__
 #define __IOSYS_MAP_H__

+#include <linux/compiler_types.h>
 #include <linux/io.h>
 #include <linux/string.h>

@@ -333,6 +334,26 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 		memset(dst->vaddr + offset, value, len);
 }

+#ifdef CONFIG_64BIT
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
+	u64: val_ = readq(vaddr_iomem_)
+#else
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
+	u64: memcpy_fromio(&(val_), vaddr_iomem__, sizeof(u64))
+#endif
+
+#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
+	u8: val__ = readb(vaddr_iomem__),					\
+	u16: val__ = readw(vaddr_iomem__),					\
+	u32: val__ = readl(vaddr_iomem__),					\
+	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__))
+
+#define __iosys_map_rd_sys(val__, vaddr__, type__) ({				\
+	compiletime_assert(sizeof(type__) <= sizeof(u64),			\
+			   "Unsupported access size for __iosys_map_rd_sys()");	\
+	val__ = READ_ONCE(*((type__ *)vaddr__));				\
+})
+
 /**
  * iosys_map_rd - Read a C-type value from the iosys_map
  *
@@ -340,16 +361,21 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  * @offset__:	The offset from which to read
  * @type__:	Type of the value being read
  *
- * Read a C type value from iosys_map, handling possible un-aligned accesses to
- * the mapping.
+ * Read a C type value (u8, u16, u32 and u64) from iosys_map. For other types or
+ * if pointer may be unaligned (and problematic for the architecture supported),
+ * use iosys_map_memcpy_from().
  *
  * Returns:
  * The value read from the mapping.
  */
-#define iosys_map_rd(map__, offset__, type__) ({			\
-	type__ val;							\
-	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
-	val;								\
+#define iosys_map_rd(map__, offset__, type__) ({				\
+	type__ val;								\
+	if ((map__)->is_iomem) {						\
+		__iosys_map_rd_io(val, (map__)->vaddr_iomem + (offset__), type__);\
+	} else {								\
+		__iosys_map_rd_sys(val, (map__)->vaddr + (offset__), type__);	\
+	}									\
+	val;									\
 })

 /**
@@ -379,9 +405,10 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  *
  * Read a value from iosys_map considering its layout is described by a C struct
  * starting at @struct_offset__. The field offset and size is calculated and its
- * value read handling possible un-aligned memory accesses. For example: suppose
- * there is a @struct foo defined as below and the value ``foo.field2.inner2``
- * needs to be read from the iosys_map:
+ * value read. If the field access would incur in un-aligned access, then either
+ * iosys_map_memcpy_from() needs to be used or the architecture must support it.
+ * For example: suppose there is a @struct foo defined as below and the value
+ * ``foo.field2.inner2`` needs to be read from the iosys_map:
  *
  * .. code-block:: c
  *

Like was done for read, provide the equivalent for write. Even if current users are not in the hot path, this should future-proof it.
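The write side follows the same shape. A minimal userspace sketch, assuming hypothetical sketch_write*() stand-ins for writeb()/writew()/writel()/writeq() (implemented with memcpy() purely so the example builds outside the kernel):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-ins for writeb()/writew()/writel()/writeq(). */
static inline void sketch_writeb(uint8_t v, void *addr)
{
	memcpy(addr, &v, sizeof(v));
}

static inline void sketch_writew(uint16_t v, void *addr)
{
	memcpy(addr, &v, sizeof(v));
}

static inline void sketch_writel(uint32_t v, void *addr)
{
	memcpy(addr, &v, sizeof(v));
}

static inline void sketch_writeq(uint64_t v, void *addr)
{
	memcpy(addr, &v, sizeof(v));
}

/* Select the store width from the type of val; no default association. */
#define sketch_wr_io(val, addr) _Generic((val),		\
	uint8_t:  sketch_writeb((val), (addr)),		\
	uint16_t: sketch_writew((val), (addr)),		\
	uint32_t: sketch_writel((val), (addr)),		\
	uint64_t: sketch_writeq((val), (addr)))

/* Typed wrapper standing in for iosys_map_wr(map, offset, u16, val). */
static inline void sketch_wr_u16(unsigned char *base, size_t offset, uint16_t val)
{
	sketch_wr_io(val, base + offset);
}
```

Only the bytes of the selected width are touched, so neighbouring fields in the mapping are left alone.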
v2:
  - Remove default from _Generic() - callers wanting to write more than u64 should use iosys_map_memcpy_to()
  - Add WRITE_ONCE() cases dereferencing the pointer when using system memory
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
---
 include/linux/iosys-map.h | 42 ++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index f59dd00ed202..580e14cd360c 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -337,9 +337,13 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 #ifdef CONFIG_64BIT
 #define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
 	u64: val_ = readq(vaddr_iomem_)
+#define __iosys_map_wr_io_u64_case(val_, vaddr_iomem_)				\
+	u64: writeq(val_, vaddr_iomem_)
 #else
 #define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
 	u64: memcpy_fromio(&(val_), vaddr_iomem__, sizeof(u64))
+#define __iosys_map_wr_io_u64_case(val_, vaddr_iomem_)				\
+	u64: memcpy_toio(vaddr_iomem_, &(val_), sizeof(u64))
 #endif

 #define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
@@ -354,6 +358,19 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 	val__ = READ_ONCE(*((type__ *)vaddr__));				\
 })

+#define __iosys_map_wr_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
+	u8: writeb(val__, vaddr_iomem__),					\
+	u16: writew(val__, vaddr_iomem__),					\
+	u32: writel(val__, vaddr_iomem__),					\
+	__iosys_map_wr_io_u64_case(val__, vaddr_iomem__))
+
+#define __iosys_map_wr_sys(val__, vaddr__, type__) ({				\
+	compiletime_assert(sizeof(type__) <= sizeof(u64),			\
+			   "Unsupported access size for __iosys_map_wr_sys()");	\
+	WRITE_ONCE(*((type__ *)vaddr__), val__);				\
+})
+
+
 /**
  * iosys_map_rd - Read a C-type value from the iosys_map
  *
@@ -386,12 +403,17 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  * @type__:	Type of the value being written
  * @val__:	Value to write
  *
- * Write a C-type value to the iosys_map, handling possible un-aligned accesses
- * to the mapping.
+ * Write a C type value (u8, u16, u32 and u64) to the iosys_map. For other types
+ * or if pointer may be unaligned (and problematic for the architecture
+ * supported), use iosys_map_memcpy_to()
  */
-#define iosys_map_wr(map__, offset__, type__, val__) ({			\
-	type__ val = (val__);						\
-	iosys_map_memcpy_to(map__, offset__, &val, sizeof(val));	\
+#define iosys_map_wr(map__, offset__, type__, val__) ({				\
+	type__ val = (val__);							\
+	if ((map__)->is_iomem) {						\
+		__iosys_map_wr_io(val, (map__)->vaddr_iomem + (offset__), type__);\
+	} else {								\
+		__iosys_map_wr_sys(val, (map__)->vaddr + (offset__), type__);	\
+	}									\
 })

 /**
@@ -472,10 +494,12 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  * @field__:	Member of the struct to read
  * @val__:	Value to write
  *
- * Write a value to the iosys_map considering its layout is described by a C struct
- * starting at @struct_offset__. The field offset and size is calculated and the
- * @val__ is written handling possible un-aligned memory accesses. Refer to
- * iosys_map_rd_field() for expected usage and memory layout.
+ * Write a value to the iosys_map considering its layout is described by a C
+ * struct starting at @struct_offset__. The field offset and size is calculated
+ * and the @val__ is written. If the field access would incur in un-aligned
+ * access, then either iosys_map_memcpy_to() needs to be used or the
+ * architecture must support it. Refer to iosys_map_rd_field() for expected
+ * usage and memory layout.
  */
 #define iosys_map_wr_field(map__, struct_offset__, struct_type__, field__, val__) ({	\
 	struct_type__ *s;								\

On Fri, Jun 17, 2022 at 01:52:03AM -0700, Lucas De Marchi wrote:
Instead of always falling back to memcpy_fromio() for any size, prefer using read{b,w,l}(). When reading struct members it's common to read integer fields one at a time. Going through memcpy_fromio() for each of them incurs a high penalty.
Employ a similar trick as __seqprop() by using _Generic() to generate only the specific call based on a type-compatible variable.
For a particular i915 workload producing GPU context switches, __get_engine_usage_record() is particularly hot since the engine usage is read from device local memory with dgfx, possibly multiple times since it's racy. Test execution time for this test shows a ~12.5% improvement with DG2:
Before:
	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07; median = 9.52548e+06; var = 526149;
After:
	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06; median = 8.33955e+06; var = 333113;
Other things attempted that didn't prove very useful:

1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop, comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it was before the conversion to iosys-map

(3) did give a small improvement (~3%), but doesn't seem to scale well to other similar cases in the driver.
Additional test by Chris Wilson using gem_create from igt with some changes to track object creation time. This happens to accidentally stress this code path:
Pre iosys_map conversion of engine busyness: lmem0: Creating 262144 4KiB objects took 59274.2ms
Unpatched: lmem0: Creating 262144 4KiB objects took 108830.2ms
With readl (this patch): lmem0: Creating 262144 4KiB objects took 61348.6ms
s/readl/READ_ONCE/ lmem0: Creating 262144 4KiB objects took 61333.2ms
So we do take a little bit more time than before the conversion, but that is due to other factors: bringing the READ_ONCE back would be as good as just doing this conversion.
v2:
- Remove default from _Generic() - callers wanting to read more
than u64 should use iosys_map_memcpy_from()
- Add READ_ONCE() cases dereferencing the pointer when using system
memory
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
---
 include/linux/iosys-map.h | 45 +++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index 4b8406ee8bc4..f59dd00ed202 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -6,6 +6,7 @@
 #ifndef __IOSYS_MAP_H__
 #define __IOSYS_MAP_H__

+#include <linux/compiler_types.h>
 #include <linux/io.h>
 #include <linux/string.h>

@@ -333,6 +334,26 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 		memset(dst->vaddr + offset, value, len);
 }

+#ifdef CONFIG_64BIT
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
+	u64: val_ = readq(vaddr_iomem_)
+#else
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
+	u64: memcpy_fromio(&(val_), vaddr_iomem__, sizeof(u64))

I tested io/sys and forgot again to test it for 32-bit :(. This should fix the build for 32-bits:
diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index 580e14cd360c..f8bc052f8975 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -341,7 +341,7 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 	u64: writeq(val_, vaddr_iomem_)
 #else
 #define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
-	u64: memcpy_fromio(&(val_), vaddr_iomem__, sizeof(u64))
+	u64: memcpy_fromio(&(val_), vaddr_iomem_, sizeof(u64))
 #define __iosys_map_wr_io_u64_case(val_, vaddr_iomem_)				\
 	u64: memcpy_toio(vaddr_iomem_, &(val_), sizeof(u64))
 #endif

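For context on why only the 32-bit build broke: the misspelled name sits in the #else branch, which a 64-bit build never preprocesses, and a token in a macro body that is not one of that macro's own parameters is left as a literal identifier. A small userspace sketch of the corrected pattern follows; the sketch_* names are hypothetical and the UINTPTR_MAX check is only an assumed stand-in for CONFIG_64BIT.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for readl(). */
static inline uint32_t sketch_readl(const void *addr)
{
	uint32_t v;

	memcpy(&v, addr, sizeof(v));
	return v;
}

#if UINTPTR_MAX > 0xffffffffu	/* assumed stand-in for CONFIG_64BIT */
/* Hypothetical stand-in for readq(). */
static inline uint64_t sketch_readq(const void *addr)
{
	uint64_t v;

	memcpy(&v, addr, sizeof(v));
	return v;
}
#define sketch_u64_case(val_, addr_)					\
	uint64_t: (val_) = sketch_readq(addr_)
#else
/*
 * The body must refer to this macro's own parameter, addr_. Writing the
 * outer macro's name (addr__) here would be left unexpanded and fail to
 * compile, but only in builds that take this branch.
 */
#define sketch_u64_case(val_, addr_)					\
	uint64_t: (void)memcpy(&(val_), (addr_), sizeof(uint64_t))
#endif

/* The u64 association is spliced in by the helper macro above. */
#define sketch_rd(val__, addr__) _Generic((val__),			\
	uint32_t: (val__) = sketch_readl(addr__),			\
	sketch_u64_case(val__, addr__))

static inline uint64_t sketch_rd_u64(const void *addr)
{
	uint64_t val;

	sketch_rd(val, addr);
	return val;
}
```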
Lucas De Marchi
dri-devel@lists.freedesktop.org