Optimize performance of the fbdev console for the common case of software-based clearing and image blitting.
The commit description of each patch contains the results of a simple microbenchmark. I also tested the full patchset's effect on the console output by printing directory listings (i7-4790, FullHD, simpledrm, kernel with debugging).
time find /usr/share/doc -type f
In the unoptimized case:
real	0m6.173s
user	0m0.044s
sys	0m6.107s
With optimizations applied:
real	0m4.754s
user	0m0.044s
sys	0m4.698s
In the optimized case, printing the directory listing is ~25% faster than before.
In v2 of the patchset, after implementing Sam's suggestion to update cfb_imageblit() as well, it turns out that the compiled code in sys_imageblit() is still significantly slower than the CFB version. A fix is probably a larger task and would include architecture-specific changes. A new TODO item suggests investigating the performance of the various helpers and format-conversion functions in DRM and fbdev.
v3:
	* fix description of cfb_imageblit() patch (Pekka)
v2:
	* improve readability for sys_imageblit() (Gerd, Sam)
	* new TODO item for further optimization
Thomas Zimmermann (5):
  fbdev: Improve performance of sys_fillrect()
  fbdev: Improve performance of sys_imageblit()
  fbdev: Remove trailing whitespaces from cfbimgblt.c
  fbdev: Improve performance of cfb_imageblit()
  drm: Add TODO item for optimizing format helpers
 Documentation/gpu/todo.rst             |  22 +++++
 drivers/video/fbdev/core/cfbimgblt.c   | 107 ++++++++++++++++---------
 drivers/video/fbdev/core/sysfillrect.c |  16 +---
 drivers/video/fbdev/core/sysimgblt.c   |  49 ++++++++---
 4 files changed, 133 insertions(+), 61 deletions(-)
Improve the performance of sys_fillrect() by using word-aligned 32/64-bit mov instructions. While the code tried to implement this, the compiler failed to create fast instructions. The resulting binary instructions were even slower than cfb_fillrect(), which uses the same algorithm, but operates on I/O memory.
A microbenchmark measures the average number of CPU cycles for sys_fillrect() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging). The value for CFB is given as a reference.
sys_fillrect(), new:  26586 cycles
sys_fillrect(), old: 166603 cycles
cfb_fillrect():       41012 cycles
In the optimized case, sys_fillrect() is now ~6x faster than before and ~1.5x faster than the CFB implementation.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Reviewed-by: Javier Martinez Canillas javierm@redhat.com
Reviewed-by: Sam Ravnborg sam@ravnborg.org
---
 drivers/video/fbdev/core/sysfillrect.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)
diff --git a/drivers/video/fbdev/core/sysfillrect.c b/drivers/video/fbdev/core/sysfillrect.c
index 33ee3d34f9d2..bcdcaeae6538 100644
--- a/drivers/video/fbdev/core/sysfillrect.c
+++ b/drivers/video/fbdev/core/sysfillrect.c
@@ -50,19 +50,9 @@ bitfill_aligned(struct fb_info *p, unsigned long *dst, int dst_idx,
 
 		/* Main chunk */
 		n /= bits;
-		while (n >= 8) {
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			n -= 8;
-		}
-		while (n--)
-			*dst++ = pat;
+		memset_l(dst, pat, n);
+		dst += n;
+
 		/* Trailing bits */
 		if (last)
 			*dst = comp(pat, *dst, last);
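For readers without a kernel tree at hand, the "main chunk" change can be sketched in userspace C. This is only an illustration: fill_words() is a hypothetical stand-in for the kernel's memset_l(), and fill_main_chunk() is a made-up name that mirrors the two added lines (fill n words, then advance the pointer).

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's memset_l(): store the
 * word-sized pattern 'pat' into 'n' consecutive unsigned longs. */
static void fill_words(unsigned long *dst, unsigned long pat, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = pat;
}

/* Mirrors the patched main chunk: fill n whole words and return the
 * pointer just past them, where the trailing bits would be handled. */
static unsigned long *fill_main_chunk(unsigned long *dst, unsigned long pat,
				      size_t n)
{
	fill_words(dst, pat, n);
	return dst + n;
}
```

The point of the patch is that the single helper call gives the compiler (or an architecture-specific implementation) one well-defined fill to optimize, instead of a hand-unrolled loop it apparently handled poorly.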
Improve the performance of sys_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. The resulting binary code was even slower than the cfb_imageblit() helper, which uses the same algorithm, but operates on I/O memory.
A microbenchmark measures the average number of CPU cycles for sys_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging). The value for CFB is given as a reference.
sys_imageblit(), new: 25934 cycles
sys_imageblit(), old: 35944 cycles
cfb_imageblit():      30566 cycles
In the optimized case, sys_imageblit() is now ~30% faster than before and ~20% faster than cfb_imageblit().
v2:
	* move switch out of inner loop (Gerd)
	* remove test for alignment of dst1 (Sam)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Reviewed-by: Javier Martinez Canillas javierm@redhat.com
Acked-by: Sam Ravnborg sam@ravnborg.org
---
 drivers/video/fbdev/core/sysimgblt.c | 49 +++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 11 deletions(-)
diff --git a/drivers/video/fbdev/core/sysimgblt.c b/drivers/video/fbdev/core/sysimgblt.c
index a4d05b1b17d7..722c327a381b 100644
--- a/drivers/video/fbdev/core/sysimgblt.c
+++ b/drivers/video/fbdev/core/sysimgblt.c
@@ -188,23 +188,29 @@ static void fast_imageblit(const struct fb_image *image, struct fb_info *p,
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
-	u32 bit_mask, end_mask, eorx, shift;
+	u32 bit_mask, eorx;
 	const char *s = image->data, *src;
 	u32 *dst;
-	const u32 *tab = NULL;
+	const u32 *tab;
+	size_t tablen;
+	u32 colortab[16];
 	int i, j, k;
 
 	switch (bpp) {
 	case 8:
 		tab = fb_be_math(p) ? cfb_tab8_be : cfb_tab8_le;
+		tablen = 16;
 		break;
 	case 16:
 		tab = fb_be_math(p) ? cfb_tab16_be : cfb_tab16_le;
+		tablen = 4;
 		break;
 	case 32:
-	default:
 		tab = cfb_tab32;
+		tablen = 2;
 		break;
+	default:
+		return;
 	}
 
 	for (i = ppw-1; i--; ) {
@@ -218,19 +224,40 @@ static void fast_imageblit(const struct fb_image *image, struct fb_info *p,
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
+	for (i = 0; i < tablen; ++i)
+		colortab[i] = (tab[i] & eorx) ^ bgx;
+
 	for (i = image->height; i--; ) {
 		dst = dst1;
-		shift = 8;
 		src = s;
 
-		for (j = k; j--; ) {
-			shift -= ppw;
-			end_mask = tab[(*src >> shift) & bit_mask];
-			*dst++ = (end_mask & eorx) ^ bgx;
-			if (!shift) {
-				shift = 8;
-				src++;
+		switch (ppw) {
+		case 4: /* 8 bpp */
+			for (j = k; j; j -= 2, ++src) {
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
+			}
+			break;
+		case 2: /* 16 bpp */
+			for (j = k; j; j -= 4, ++src) {
+				*dst++ = colortab[(*src >> 6) & bit_mask];
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 2) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
+			}
+			break;
+		case 1: /* 32 bpp */
+			for (j = k; j; j -= 8, ++src) {
+				*dst++ = colortab[(*src >> 7) & bit_mask];
+				*dst++ = colortab[(*src >> 6) & bit_mask];
+				*dst++ = colortab[(*src >> 5) & bit_mask];
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 3) & bit_mask];
+				*dst++ = colortab[(*src >> 2) & bit_mask];
+				*dst++ = colortab[(*src >> 1) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
 			}
+			break;
 		}
 		dst1 += p->fix.line_length;
 		s += spitch;
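The two ideas in this patch (hoisting the color computation into a small table, and emitting all pixels of a source byte per iteration) can be compared in a userspace sketch for the 32 bpp case. The names expand_row_old() and expand_row_new() are hypothetical, tab32 is assumed to match the kernel's cfb_tab32, and the new loop uses a "j >= 8" guard, which differs from the patch's loop condition and assumes the width is a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to match cfb_tab32: mask for a clear bit and for a set bit. */
static const uint32_t tab32[2] = { 0x00000000, 0xffffffff };

/* Old shape: recompute (mask & eorx) ^ bg for every pixel and track the
 * bit position with a shift counter. */
static void expand_row_old(uint32_t *dst, const uint8_t *src, size_t width,
			   uint32_t fg, uint32_t bg)
{
	uint32_t eorx = fg ^ bg;
	unsigned int shift = 8;

	for (size_t j = width; j--; ) {
		shift -= 1;
		*dst++ = (tab32[(*src >> shift) & 1] & eorx) ^ bg;
		if (!shift) {
			shift = 8;
			src++;
		}
	}
}

/* New shape: precompute both colors once, then write the 8 pixels of
 * each source byte without any per-pixel bookkeeping. */
static void expand_row_new(uint32_t *dst, const uint8_t *src, size_t width,
			   uint32_t fg, uint32_t bg)
{
	uint32_t eorx = fg ^ bg;
	uint32_t colortab[2];

	for (size_t i = 0; i < 2; i++)
		colortab[i] = (tab32[i] & eorx) ^ bg;

	for (size_t j = width; j >= 8; j -= 8, ++src) {
		*dst++ = colortab[(*src >> 7) & 1];
		*dst++ = colortab[(*src >> 6) & 1];
		*dst++ = colortab[(*src >> 5) & 1];
		*dst++ = colortab[(*src >> 4) & 1];
		*dst++ = colortab[(*src >> 3) & 1];
		*dst++ = colortab[(*src >> 2) & 1];
		*dst++ = colortab[(*src >> 1) & 1];
		*dst++ = colortab[(*src >> 0) & 1];
	}
}
```

For widths that are multiples of 8, both functions should produce identical rows; only the amount of work per pixel differs.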
Fix coding style. No functional changes.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
---
 drivers/video/fbdev/core/cfbimgblt.c | 60 ++++++++++++++--------------
 1 file changed, 30 insertions(+), 30 deletions(-)
diff --git a/drivers/video/fbdev/core/cfbimgblt.c b/drivers/video/fbdev/core/cfbimgblt.c index a2bb276a8b24..01b01a279681 100644 --- a/drivers/video/fbdev/core/cfbimgblt.c +++ b/drivers/video/fbdev/core/cfbimgblt.c @@ -16,15 +16,15 @@ * must be laid out exactly in the same format as the framebuffer. Yes I know * their are cards with hardware that coverts images of various depths to the * framebuffer depth. But not every card has this. All images must be rounded - * up to the nearest byte. For example a bitmap 12 bits wide must be two - * bytes width. + * up to the nearest byte. For example a bitmap 12 bits wide must be two + * bytes width. * - * Tony: - * Incorporate mask tables similar to fbcon-cfb*.c in 2.4 API. This speeds + * Tony: + * Incorporate mask tables similar to fbcon-cfb*.c in 2.4 API. This speeds * up the code significantly. - * + * * Code for depths not multiples of BITS_PER_LONG is still kludgy, which is - * still processed a bit at a time. + * still processed a bit at a time. * * Also need to add code to deal with cards endians that are different than * the native cpu endians. I also need to deal with MSB position in the word. @@ -72,8 +72,8 @@ static const u32 cfb_tab32[] = { #define FB_WRITEL fb_writel #define FB_READL fb_readl
-static inline void color_imageblit(const struct fb_image *image, - struct fb_info *p, u8 __iomem *dst1, +static inline void color_imageblit(const struct fb_image *image, + struct fb_info *p, u8 __iomem *dst1, u32 start_index, u32 pitch_index) { @@ -92,7 +92,7 @@ static inline void color_imageblit(const struct fb_image *image, dst = (u32 __iomem *) dst1; shift = 0; val = 0; - + if (start_index) { u32 start_mask = ~fb_shifted_pixels_mask_u32(p, start_index, bswapmask); @@ -109,8 +109,8 @@ static inline void color_imageblit(const struct fb_image *image, val |= FB_SHIFT_HIGH(p, color, shift ^ bswapmask); if (shift >= null_bits) { FB_WRITEL(val, dst++); - - val = (shift == null_bits) ? 0 : + + val = (shift == null_bits) ? 0 : FB_SHIFT_LOW(p, color, 32 - shift); } shift += bpp; @@ -134,9 +134,9 @@ static inline void color_imageblit(const struct fb_image *image, } }
-static inline void slow_imageblit(const struct fb_image *image, struct fb_info *p, +static inline void slow_imageblit(const struct fb_image *image, struct fb_info *p, u8 __iomem *dst1, u32 fgcolor, - u32 bgcolor, + u32 bgcolor, u32 start_index, u32 pitch_index) { @@ -172,7 +172,7 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info * l--; color = (*s & (1 << l)) ? fgcolor : bgcolor; val |= FB_SHIFT_HIGH(p, color, shift ^ bswapmask); - + /* Did the bitshift spill bits to the next long? */ if (shift >= null_bits) { FB_WRITEL(val, dst++); @@ -191,16 +191,16 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
FB_WRITEL((FB_READL(dst) & end_mask) | val, dst); } - + dst1 += pitch; - src += spitch; + src += spitch; if (pitch_index) { dst2 += pitch; dst1 = (u8 __iomem *)((long __force)dst2 & ~(sizeof(u32) - 1)); start_index += pitch_index; start_index &= 32 - 1; } - + } }
@@ -212,9 +212,9 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info * * fix->line_legth is divisible by 4; * beginning and end of a scanline is dword aligned */ -static inline void fast_imageblit(const struct fb_image *image, struct fb_info *p, - u8 __iomem *dst1, u32 fgcolor, - u32 bgcolor) +static inline void fast_imageblit(const struct fb_image *image, struct fb_info *p, + u8 __iomem *dst1, u32 fgcolor, + u32 bgcolor) { u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel; u32 ppw = 32/bpp, spitch = (image->width + 7)/8; @@ -243,25 +243,25 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info * fgx |= fgcolor; bgx |= bgcolor; } - + bit_mask = (1 << ppw) - 1; eorx = fgx ^ bgx; k = image->width/ppw;
for (i = image->height; i--; ) { dst = (u32 __iomem *) dst1, shift = 8; src = s; - + for (j = k; j--; ) { shift -= ppw; end_mask = tab[(*src >> shift) & bit_mask]; FB_WRITEL((end_mask & eorx)^bgx, dst++); - if (!shift) { shift = 8; src++; } + if (!shift) { shift = 8; src++; } } dst1 += p->fix.line_length; s += spitch; } -} - +} + void cfb_imageblit(struct fb_info *p, const struct fb_image *image) { u32 fgcolor, bgcolor, start_index, bitstart, pitch_index = 0; @@ -292,13 +292,13 @@ void cfb_imageblit(struct fb_info *p, const struct fb_image *image) } else { fgcolor = image->fg_color; bgcolor = image->bg_color; - } - - if (32 % bpp == 0 && !start_index && !pitch_index && + } + + if (32 % bpp == 0 && !start_index && !pitch_index && ((width & (32/bpp-1)) == 0) && - bpp >= 8 && bpp <= 32) + bpp >= 8 && bpp <= 32) fast_imageblit(image, p, dst1, fgcolor, bgcolor); - else + else slow_imageblit(image, p, dst1, fgcolor, bgcolor, start_index, pitch_index); } else
On Wed, Feb 23, 2022 at 08:38:02PM +0100, Thomas Zimmermann wrote:
Fix coding style. No functional changes.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Acked-by: Sam Ravnborg sam@ravnborg.org
On 2/23/22 20:38, Thomas Zimmermann wrote:
Fix coding style. No functional changes.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Reviewed-by: Javier Martinez Canillas javierm@redhat.com
Best regards,
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
	* fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
---
 drivers/video/fbdev/core/cfbimgblt.c | 51 +++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 9 deletions(-)
diff --git a/drivers/video/fbdev/core/cfbimgblt.c b/drivers/video/fbdev/core/cfbimgblt.c
index 01b01a279681..7361cfabdd85 100644
--- a/drivers/video/fbdev/core/cfbimgblt.c
+++ b/drivers/video/fbdev/core/cfbimgblt.c
@@ -218,23 +218,29 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
-	u32 bit_mask, end_mask, eorx, shift;
+	u32 bit_mask, eorx;
 	const char *s = image->data, *src;
 	u32 __iomem *dst;
 	const u32 *tab = NULL;
+	size_t tablen;
+	u32 colortab[16];
 	int i, j, k;
 
 	switch (bpp) {
 	case 8:
 		tab = fb_be_math(p) ? cfb_tab8_be : cfb_tab8_le;
+		tablen = 16;
 		break;
 	case 16:
 		tab = fb_be_math(p) ? cfb_tab16_be : cfb_tab16_le;
+		tablen = 4;
 		break;
 	case 32:
-	default:
 		tab = cfb_tab32;
+		tablen = 2;
 		break;
+	default:
+		return;
 	}
 
 	for (i = ppw-1; i--; ) {
@@ -248,15 +254,42 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
-	for (i = image->height; i--; ) {
-		dst = (u32 __iomem *) dst1, shift = 8; src = s;
+	for (i = 0; i < tablen; ++i)
+		colortab[i] = (tab[i] & eorx) ^ bgx;
 
-		for (j = k; j--; ) {
-			shift -= ppw;
-			end_mask = tab[(*src >> shift) & bit_mask];
-			FB_WRITEL((end_mask & eorx)^bgx, dst++);
-			if (!shift) { shift = 8; src++; }
+	for (i = image->height; i--; ) {
+		dst = (u32 __iomem *)dst1;
+		src = s;
+
+		switch (ppw) {
+		case 4: /* 8 bpp */
+			for (j = k; j; j -= 2, ++src) {
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
+		case 2: /* 16 bpp */
+			for (j = k; j; j -= 4, ++src) {
+				FB_WRITEL(colortab[(*src >> 6) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 2) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
+		case 1: /* 32 bpp */
+			for (j = k; j; j -= 8, ++src) {
+				FB_WRITEL(colortab[(*src >> 7) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 6) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 5) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 3) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 2) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 1) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
 		}
+
 		dst1 += p->fix.line_length;
 		s += spitch;
 	}
On Wed, Feb 23, 2022 at 08:38:03PM +0100, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imagebit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles cfb_imageblit(): old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
- fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Acked-by: Sam Ravnborg sam@ravnborg.org
The code looks equally complicated now in the sys and cfb variants.
Question: What is cfb an abbreviation for anyway? Not related to the patch - but if I ever knew, I have since forgotten.
Sam
Hello Sam,
On 2/23/22 21:25, Sam Ravnborg wrote:
[snip]
Question: What is cfb an abbreviation for anyway? Not related to the patch - but if I have known the memory is lost..
I was curious so I dug into this. It seems CFB stands for Color Frame Buffer. Doing a `git grep "(CFB)"` in the linux history repo [0], I get this:
Documentation/isdn/README.diversion: (CFB).
drivers/video/pmag-ba-fb.c: * PMAG-BA TURBOchannel Color Frame Buffer (CFB) card support,
include/video/pmag-ba-fb.h: * TURBOchannel PMAG-BA Color Frame Buffer (CFB) card support,
Probably the helpers are named like this because they were meant for any fbdev driver but assumed that the framebuffer was always in I/O memory. Later, some drivers allocated the framebuffer in system memory and still used these helpers, which use I/O memory accessors, and that is illegal on some architectures.
So the sys_* variants were introduced by commit 68648ed1f58d ("fbdev: add drawing functions for framebuffers in system RAM") to fix this. The old ones just kept their name, but they probably should have been renamed to io_* so that the naming is consistent with the sys_* functions.
[0]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/
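To make the distinction concrete, here is a rough userspace sketch of the two access styles. It is not kernel code: mock_io_write32() is a hypothetical stand-in for an I/O accessor such as the fb_writel() behind FB_WRITEL, modeled here only as a volatile store, and fill_sys()/fill_io() are made-up names. Real accessors can do more than this on some architectures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an I/O accessor: the store goes through a
 * volatile access, so the compiler may not merge or drop it. */
static inline void mock_io_write32(uint32_t value, volatile uint32_t *addr)
{
	*addr = value;
}

/* "sys" style: the framebuffer is ordinary RAM, plain stores are enough
 * and the compiler is free to vectorize or combine them. */
static void fill_sys(uint32_t *fb, uint32_t color, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fb[i] = color;
}

/* "cfb" style: every pixel word is written through the accessor. */
static void fill_io(volatile uint32_t *fb, uint32_t color, size_t n)
{
	for (size_t i = 0; i < n; i++)
		mock_io_write32(color, &fb[i]);
}
```

Both loops leave the same pixel values behind; the difference is in which stores the compiler and the architecture are allowed to reorder, combine or translate into special instructions.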
Best regards,
Hi Javier,

On Thu, Feb 24, 2022 at 10:02:59AM +0100, Javier Martinez Canillas wrote:
Hello Sam,
On 2/23/22 21:25, Sam Ravnborg wrote:
[snip]
Question: What is cfb an abbreviation for anyway? Not related to the patch - but if I have known the memory is lost..
I was curious so I dug on this. It seems CFB stands for Color Frame Buffer. Doing a `git grep "(CFB)"` in the linux history repo [0], I get this:
Documentation/isdn/README.diversion: (CFB). drivers/video/pmag-ba-fb.c: * PMAG-BA TURBOchannel Color Frame Buffer (CFB) card support, include/video/pmag-ba-fb.h: * TURBOchannel PMAG-BA Color Frame Buffer (CFB) card support,
Probably the helpers are called like this because they were for any fbdev driver but assumed that the framebuffer was always in I/O memory. Later some drivers were allocating the framebuffer in system memory and still using the helpers, that were using I/O memory accessors and it's ilegal on some arches.
So the sys_* variants where introduced by commit 68648ed1f58d ("fbdev: add drawing functions for framebuffers in system RAM") to fix this. The old ones just kept their name, but probably it should had been renamed to io_* for the naming to be consistent with the sys_* functions.
Interesting - thanks for the history lesson and thanks for taking your time to share your findings too.
Sam
Hi Javier,
On Thu, Feb 24, 2022 at 10:03 AM Javier Martinez Canillas javierm@redhat.com wrote:
On 2/23/22 21:25, Sam Ravnborg wrote:
Question: What is cfb an abbreviation for anyway? Not related to the patch - but if I have known the memory is lost..
I was curious so I dug on this. It seems CFB stands for Color Frame Buffer. Doing a `git grep "(CFB)"` in the linux history repo [0], I get this:
The naming actually comes from X11. "mfb" is a monochrome frame buffer (bpp = 1). "cfb" is a color frame buffer (bpp > 1), which uses a chunky format.
Probably the helpers are called like this because they were for any fbdev driver but assumed that the framebuffer was always in I/O memory. Later some drivers were allocating the framebuffer in system memory and still using the helpers, that were using I/O memory accessors and it's ilegal on some arches.
Yep. Graphics memory used to be on a graphics card. On systems (usually non-x86) where it was part of main memory, usually it didn't matter at all whether you used I/O memory or plain memory accessors anyway.
Then x86 got unified memory...
Gr{oetje,eeting}s,
Geert
-- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On 2/23/22 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imagebit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles cfb_imageblit(): old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
- fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Makes sense, improves perf and makes the two more consistent as you mention.
Reviewed-by: Javier Martinez Canillas javierm@redhat.com
Best regards,
Hi Thomas,
On 23.02.2022 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imagebit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles cfb_imageblit(): old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
- fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Acked-by: Sam Ravnborg sam@ravnborg.org Reviewed-by: Javier Martinez Canillas javierm@redhat.com
This patch landed recently in linux next-20220308 as commit 0d03011894d2 ("fbdev: Improve performance of cfb_imageblit()"). Sadly it causes a freeze after DRM and emulated fbdev initialization on various Samsung Exynos ARM 32bit based boards. This happens when kernel is compiled from exynos_defconfig. Surprisingly when kernel is compiled from multi_v7_defconfig all those boards boot fine, so this is a matter of one of the debugging options enabled in the exynos_defconfig. I will try to analyze this further and share the results. Reverting $subject on top of next-20220308 fixes the boot issue.
Best regards
Hi
Am 08.03.22 um 23:52 schrieb Marek Szyprowski:
Hi Thomas,
On 23.02.2022 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imagebit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles cfb_imageblit(): old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
- fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Acked-by: Sam Ravnborg sam@ravnborg.org Reviewed-by: Javier Martinez Canillas javierm@redhat.com
This patch landed recently in linux next-20220308 as commit 0d03011894d2 ("fbdev: Improve performance of cfb_imageblit()"). Sadly it causes a freeze after DRM and emulated fbdev initialization on various Samsung Exynos ARM 32bit based boards. This happens when kernel is compiled from exynos_defconfig. Surprisingly when kernel is compiled from multi_v7_defconfig all those boards boot fine, so this is a matter of one of the debugging options enabled in the exynos_defconfig. I will try to analyze this further and share the results. Reverting $subject on top of next-20220308 fixes the boot issue.
Thanks for reporting. I don't have the hardware to reproduce it and there's no obvious difference to the original version. It's supposed to be the same algorithm with a different implementation. Unless you can figure out the issue, we can also revert the patch easily.
Best regards Thomas
Hi,
On 09.03.2022 09:22, Thomas Zimmermann wrote:
Am 08.03.22 um 23:52 schrieb Marek Szyprowski:
On 23.02.2022 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imagebit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles cfb_imageblit(): old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3: * fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Acked-by: Sam Ravnborg sam@ravnborg.org Reviewed-by: Javier Martinez Canillas javierm@redhat.com
This patch landed recently in linux next-20220308 as commit 0d03011894d2 ("fbdev: Improve performance of cfb_imageblit()"). Sadly it causes a freeze after DRM and emulated fbdev initialization on various Samsung Exynos ARM 32bit based boards. This happens when kernel is compiled from exynos_defconfig. Surprisingly when kernel is compiled from multi_v7_defconfig all those boards boot fine, so this is a matter of one of the debugging options enabled in the exynos_defconfig. I will try to analyze this further and share the results. Reverting $subject on top of next-20220308 fixes the boot issue.
Thanks for reporting. I don't have the hardware to reproduce it and there's no obvious difference to the original version. It's supposed to be the same algorithm with a different implementation. Unless you can figure out the issue, we can also revert the patch easily.
I've played a bit with .config options and found that the issue is caused by the compiled-in fonts used for the framebuffer. For some reason (so far unknown to me), exynos_defconfig has the following odd setup:
CONFIG_FONT_SUPPORT=y
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
# CONFIG_FONT_8x16 is not set
# CONFIG_FONT_6x11 is not set
CONFIG_FONT_7x14=y
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_6x10 is not set
# CONFIG_FONT_10x18 is not set
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_TER16x32 is not set
# CONFIG_FONT_6x8 is not set
Such a setup causes a freeze during framebuffer initialization (or just after it got registered). I've reproduced this even on a Raspberry Pi 3B with multi_v7_defconfig and a changed font configuration (this also required disabling the vivid driver, which forces the 8x16 font), where I got the following panic:
simple-framebuffer 3eace000.framebuffer: framebuffer at 0x3eace000, 0x12c000 bytes
simple-framebuffer 3eace000.framebuffer: format=a8r8g8b8, mode=640x480x32, linelength=2560
8<--- cut here ---
Unable to handle kernel paging request at virtual address f0aac000
[f0aac000] *pgd=01d8b811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#1] SMP ARM
Modules linked in:
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc7-next-20220308-00002-g9e9894c98f8c #11471
Hardware name: BCM2835
PC is at cfb_imageblit+0x52c/0x64c
LR is at 0x1
pc : [<c0603dd8>] lr : [<00000001>] psr: a0000013
sp : f081da68 ip : c1d5ffff fp : f081dad8
r10: f0980000 r9 : c1d69600 r8 : fffb5007
r7 : 00000000 r6 : 00000001 r5 : 00000a00 r4 : 00000001
r3 : 00000055 r2 : f0aac000 r1 : f081dad8 r0 : 00000007
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5383d Table: 0000406a DAC: 00000051
Register r0 information: non-paged memory
Register r1 information: 2-page vmalloc region starting at 0xf081c000 allocated at kernel_clone+0xc0/0x428
Register r2 information: 0-page vmalloc region starting at 0xf0980000 allocated at simplefb_probe+0x284/0x9b0
Register r3 information: non-paged memory
Register r4 information: non-paged memory
Register r5 information: non-paged memory
Register r6 information: non-paged memory
Register r7 information: NULL pointer
Register r8 information: non-paged memory
Register r9 information: non-slab/vmalloc memory
Register r10 information: 0-page vmalloc region starting at 0xf0980000 allocated at simplefb_probe+0x284/0x9b0
Register r11 information: 2-page vmalloc region starting at 0xf081c000 allocated at kernel_clone+0xc0/0x428
Register r12 information: non-slab/vmalloc memory
Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
Stack: (0xf081da68 to 0xf081e000)
...
 cfb_imageblit from soft_cursor+0x164/0x1cc
 soft_cursor from bit_cursor+0x4c0/0x4fc
 bit_cursor from fbcon_cursor+0xf8/0x108
 fbcon_cursor from hide_cursor+0x34/0x94
 hide_cursor from redraw_screen+0x13c/0x22c
 redraw_screen from fbcon_prepare_logo+0x164/0x444
 fbcon_prepare_logo from fbcon_init+0x38c/0x4bc
 fbcon_init from visual_init+0xc0/0x108
 visual_init from do_bind_con_driver+0x1ac/0x38c
 do_bind_con_driver from do_take_over_console+0x13c/0x1c8
 do_take_over_console from do_fbcon_takeover+0x74/0xcc
 do_fbcon_takeover from register_framebuffer+0x1bc/0x2cc
 register_framebuffer from simplefb_probe+0x8dc/0x9b0
 simplefb_probe from platform_probe+0x80/0xc0
 platform_probe from really_probe+0xc0/0x304
 really_probe from __driver_probe_device+0x88/0xe0
 __driver_probe_device from driver_probe_device+0x34/0xd4
 driver_probe_device from __driver_attach+0x8c/0xe0
 __driver_attach from bus_for_each_dev+0x64/0xb0
 bus_for_each_dev from bus_add_driver+0x160/0x1e4
 bus_add_driver from driver_register+0x78/0x10c
 driver_register from do_one_initcall+0x44/0x1e0
 do_one_initcall from kernel_init_freeable+0x1bc/0x20c
 kernel_init_freeable from kernel_init+0x18/0x12c
 kernel_init from ret_from_fork+0x14/0x2c
Code: e28db070 e00473a3 e08b7107 e5177044 (e5827000)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
CPU0: stopping
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 5.17.0-rc7-next-20220308-00002-g9e9894c98f8c #11471
Hardware name: BCM2835
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from 0xc1201e64
CPU2: stopping
CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 5.17.0-rc7-next-20220308-00002-g9e9894c98f8c #11471
Hardware name: BCM2835
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from 0xf0809f5c
CPU1: stopping
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 5.17.0-rc7-next-20220308-00002-g9e9894c98f8c #11471
Hardware name: BCM2835
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from 0xf0805f5c
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Best regards
Hi Marek,
On Wed, Mar 9, 2022 at 10:22 AM Marek Szyprowski m.szyprowski@samsung.com wrote:
On 09.03.2022 09:22, Thomas Zimmermann wrote:
On 08.03.22 at 23:52, Marek Szyprowski wrote:
On 23.02.2022 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3: * fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Acked-by: Sam Ravnborg sam@ravnborg.org Reviewed-by: Javier Martinez Canillas javierm@redhat.com
This patch landed recently in linux next-20220308 as commit 0d03011894d2 ("fbdev: Improve performance of cfb_imageblit()"). Sadly it causes a freeze after DRM and emulated fbdev initialization on various Samsung Exynos ARM 32bit based boards. This happens when kernel is compiled from exynos_defconfig. Surprisingly when kernel is compiled from multi_v7_defconfig all those boards boot fine, so this is a matter of one of the debugging options enabled in the exynos_defconfig. I will try to analyze this further and share the results. Reverting $subject on top of next-20220308 fixes the boot issue.
Thanks for reporting. I don't have the hardware to reproduce it and there's no obvious difference to the original version. It's supposed to be the same algorithm with a different implementation. Unless you can figure out the issue, we can also revert the patch easily.
I've played a bit with .config options and found that the issue is caused by the compiled-in fonts used for the framebuffer. For some reason (so far unknown to me), exynos_defconfig has the following odd setup:
CONFIG_FONT_SUPPORT=y
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
# CONFIG_FONT_8x16 is not set
# CONFIG_FONT_6x11 is not set
CONFIG_FONT_7x14=y
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_6x10 is not set
# CONFIG_FONT_10x18 is not set
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_TER16x32 is not set
# CONFIG_FONT_6x8 is not set
Such a setup causes a freeze during framebuffer initialization (or just after it got registered). I've reproduced this even on a Raspberry Pi 3B with multi_v7_defconfig and a changed fonts configuration (this also required disabling the vivid driver, which forces the 8x16 font), where I got the following panic:
simple-framebuffer 3eace000.framebuffer: framebuffer at 0x3eace000, 0x12c000 bytes
simple-framebuffer 3eace000.framebuffer: format=a8r8g8b8, mode=640x480x32, linelength=2560
8<--- cut here ---
Unable to handle kernel paging request at virtual address f0aac000
So support for images with offsets or widths that are not a multiple of 8 got broken in cfb_imageblit(). Oops...
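To illustrate the failure mode, here is a minimal user-space sketch of expanding one row of a 1-bit-per-pixel glyph into 32-bit pixels. It is not the code from cfb_imageblit(); the function name expand_row() and its parameters are made up for this example, and MSB-first bit order is an assumption. The point is only that an inner loop unrolled to eight pixels per source byte needs a separate tail for widths such as 7, otherwise it writes past the end of the row.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's implementation): expand one row
 * of a monochrome image, MSB first, into 32-bit foreground/background
 * pixels.
 */
static void expand_row(const uint8_t *src, uint32_t *dst,
		       unsigned int width, uint32_t fg, uint32_t bg)
{
	unsigned int full = width / 8;	/* complete source bytes */
	unsigned int rest = width % 8;	/* leftover pixels, 7 for a 7x14 font */
	unsigned int i, bit;

	/* Unrolled part: exactly eight pixels per source byte. */
	for (i = 0; i < full; i++) {
		uint8_t b = src[i];

		dst[0] = (b & 0x80) ? fg : bg;
		dst[1] = (b & 0x40) ? fg : bg;
		dst[2] = (b & 0x20) ? fg : bg;
		dst[3] = (b & 0x10) ? fg : bg;
		dst[4] = (b & 0x08) ? fg : bg;
		dst[5] = (b & 0x04) ? fg : bg;
		dst[6] = (b & 0x02) ? fg : bg;
		dst[7] = (b & 0x01) ? fg : bg;
		dst += 8;
	}

	/* Tail: write only the remaining pixels, never a full group of 8. */
	if (rest) {
		uint8_t b = src[full];

		for (bit = 0; bit < rest; bit++)
			dst[bit] = (b & (0x80 >> bit)) ? fg : bg;
	}
}
```

With width = 7 the unrolled loop runs zero times and the tail writes seven pixels, so the eighth destination word stays untouched.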
BTW, the various drawing routines used to set a bitmask indicating which alignments were supported (see blit_x), but most of them no longer do, presumably because all alignments are now supported (since ca. 20 years?). So you can (temporarily) work around this by filling in blit_x, preventing the use of the 7x14 font.
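A rough sketch of how such a width bitmask can be read, assuming the convention that bit (width - 1) of blit_x marks width as supported for widths 1 to 32. The helper names below are hypothetical and this is a stand-alone model, not kernel code; a driver would set the real field in its fb_info pixmap instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed convention: bit (width - 1) set means this width is supported. */
static bool width_supported(uint32_t blit_x, unsigned int width)
{
	if (width < 1 || width > 32)
		return false;
	return (blit_x & (UINT32_C(1) << (width - 1))) != 0;
}

/* Hypothetical mask that only advertises widths 8, 16, 24 and 32. */
static uint32_t multiples_of_8_mask(void)
{
	return (UINT32_C(1) << 7) | (UINT32_C(1) << 15) |
	       (UINT32_C(1) << 23) | (UINT32_C(1) << 31);
}
```

With such a mask, a 7-pixel-wide font would be rejected while 8x16 would still be accepted, which is the effect of the temporary workaround described above.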
Gr{oetje,eeting}s,
Geert
-- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Hi
On 09.03.22 at 11:39, Geert Uytterhoeven wrote:
Hi Marek,
On Wed, Mar 9, 2022 at 10:22 AM Marek Szyprowski m.szyprowski@samsung.com wrote:
On 09.03.2022 09:22, Thomas Zimmermann wrote:
On 08.03.22 at 23:52, Marek Szyprowski wrote:
On 23.02.2022 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3: * fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Acked-by: Sam Ravnborg sam@ravnborg.org Reviewed-by: Javier Martinez Canillas javierm@redhat.com
This patch landed recently in linux next-20220308 as commit 0d03011894d2 ("fbdev: Improve performance of cfb_imageblit()"). Sadly it causes a freeze after DRM and emulated fbdev initialization on various Samsung Exynos ARM 32bit based boards. This happens when kernel is compiled from exynos_defconfig. Surprisingly when kernel is compiled from multi_v7_defconfig all those boards boot fine, so this is a matter of one of the debugging options enabled in the exynos_defconfig. I will try to analyze this further and share the results. Reverting $subject on top of next-20220308 fixes the boot issue.
Thanks for reporting. I don't have the hardware to reproduce it and there's no obvious difference to the original version. It's supposed to be the same algorithm with a different implementation. Unless you can figure out the issue, we can also revert the patch easily.
I've played a bit with .config options and found that the issue is caused by the compiled-in fonts used for the framebuffer. For some reason (so far unknown to me), exynos_defconfig has the following odd setup:
CONFIG_FONT_SUPPORT=y
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
# CONFIG_FONT_8x16 is not set
# CONFIG_FONT_6x11 is not set
CONFIG_FONT_7x14=y
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_6x10 is not set
# CONFIG_FONT_10x18 is not set
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_TER16x32 is not set
# CONFIG_FONT_6x8 is not set
Such a setup causes a freeze during framebuffer initialization (or just after it got registered). I've reproduced this even on a Raspberry Pi 3B with multi_v7_defconfig and a changed fonts configuration (this also required disabling the vivid driver, which forces the 8x16 font), where I got the following panic:
simple-framebuffer 3eace000.framebuffer: framebuffer at 0x3eace000, 0x12c000 bytes
simple-framebuffer 3eace000.framebuffer: format=a8r8g8b8, mode=640x480x32, linelength=2560
8<--- cut here ---
Unable to handle kernel paging request at virtual address f0aac000
So support for images with offsets or widths that are not a multiple of 8 got broken in cfb_imageblit(). Oops...
BTW, the various drawing routines used to set a bitmask indicating which alignments were supported (see blit_x), but most of them no longer do, presumably because all alignments are now supported (since ca. 20 years?). So you can (temporarily) work around this by filling in blit_x, preventing the use of the 7x14 font.
How do I activate the 7x14 font? It's compiled into the kernel already (CONFIG_FONT_7x14=y).
Best regards Thomas
Hi Thomas,
On Thu, Mar 10, 2022 at 8:22 PM Thomas Zimmermann tzimmermann@suse.de wrote:
On 09.03.22 at 11:39, Geert Uytterhoeven wrote:
On Wed, Mar 9, 2022 at 10:22 AM Marek Szyprowski m.szyprowski@samsung.com wrote:
On 09.03.2022 09:22, Thomas Zimmermann wrote:
On 08.03.22 at 23:52, Marek Szyprowski wrote:
On 23.02.2022 20:38, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3: * fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Acked-by: Sam Ravnborg sam@ravnborg.org Reviewed-by: Javier Martinez Canillas javierm@redhat.com
This patch landed recently in linux next-20220308 as commit 0d03011894d2 ("fbdev: Improve performance of cfb_imageblit()"). Sadly it causes a freeze after DRM and emulated fbdev initialization on various Samsung Exynos ARM 32bit based boards. This happens when kernel is compiled from exynos_defconfig. Surprisingly when kernel is compiled from multi_v7_defconfig all those boards boot fine, so this is a matter of one of the debugging options enabled in the exynos_defconfig. I will try to analyze this further and share the results. Reverting $subject on top of next-20220308 fixes the boot issue.
Thanks for reporting. I don't have the hardware to reproduce it and there's no obvious difference to the original version. It's supposed to be the same algorithm with a different implementation. Unless you can figure out the issue, we can also revert the patch easily.
I've played a bit with .config options and found that the issue is caused by the compiled-in fonts used for the framebuffer. For some reason (so far unknown to me), exynos_defconfig has the following odd setup:
CONFIG_FONT_SUPPORT=y
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
# CONFIG_FONT_8x16 is not set
# CONFIG_FONT_6x11 is not set
CONFIG_FONT_7x14=y
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_6x10 is not set
# CONFIG_FONT_10x18 is not set
# CONFIG_FONT_SUN8x16 is not set
# CONFIG_FONT_SUN12x22 is not set
# CONFIG_FONT_TER16x32 is not set
# CONFIG_FONT_6x8 is not set
Such a setup causes a freeze during framebuffer initialization (or just after it got registered). I've reproduced this even on a Raspberry Pi 3B with multi_v7_defconfig and a changed fonts configuration (this also required disabling the vivid driver, which forces the 8x16 font), where I got the following panic:
simple-framebuffer 3eace000.framebuffer: framebuffer at 0x3eace000, 0x12c000 bytes
simple-framebuffer 3eace000.framebuffer: format=a8r8g8b8, mode=640x480x32, linelength=2560
8<--- cut here ---
Unable to handle kernel paging request at virtual address f0aac000
So support for images with offsets or widths that are not a multiple of 8 got broken in cfb_imageblit(). Oops...
BTW, the various drawing routines used to set a bitmask indicating which alignments were supported (see blit_x), but most of them no longer do, presumably because all alignments are now supported (since ca. 20 years?). So you can (temporarily) work around this by filling in blit_x, preventing the use of the 7x14 font.
How do I activate the 7x14 font? It's compiled into the kernel already (CONFIG_FONT_7x14=y).
Documentation/fb/fbcon.rst:1. fbcon=font:<name>
Or just disable all other fonts.
Gr{oetje,eeting}s,
Geert
-- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Hi Geert
On 10.03.22 at 20:23, Geert Uytterhoeven wrote: [...]
How do I activate the 7x14 font? It's compiled into the kernel already (CONFIG_FONT_7x14=y).
Documentation/fb/fbcon.rst:1. fbcon=font:<name>
Or just disable all other fonts.
Thanks. I've been able to reproduce the problem and will send a patch soon.
Best regards Thomas
Hi,
On Wed, Feb 23, 2022 at 08:38:03PM +0100, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
- fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
This patch causes crashes with arm mainstone, z2, and collie emulations. Reverting it fixes the problem.
collie crash log and bisect log attached.
Guenter
--- 8<--- cut here --- Unable to handle kernel paging request at virtual address e090d000 [e090d000] *pgd=c0c0b811c0c0b811, *pte=c0c0b000, *ppte=00000000 Internal error: Oops: 807 [#1] ARM CPU: 0 PID: 1 Comm: swapper Not tainted 5.17.0-next-20220324 #1 Hardware name: Sharp-Collie PC is at cfb_imageblit+0x58c/0x6e0 LR is at 0x5 pc : [<c040eab0>] lr : [<00000005>] psr: a0000153 sp : e0809958 ip : e090d000 fp : e08099f4 r10: e08099c8 r9 : c0c70600 r8 : ffff6802 r7 : c0c6e000 r6 : 00000000 r5 : e08e7000 r4 : 00000280 r3 : 00000020 r2 : 00000003 r1 : 00000002 r0 : 00000002 Flags: NzCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none Control: 0000717f Table: c0004000 DAC: 00000053 Register r0 information: non-paged memory Register r1 information: non-paged memory Register r2 information: non-paged memory Register r3 information: non-paged memory Register r4 information: non-paged memory Register r5 information: 0-page vmalloc region starting at 0xe08e6000 allocated at dma_common_contiguous_remap+0x94/0xb0 Register r6 information: NULL pointer Register r7 information: non-slab/vmalloc memory Register r8 information: non-paged memory Register r9 information: non-slab/vmalloc memory Register r10 information: 2-page vmalloc region starting at 0xe0808000 allocated at kernel_clone+0x78/0x4e4 Register r11 information: 2-page vmalloc region starting at 0xe0808000 allocated at kernel_clone+0x78/0x4e4 Register r12 information: 0-page vmalloc region starting at 0xe08e6000 allocated at dma_common_contiguous_remap+0x94/0xb0 Process swapper (pid: 1, stack limit = 0x(ptrval)) Stack: (0xe0809958 to 0xe080a000) 9940: 80000153 0000005e 9960: dfb1b424 00000020 00000000 00000000 00000001 00000002 00000003 00000004 9980: dfb1b420 00000000 00000000 00000000 00000000 c067f338 e08099ab 00000026 99a0: 80000153 00000820 007fe178 c07db82c e08099d4 0000003e 00000820 c0e32b00 99c0: 00000006 c07db82c 00000001 c0da1e40 e0809a54 c0e32b00 00000006 00000001 99e0: 00000001 c0c6e000 e0809a34 e08099f8 
c040a3f8 c040e530 00000006 00000001 9a00: c0e61920 c0da1e78 00000000 c0e61920 00000000 e0809a54 c06ad89c c0e32b00 9a20: c0da1e00 00000020 e0809acc e0809a38 c040a040 c040a26c e0809a7c 00000140 9a40: 00000002 00000002 00000001 00000007 00000000 00000039 00000001 c0da1e00 9a60: 00000000 00000000 00000000 00000004 00000006 00000007 00000000 00000001 9a80: c06ad89c 00000000 00000000 00000000 00000000 ffffffff ffffffff c07db82c 9aa0: e0809acc c0c0c3c0 c0e32b00 00000007 00000002 00000720 c0409cf0 00000028 9ac0: e0809afc e0809ad0 c040665c c0409cfc 00000000 00000000 c0c0c3c0 c0807584 9ae0: 00000000 00000000 ffffff60 c0c70000 e0809b1c e0809b00 c0439a50 c040656c 9b00: c0c0c3c0 00000000 00000000 00000000 e0809b54 e0809b20 c043a798 c0439a24 9b20: c04095c8 c0c6ff60 00000000 c07db82c e0809b54 c0c0c3c0 c0c6ff60 00000000 9b40: 00000000 ffffff60 e0809ba4 e0809b58 c0407254 c043a5ac e0809b7c e0809b68 9b60: c04145d8 00000000 00000000 00000000 00000720 00000000 00000050 c0c0c3c0 9b80: c0e32b00 c0e61920 00000050 00000028 c0a00df8 00000028 e0809bec e0809ba8 9ba0: c0407748 c0406f04 00000050 00000028 00000050 00000001 c0a02f70 00000000 9bc0: 00000000 c0c0c3c0 c0c0c624 00000000 c0a02f84 0000003e 00000000 c0a03080 9be0: e0809c0c e0809bf0 c0438b10 c040734c c0c0c3c0 c06affbc 00000001 c0a02f84 9c00: e0809c54 e0809c10 c043be28 c0438a80 0000003e 00000001 00000000 c0779d88 9c20: 00000000 00000001 c08075a8 c06affbc 00000000 00000001 00000000 0000003e 9c40: 00000001 c0a02f8c e0809c9c e0809c58 c043c6ec c043bc98 c08075a8 c077c29c 9c60: 00000001 00000000 c0e32b44 c0a03a58 c067f354 c0805a24 c0a00cc8 c0805a24 9c80: 00000000 c07dbabc c0e32da4 fffff000 e0809cb4 e0809ca0 c0405d5c c043c5d4 9ca0: c0a00dac 00000000 e0809cd4 e0809cb8 c0408f48 c0405cfc c0e32b00 00000000 9cc0: c0a00ca8 c0e32b10 e0809d44 e0809cd8 c03ff9e4 c0408e70 c0779a14 00000000 9ce0: c000ea7c 00000000 00000041 00000140 000000f0 00029e01 0000000b 0000001e 9d00: 00000002 00000000 00000005 00000001 00000003 00000000 00000020 c07db82c 9d20: 
c0e32b00 00000000 c07dfe08 00000004 0000000d 00000000 e0809d84 e0809d48 9d40: c040f550 c03ff7e8 00000004 c077a1b8 c0e32b00 c0180a04 c07dfe18 00000000 9d60: c07dfe18 c0805abc 00000000 00000000 c07cb87c c07cb85c e0809da4 e0809d88 9d80: c045c2c8 c040f228 00000000 c07dfe18 c0805abc 00000000 e0809dc4 e0809da8 9da0: c045a304 c045c288 c07dfe18 c0805abc c07dfe18 00000000 e0809ddc e0809dc8 9dc0: c045a548 c045a250 c0a04c6c 60000153 e0809e04 e0809de0 c045a5f4 c045a4d0 9de0: e0809e04 e0809df0 c07dfe18 c0805abc c045a7d0 c080af60 e0809e24 e0809e08 9e00: c045a860 c045a5b4 00000000 c0805abc c045a7d0 c080af60 e0809e54 e0809e28 9e20: c0458694 c045a7dc c0c30e20 c0c30dec c0c9f3b4 c07db82c c06559b8 c0805abc 9e40: c0e5d340 00000000 e0809e64 e0809e58 c045adf0 c0458620 e0809e8c e0809e68 9e60: c0458fb8 c045addc c0749e14 e0809e78 c0805abc c0c19000 c07bc340 c07a4bc8 9e80: e0809ea4 e0809e90 c045b684 c0458e84 c0818000 c0c19000 e0809eb4 e0809ea8 9ea0: c045d10c c045b614 e0809ec4 e0809eb8 c07bc368 c045d0f8 e0809f4c e0809ec8 9ec0: c07a821c c07bc34c e0809eec e0809ed8 c00374f4 c065f390 c0c427da c07a5600 9ee0: e0809f4c e0809ef0 c00376f4 c07a7448 e0809f03 00000006 00000006 00000000 9f00: 00000000 c07a743c c0793b5c c07a4bc8 00000dc0 c0c427cc c0c427d4 c07db82c 9f20: 00000000 c07d2060 c0c427a0 00000007 c07a4bc8 c0818000 c07cb87c c07cb85c 9f40: e0809f94 e0809f50 c07a85a0 c07a81b0 00000006 00000006 00000000 c07a743c 9f60: c0c19000 0000008c e0809f8c 00000000 c0675804 00000000 00000000 00000000 9f80: 00000000 00000000 e0809fac e0809f98 c067581c c07a842c 00000000 c0675804 9fa0: 00000000 e0809fb0 c0008328 c0675810 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000 Backtrace: cfb_imageblit from soft_cursor+0x198/0x1fc r10:c0c6e000 r9:00000001 r8:00000001 r7:00000006 r6:c0e32b00 r5:e0809a54 r4:c0da1e40 soft_cursor from bit_cursor+0x350/0x4fc r10:00000020 r9:c0da1e00 
r8:c0e32b00 r7:c06ad89c r6:e0809a54 r5:00000000 r4:c0e61920 bit_cursor from fbcon_cursor+0xfc/0x110 r10:00000028 r9:c0409cf0 r8:00000720 r7:00000002 r6:00000007 r5:c0e32b00 r4:c0c0c3c0 fbcon_cursor from hide_cursor+0x38/0xac r9:c0c70000 r8:ffffff60 r7:00000000 r6:00000000 r5:c0807584 r4:c0c0c3c0 hide_cursor from redraw_screen+0x1f8/0x258 r7:00000000 r6:00000000 r5:00000000 r4:c0c0c3c0 redraw_screen from fbcon_prepare_logo+0x35c/0x448 r8:ffffff60 r7:00000000 r6:00000000 r5:c0c6ff60 r4:c0c0c3c0 fbcon_prepare_logo from fbcon_init+0x408/0x4f8 r10:00000028 r9:c0a00df8 r8:00000028 r7:00000050 r6:c0e61920 r5:c0e32b00 r4:c0c0c3c0 fbcon_init from visual_init+0x9c/0xe0 r10:c0a03080 r9:00000000 r8:0000003e r7:c0a02f84 r6:00000000 r5:c0c0c624 r4:c0c0c3c0 visual_init from do_bind_con_driver+0x19c/0x370 r7:c0a02f84 r6:00000001 r5:c06affbc r4:c0c0c3c0 do_bind_con_driver from do_take_over_console+0x124/0x1b8 r10:c0a02f8c r9:00000001 r8:0000003e r7:00000000 r6:00000001 r5:00000000 r4:c06affbc do_take_over_console from do_fbcon_takeover+0x6c/0xcc r10:fffff000 r9:c0e32da4 r8:c07dbabc r7:00000000 r6:c0805a24 r5:c0a00cc8 r4:c0805a24 do_fbcon_takeover from fbcon_fb_registered+0xe4/0x128 r5:00000000 r4:c0a00dac fbcon_fb_registered from register_framebuffer+0x208/0x318 r7:c0e32b10 r6:c0a00ca8 r5:00000000 r4:c0e32b00 register_framebuffer from sa1100fb_probe+0x334/0x420 r9:00000000 r8:0000000d r7:00000004 r6:c07dfe08 r5:00000000 r4:c0e32b00 sa1100fb_probe from platform_probe+0x4c/0xac r10:c07cb85c r9:c07cb87c r8:00000000 r7:00000000 r6:c0805abc r5:c07dfe18 r4:00000000 platform_probe from really_probe+0xc0/0x280 r7:00000000 r6:c0805abc r5:c07dfe18 r4:00000000 really_probe from __driver_probe_device+0x84/0xe4 r7:00000000 r6:c07dfe18 r5:c0805abc r4:c07dfe18 __driver_probe_device from driver_probe_device+0x4c/0x10c r5:60000153 r4:c0a04c6c driver_probe_device from __driver_attach+0x90/0x104 r7:c080af60 r6:c045a7d0 r5:c0805abc r4:c07dfe18 __driver_attach from bus_for_each_dev+0x80/0xcc 
r7:c080af60 r6:c045a7d0 r5:c0805abc r4:00000000 bus_for_each_dev from driver_attach+0x20/0x28 r6:00000000 r5:c0e5d340 r4:c0805abc driver_attach from bus_add_driver+0x140/0x1c8 bus_add_driver from driver_register+0x7c/0x110 r7:c07a4bc8 r6:c07bc340 r5:c0c19000 r4:c0805abc driver_register from __platform_driver_register+0x20/0x28 r5:c0c19000 r4:c0818000 __platform_driver_register from sa1100fb_init+0x28/0x3c sa1100fb_init from do_one_initcall+0x78/0x220 do_one_initcall from kernel_init_freeable+0x180/0x1fc r10:c07cb85c r9:c07cb87c r8:c0818000 r7:c07a4bc8 r6:00000007 r5:c0c427a0 r4:c07d2060 kernel_init_freeable from kernel_init+0x18/0x10c r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0675804 r4:00000000 kernel_init from ret_from_fork+0x14/0x2c Exception stack(0xe0809fb0 to 0xe0809ff8) 9fa0: 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 r5:c0675804 r4:00000000 Code: e24ba02c e0026323 e08a6106 e5166044 (e58c6000) ---[ end trace 00000000c08187d8 ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b Reboot failed -- System halted
---
# bad: [dd315b5800612e6913343524aa9b993f9a8bb0cf] Add linux-next specific files for 20220324
# good: [f443e374ae131c168a065ea1748feac6b2e76613] Linux 5.17
git bisect start 'HEAD' 'v5.17'
# good: [6788381e2f3c20c25cf7ab91df9cf0d6bec153f9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
git bisect good 6788381e2f3c20c25cf7ab91df9cf0d6bec153f9
# bad: [59c7e0caa3e7bc21dd1b6c681c87d2b307f399ee] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect bad 59c7e0caa3e7bc21dd1b6c681c87d2b307f399ee
# good: [4d17d43de9d186150b3289ce99d7a79fcff202f9] net: usb: asix: suspend embedded PHY if external is used
git bisect good 4d17d43de9d186150b3289ce99d7a79fcff202f9
# good: [6c64ae228f0826859c56711ce133aff037d6205f] Backmerge tag 'v5.17-rc6' into drm-next
git bisect good 6c64ae228f0826859c56711ce133aff037d6205f
# good: [01fd8d2522c49a333b0ee46ba19a6fedfc1c9a60] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
git bisect good 01fd8d2522c49a333b0ee46ba19a6fedfc1c9a60
# bad: [6de7e4f02640fba2ffa6ac04e2be13785d614175] Merge tag 'drm-msm-next-2022-03-01' of https://gitlab.freedesktop.org/drm/msm into drm-next
git bisect bad 6de7e4f02640fba2ffa6ac04e2be13785d614175
# bad: [c9e9ce0b6f85ac330adee912745048a0af5f315d] Merge tag 'drm-misc-next-2022-03-03' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad c9e9ce0b6f85ac330adee912745048a0af5f315d
# good: [e2573d5f2a5cebe789bbf415e484b589d8eebad7] drm/amd/display: limit unbounded requesting to 5k
git bisect good e2573d5f2a5cebe789bbf415e484b589d8eebad7
# good: [3c54c95bd917d43d12fe1b192df9aa4c5973449b] fbdev: Remove trailing whitespaces from cfbimgblt.c
git bisect good 3c54c95bd917d43d12fe1b192df9aa4c5973449b
# good: [ed6e76676b2657b71a0b9e5e847d96e4de0b394b] drm: rcar-du: lvds: Add r8a77961 support
git bisect good ed6e76676b2657b71a0b9e5e847d96e4de0b394b
# good: [66a8af1f6e3c10190dff14a5668661c092a2a85f] Merge tag 'drm/tegra/for-5.18-rc1' of https://gitlab.freedesktop.org/drm/tegra into drm-next
git bisect good 66a8af1f6e3c10190dff14a5668661c092a2a85f
# bad: [701920ca9822eb63b420b3bcb627f2c1ec759903] drm/ssd130x: remove redundant initialization of pointer mode
git bisect bad 701920ca9822eb63b420b3bcb627f2c1ec759903
# bad: [9ae2ac4d31a85ce59cc560d514a31b95f4ace154] drm: Add TODO item for optimizing format helpers
git bisect bad 9ae2ac4d31a85ce59cc560d514a31b95f4ace154
# bad: [0d03011894d23241db1a1cad5c12aede60897d5e] fbdev: Improve performance of cfb_imageblit()
git bisect bad 0d03011894d23241db1a1cad5c12aede60897d5e
# first bad commit: [0d03011894d23241db1a1cad5c12aede60897d5e] fbdev: Improve performance of cfb_imageblit()
Hi
On 24.03.22 at 20:11, Guenter Roeck wrote:
Hi,
On Wed, Feb 23, 2022 at 08:38:03PM +0100, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
v3:
- fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
This patch causes crashes with arm mainstone, z2, and collie emulations. Reverting it fixes the problem.
collie crash log and bisect log attached.
Does it work if you apply the fixes at
https://patchwork.freedesktop.org/series/101321/
?
Best regards Thomas
Guenter
8<--- cut here --- Unable to handle kernel paging request at virtual address e090d000 [e090d000] *pgd=c0c0b811c0c0b811, *pte=c0c0b000, *ppte=00000000 Internal error: Oops: 807 [#1] ARM CPU: 0 PID: 1 Comm: swapper Not tainted 5.17.0-next-20220324 #1 Hardware name: Sharp-Collie PC is at cfb_imageblit+0x58c/0x6e0 LR is at 0x5 pc : [<c040eab0>] lr : [<00000005>] psr: a0000153 sp : e0809958 ip : e090d000 fp : e08099f4 r10: e08099c8 r9 : c0c70600 r8 : ffff6802 r7 : c0c6e000 r6 : 00000000 r5 : e08e7000 r4 : 00000280 r3 : 00000020 r2 : 00000003 r1 : 00000002 r0 : 00000002 Flags: NzCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none Control: 0000717f Table: c0004000 DAC: 00000053 Register r0 information: non-paged memory Register r1 information: non-paged memory Register r2 information: non-paged memory Register r3 information: non-paged memory Register r4 information: non-paged memory Register r5 information: 0-page vmalloc region starting at 0xe08e6000 allocated at dma_common_contiguous_remap+0x94/0xb0 Register r6 information: NULL pointer Register r7 information: non-slab/vmalloc memory Register r8 information: non-paged memory Register r9 information: non-slab/vmalloc memory Register r10 information: 2-page vmalloc region starting at 0xe0808000 allocated at kernel_clone+0x78/0x4e4 Register r11 information: 2-page vmalloc region starting at 0xe0808000 allocated at kernel_clone+0x78/0x4e4 Register r12 information: 0-page vmalloc region starting at 0xe08e6000 allocated at dma_common_contiguous_remap+0x94/0xb0 Process swapper (pid: 1, stack limit = 0x(ptrval)) Stack: (0xe0809958 to 0xe080a000) 9940: 80000153 0000005e 9960: dfb1b424 00000020 00000000 00000000 00000001 00000002 00000003 00000004 9980: dfb1b420 00000000 00000000 00000000 00000000 c067f338 e08099ab 00000026 99a0: 80000153 00000820 007fe178 c07db82c e08099d4 0000003e 00000820 c0e32b00 99c0: 00000006 c07db82c 00000001 c0da1e40 e0809a54 c0e32b00 00000006 00000001 99e0: 00000001 c0c6e000 e0809a34 e08099f8 c040a3f8 
c040e530 00000006 00000001 9a00: c0e61920 c0da1e78 00000000 c0e61920 00000000 e0809a54 c06ad89c c0e32b00 9a20: c0da1e00 00000020 e0809acc e0809a38 c040a040 c040a26c e0809a7c 00000140 9a40: 00000002 00000002 00000001 00000007 00000000 00000039 00000001 c0da1e00 9a60: 00000000 00000000 00000000 00000004 00000006 00000007 00000000 00000001 9a80: c06ad89c 00000000 00000000 00000000 00000000 ffffffff ffffffff c07db82c 9aa0: e0809acc c0c0c3c0 c0e32b00 00000007 00000002 00000720 c0409cf0 00000028 9ac0: e0809afc e0809ad0 c040665c c0409cfc 00000000 00000000 c0c0c3c0 c0807584 9ae0: 00000000 00000000 ffffff60 c0c70000 e0809b1c e0809b00 c0439a50 c040656c 9b00: c0c0c3c0 00000000 00000000 00000000 e0809b54 e0809b20 c043a798 c0439a24 9b20: c04095c8 c0c6ff60 00000000 c07db82c e0809b54 c0c0c3c0 c0c6ff60 00000000 9b40: 00000000 ffffff60 e0809ba4 e0809b58 c0407254 c043a5ac e0809b7c e0809b68 9b60: c04145d8 00000000 00000000 00000000 00000720 00000000 00000050 c0c0c3c0 9b80: c0e32b00 c0e61920 00000050 00000028 c0a00df8 00000028 e0809bec e0809ba8 9ba0: c0407748 c0406f04 00000050 00000028 00000050 00000001 c0a02f70 00000000 9bc0: 00000000 c0c0c3c0 c0c0c624 00000000 c0a02f84 0000003e 00000000 c0a03080 9be0: e0809c0c e0809bf0 c0438b10 c040734c c0c0c3c0 c06affbc 00000001 c0a02f84 9c00: e0809c54 e0809c10 c043be28 c0438a80 0000003e 00000001 00000000 c0779d88 9c20: 00000000 00000001 c08075a8 c06affbc 00000000 00000001 00000000 0000003e 9c40: 00000001 c0a02f8c e0809c9c e0809c58 c043c6ec c043bc98 c08075a8 c077c29c 9c60: 00000001 00000000 c0e32b44 c0a03a58 c067f354 c0805a24 c0a00cc8 c0805a24 9c80: 00000000 c07dbabc c0e32da4 fffff000 e0809cb4 e0809ca0 c0405d5c c043c5d4 9ca0: c0a00dac 00000000 e0809cd4 e0809cb8 c0408f48 c0405cfc c0e32b00 00000000 9cc0: c0a00ca8 c0e32b10 e0809d44 e0809cd8 c03ff9e4 c0408e70 c0779a14 00000000 9ce0: c000ea7c 00000000 00000041 00000140 000000f0 00029e01 0000000b 0000001e 9d00: 00000002 00000000 00000005 00000001 00000003 00000000 00000020 c07db82c 9d20: c0e32b00 
00000000 c07dfe08 00000004 0000000d 00000000 e0809d84 e0809d48 9d40: c040f550 c03ff7e8 00000004 c077a1b8 c0e32b00 c0180a04 c07dfe18 00000000 9d60: c07dfe18 c0805abc 00000000 00000000 c07cb87c c07cb85c e0809da4 e0809d88 9d80: c045c2c8 c040f228 00000000 c07dfe18 c0805abc 00000000 e0809dc4 e0809da8 9da0: c045a304 c045c288 c07dfe18 c0805abc c07dfe18 00000000 e0809ddc e0809dc8 9dc0: c045a548 c045a250 c0a04c6c 60000153 e0809e04 e0809de0 c045a5f4 c045a4d0 9de0: e0809e04 e0809df0 c07dfe18 c0805abc c045a7d0 c080af60 e0809e24 e0809e08 9e00: c045a860 c045a5b4 00000000 c0805abc c045a7d0 c080af60 e0809e54 e0809e28 9e20: c0458694 c045a7dc c0c30e20 c0c30dec c0c9f3b4 c07db82c c06559b8 c0805abc 9e40: c0e5d340 00000000 e0809e64 e0809e58 c045adf0 c0458620 e0809e8c e0809e68 9e60: c0458fb8 c045addc c0749e14 e0809e78 c0805abc c0c19000 c07bc340 c07a4bc8 9e80: e0809ea4 e0809e90 c045b684 c0458e84 c0818000 c0c19000 e0809eb4 e0809ea8 9ea0: c045d10c c045b614 e0809ec4 e0809eb8 c07bc368 c045d0f8 e0809f4c e0809ec8 9ec0: c07a821c c07bc34c e0809eec e0809ed8 c00374f4 c065f390 c0c427da c07a5600 9ee0: e0809f4c e0809ef0 c00376f4 c07a7448 e0809f03 00000006 00000006 00000000 9f00: 00000000 c07a743c c0793b5c c07a4bc8 00000dc0 c0c427cc c0c427d4 c07db82c 9f20: 00000000 c07d2060 c0c427a0 00000007 c07a4bc8 c0818000 c07cb87c c07cb85c 9f40: e0809f94 e0809f50 c07a85a0 c07a81b0 00000006 00000006 00000000 c07a743c 9f60: c0c19000 0000008c e0809f8c 00000000 c0675804 00000000 00000000 00000000 9f80: 00000000 00000000 e0809fac e0809f98 c067581c c07a842c 00000000 c0675804 9fa0: 00000000 e0809fb0 c0008328 c0675810 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000 Backtrace: cfb_imageblit from soft_cursor+0x198/0x1fc r10:c0c6e000 r9:00000001 r8:00000001 r7:00000006 r6:c0e32b00 r5:e0809a54 r4:c0da1e40 soft_cursor from bit_cursor+0x350/0x4fc r10:00000020 r9:c0da1e00 r8:c0e32b00 
r7:c06ad89c r6:e0809a54 r5:00000000 r4:c0e61920 bit_cursor from fbcon_cursor+0xfc/0x110 r10:00000028 r9:c0409cf0 r8:00000720 r7:00000002 r6:00000007 r5:c0e32b00 r4:c0c0c3c0 fbcon_cursor from hide_cursor+0x38/0xac r9:c0c70000 r8:ffffff60 r7:00000000 r6:00000000 r5:c0807584 r4:c0c0c3c0 hide_cursor from redraw_screen+0x1f8/0x258 r7:00000000 r6:00000000 r5:00000000 r4:c0c0c3c0 redraw_screen from fbcon_prepare_logo+0x35c/0x448 r8:ffffff60 r7:00000000 r6:00000000 r5:c0c6ff60 r4:c0c0c3c0 fbcon_prepare_logo from fbcon_init+0x408/0x4f8 r10:00000028 r9:c0a00df8 r8:00000028 r7:00000050 r6:c0e61920 r5:c0e32b00 r4:c0c0c3c0 fbcon_init from visual_init+0x9c/0xe0 r10:c0a03080 r9:00000000 r8:0000003e r7:c0a02f84 r6:00000000 r5:c0c0c624 r4:c0c0c3c0 visual_init from do_bind_con_driver+0x19c/0x370 r7:c0a02f84 r6:00000001 r5:c06affbc r4:c0c0c3c0 do_bind_con_driver from do_take_over_console+0x124/0x1b8 r10:c0a02f8c r9:00000001 r8:0000003e r7:00000000 r6:00000001 r5:00000000 r4:c06affbc do_take_over_console from do_fbcon_takeover+0x6c/0xcc r10:fffff000 r9:c0e32da4 r8:c07dbabc r7:00000000 r6:c0805a24 r5:c0a00cc8 r4:c0805a24 do_fbcon_takeover from fbcon_fb_registered+0xe4/0x128 r5:00000000 r4:c0a00dac fbcon_fb_registered from register_framebuffer+0x208/0x318 r7:c0e32b10 r6:c0a00ca8 r5:00000000 r4:c0e32b00 register_framebuffer from sa1100fb_probe+0x334/0x420 r9:00000000 r8:0000000d r7:00000004 r6:c07dfe08 r5:00000000 r4:c0e32b00 sa1100fb_probe from platform_probe+0x4c/0xac r10:c07cb85c r9:c07cb87c r8:00000000 r7:00000000 r6:c0805abc r5:c07dfe18 r4:00000000 platform_probe from really_probe+0xc0/0x280 r7:00000000 r6:c0805abc r5:c07dfe18 r4:00000000 really_probe from __driver_probe_device+0x84/0xe4 r7:00000000 r6:c07dfe18 r5:c0805abc r4:c07dfe18 __driver_probe_device from driver_probe_device+0x4c/0x10c r5:60000153 r4:c0a04c6c driver_probe_device from __driver_attach+0x90/0x104 r7:c080af60 r6:c045a7d0 r5:c0805abc r4:c07dfe18 __driver_attach from bus_for_each_dev+0x80/0xcc r7:c080af60 
r6:c045a7d0 r5:c0805abc r4:00000000 bus_for_each_dev from driver_attach+0x20/0x28 r6:00000000 r5:c0e5d340 r4:c0805abc driver_attach from bus_add_driver+0x140/0x1c8 bus_add_driver from driver_register+0x7c/0x110 r7:c07a4bc8 r6:c07bc340 r5:c0c19000 r4:c0805abc driver_register from __platform_driver_register+0x20/0x28 r5:c0c19000 r4:c0818000 __platform_driver_register from sa1100fb_init+0x28/0x3c sa1100fb_init from do_one_initcall+0x78/0x220 do_one_initcall from kernel_init_freeable+0x180/0x1fc r10:c07cb85c r9:c07cb87c r8:c0818000 r7:c07a4bc8 r6:00000007 r5:c0c427a0 r4:c07d2060 kernel_init_freeable from kernel_init+0x18/0x10c r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c0675804 r4:00000000 kernel_init from ret_from_fork+0x14/0x2c Exception stack(0xe0809fb0 to 0xe0809ff8) 9fa0: 00000000 00000000 00000000 00000000 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 r5:c0675804 r4:00000000 Code: e24ba02c e0026323 e08a6106 e5166044 (e58c6000) ---[ end trace 00000000c08187d8 ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b Reboot failed -- System halted
# bad: [dd315b5800612e6913343524aa9b993f9a8bb0cf] Add linux-next specific files for 20220324
# good: [f443e374ae131c168a065ea1748feac6b2e76613] Linux 5.17
git bisect start 'HEAD' 'v5.17'
# good: [6788381e2f3c20c25cf7ab91df9cf0d6bec153f9] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
git bisect good 6788381e2f3c20c25cf7ab91df9cf0d6bec153f9
# bad: [59c7e0caa3e7bc21dd1b6c681c87d2b307f399ee] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect bad 59c7e0caa3e7bc21dd1b6c681c87d2b307f399ee
# good: [4d17d43de9d186150b3289ce99d7a79fcff202f9] net: usb: asix: suspend embedded PHY if external is used
git bisect good 4d17d43de9d186150b3289ce99d7a79fcff202f9
# good: [6c64ae228f0826859c56711ce133aff037d6205f] Backmerge tag 'v5.17-rc6' into drm-next
git bisect good 6c64ae228f0826859c56711ce133aff037d6205f
# good: [01fd8d2522c49a333b0ee46ba19a6fedfc1c9a60] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
git bisect good 01fd8d2522c49a333b0ee46ba19a6fedfc1c9a60
# bad: [6de7e4f02640fba2ffa6ac04e2be13785d614175] Merge tag 'drm-msm-next-2022-03-01' of https://gitlab.freedesktop.org/drm/msm into drm-next
git bisect bad 6de7e4f02640fba2ffa6ac04e2be13785d614175
# bad: [c9e9ce0b6f85ac330adee912745048a0af5f315d] Merge tag 'drm-misc-next-2022-03-03' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad c9e9ce0b6f85ac330adee912745048a0af5f315d
# good: [e2573d5f2a5cebe789bbf415e484b589d8eebad7] drm/amd/display: limit unbounded requesting to 5k
git bisect good e2573d5f2a5cebe789bbf415e484b589d8eebad7
# good: [3c54c95bd917d43d12fe1b192df9aa4c5973449b] fbdev: Remove trailing whitespaces from cfbimgblt.c
git bisect good 3c54c95bd917d43d12fe1b192df9aa4c5973449b
# good: [ed6e76676b2657b71a0b9e5e847d96e4de0b394b] drm: rcar-du: lvds: Add r8a77961 support
git bisect good ed6e76676b2657b71a0b9e5e847d96e4de0b394b
# good: [66a8af1f6e3c10190dff14a5668661c092a2a85f] Merge tag 'drm/tegra/for-5.18-rc1' of https://gitlab.freedesktop.org/drm/tegra into drm-next
git bisect good 66a8af1f6e3c10190dff14a5668661c092a2a85f
# bad: [701920ca9822eb63b420b3bcb627f2c1ec759903] drm/ssd130x: remove redundant initialization of pointer mode
git bisect bad 701920ca9822eb63b420b3bcb627f2c1ec759903
# bad: [9ae2ac4d31a85ce59cc560d514a31b95f4ace154] drm: Add TODO item for optimizing format helpers
git bisect bad 9ae2ac4d31a85ce59cc560d514a31b95f4ace154
# bad: [0d03011894d23241db1a1cad5c12aede60897d5e] fbdev: Improve performance of cfb_imageblit()
git bisect bad 0d03011894d23241db1a1cad5c12aede60897d5e
# first bad commit: [0d03011894d23241db1a1cad5c12aede60897d5e] fbdev: Improve performance of cfb_imageblit()
On 3/24/22 12:18, Thomas Zimmermann wrote:
Hi
On 24.03.22 at 20:11, Guenter Roeck wrote:
Hi,
On Wed, Feb 23, 2022 at 08:38:03PM +0100, Thomas Zimmermann wrote:
Improve the performance of cfb_imageblit() by manually unrolling the inner blitting loop and moving some invariants out. The compiler failed to do this automatically. This change keeps cfb_imageblit() in sync with sys_imageblit().
A microbenchmark measures the average number of CPU cycles for cfb_imageblit() after a stabilizing period of a few minutes (i7-4790, FullHD, simpledrm, kernel with debugging).
cfb_imageblit(), new: 15724 cycles
cfb_imageblit(), old: 30566 cycles
In the optimized case, cfb_imageblit() is now ~2x faster than before.
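To illustrate the kind of transformation the commit message describes, here is a minimal user-space sketch. It is not the kernel's cfb_imageblit() code: the function name blit_row_mono_to_xrgb(), the 32-bit pixel format, and the MSB-first bit order are assumptions made for this example. It expands one row of a 1-bpp glyph into 32-bit pixels, computes the invariant fg ^ bg once per row, and unrolls the eight stores that belong to each source byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: expand one row of a 1-bpp bitmap into 32-bit
 * pixels. Bit 7 of each source byte is assumed to be the leftmost pixel.
 */
void blit_row_mono_to_xrgb(uint32_t *dst, const uint8_t *src,
			   size_t width, uint32_t fg, uint32_t bg)
{
	/* Loop invariants, computed once instead of once per pixel. */
	const uint32_t eorx = fg ^ bg;
	size_t bytes = width / 8;
	size_t rest = width % 8;
	size_t i;

	for (i = 0; i < bytes; i++) {
		uint32_t bits = *src++;

		/*
		 * -(bit) is all-ones for a set bit and zero otherwise, so
		 * bg ^ (eorx & mask) selects fg or bg without a branch.
		 * The eight stores are unrolled by hand.
		 */
		dst[0] = bg ^ (eorx & -((bits >> 7) & 1u));
		dst[1] = bg ^ (eorx & -((bits >> 6) & 1u));
		dst[2] = bg ^ (eorx & -((bits >> 5) & 1u));
		dst[3] = bg ^ (eorx & -((bits >> 4) & 1u));
		dst[4] = bg ^ (eorx & -((bits >> 3) & 1u));
		dst[5] = bg ^ (eorx & -((bits >> 2) & 1u));
		dst[6] = bg ^ (eorx & -((bits >> 1) & 1u));
		dst[7] = bg ^ (eorx & -(bits & 1u));
		dst += 8;
	}

	/* Remaining pixels of a partial trailing byte, if any. */
	if (rest) {
		uint32_t bits = *src;

		for (i = 0; i < rest; i++)
			*dst++ = bg ^ (eorx & -((bits >> (7 - i)) & 1u));
	}
}
```

Whether a compiler performs this unrolling on its own depends on the compiler version and flags; the commit message reports that it did not for the kernel code in question.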
v3: * fix commit description (Pekka)
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
This patch causes crashes with arm mainstone, z2, and collie emulations. Reverting it fixes the problem.
collie crash log and bisect log attached.
Does it work if you apply the fixes at
https://patchwork.freedesktop.org/series/101321/
?
Yes, it does, specifically the cfb related patch. I sent a Tested-by:.
Thanks, Guenter
Add a TODO item for optimizing blitting and format-conversion helpers in DRM and fbdev. There's always demand for faster graphics output.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
---
 Documentation/gpu/todo.rst | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst
index 7bf7f2111696..7f113c6a02dd 100644
--- a/Documentation/gpu/todo.rst
+++ b/Documentation/gpu/todo.rst
@@ -241,6 +241,28 @@ Contact: Thomas Zimmermann tzimmermann@suse.de, Daniel Vetter

 Level: Advanced

+Benchmark and optimize blitting and format-conversion function
+--------------------------------------------------------------
+
+Drawing to dispay memory quickly is crucial for many applications'
+performance.
+
+On at least x86-64, sys_imageblit() is significantly slower than
+cfb_imageblit(), even though both use the same blitting algorithm and
+the latter is written for I/O memory. It turns out that cfb_imageblit()
+uses movl instructions, while sys_imageblit apparently does not. This
+seems to be a problem with gcc's optimizer. DRM's format-conversion
+heleprs might be subject to similar issues.
+
+Benchmark and optimize fbdev's sys_() helpers and DRM's format-conversion
+helpers. In cases that can be further optimized, maybe implement a different
+algorithm, For micro-optimizations, use movl/movq instructions explicitly.
+That might possibly require architecture specific helpers (e.g., storel()
+storeq()).
+
+Contact: Thomas Zimmermann tzimmermann@suse.de
+
+Level: Intermediate

 drm_framebuffer_funcs and drm_mode_config_funcs.fb_create cleanup
 -----------------------------------------------------------------
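The TODO item above mentions possible storel()/storeq() helpers without defining them. The following is only a sketch of what a generic fallback might look like in user-space C. The signatures, the memcpy()-based implementation, and the fill_pixels32() caller are assumptions made for illustration, not existing kernel API. An architecture could replace the two helpers with inline assembly that emits movl/movq explicitly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical generic fallbacks. A fixed-size memcpy() is normally
 * compiled to a single 32-bit or 64-bit store and avoids aliasing
 * problems in user-space code.
 */
static inline void storel(void *dst, uint32_t val)
{
	memcpy(dst, &val, sizeof(val));
}

static inline void storeq(void *dst, uint64_t val)
{
	memcpy(dst, &val, sizeof(val));
}

/* Hypothetical caller: fill n 32-bit pixels using 64-bit stores where possible. */
void fill_pixels32(uint32_t *dst, size_t n, uint32_t pixel)
{
	/* Two identical pixels packed into one 64-bit word (endian-neutral). */
	const uint64_t pat = ((uint64_t)pixel << 32) | pixel;

	/* Leading pixel, if dst is not 8-byte aligned. */
	if (n && ((uintptr_t)dst & 7)) {
		storel(dst++, pixel);
		n--;
	}

	/* Aligned middle part, two pixels per store. */
	for (; n >= 2; n -= 2, dst += 2)
		storeq(dst, pat);

	/* Trailing pixel, if any. */
	if (n)
		storel(dst, pixel);
}
```

Keeping the store primitive behind a small helper means the fill and blit loops stay architecture-independent, while the choice of instruction is made in one place.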
On Wed, Feb 23, 2022 at 08:38:04PM +0100, Thomas Zimmermann wrote:
Add a TODO item for optimizing blitting and format-conversion helpers in DRM and fbdev. There's always demand for faster graphics output.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
Documentation/gpu/todo.rst | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst
index 7bf7f2111696..7f113c6a02dd 100644
--- a/Documentation/gpu/todo.rst
+++ b/Documentation/gpu/todo.rst
@@ -241,6 +241,28 @@ Contact: Thomas Zimmermann tzimmermann@suse.de, Daniel Vetter

Level: Advanced

+Benchmark and optimize blitting and format-conversion function
+--------------------------------------------------------------
+Drawing to dispay memory quickly is crucial for many applications'

display

+performance.
+On at least x86-64, sys_imageblit() is significantly slower than

On, at least x86-64, ...
To me the extra comma makes sense, but grammar is not my strong side.

+cfb_imageblit(), even though both use the same blitting algorithm and
+the latter is written for I/O memory. It turns out that cfb_imageblit()
+uses movl instructions, while sys_imageblit apparently does not. This
+seems to be a problem with gcc's optimizer. DRM's format-conversion
+heleprs might be subject to similar issues.

helpers

+Benchmark and optimize fbdev's sys_() helpers and DRM's format-conversion
+helpers. In cases that can be further optimized, maybe implement a different
+algorithm, For micro-optimizations, use movl/movq instructions explicitly.

algorithm. (period, not comma)

+That might possibly require architecture specific helpers (e.g., storel()
+storeq()).
+Contact: Thomas Zimmermann tzimmermann@suse.de
+Level: Intermediate
With the small fixes above: Acked-by: Sam Ravnborg sam@ravnborg.org
Another option would be to re-implement imageblit() to be DRM-specific. Maybe we can then throw out some legacy code and optimize only for the DRM use case. And then maybe only a small part of the code would differ depending on whether this is I/O memory or directly accessible memory.
Sam
On 2/23/22 20:38, Thomas Zimmermann wrote:
Add a TODO item for optimizing blitting and format-conversion helpers in DRM and fbdev. There's always demand for faster graphics output.
Signed-off-by: Thomas Zimmermann tzimmermann@suse.de
After fixing the typos mentioned by Sam:
Reviewed-by: Javier Martinez Canillas javierm@redhat.com
Best regards,
Hi,
Merged, with fixes for the typos in the final patch. Thanks for reviewing.
Best regards Thomas
dri-devel@lists.freedesktop.org