Hi All,
The latest mainline kernel branch fails to build for arm spear3xx_defconfig with the error:
In function 'edid_block_data', inlined from 'drm_edid_is_valid' at drivers/gpu/drm/drm_edid.c:1904:25: ././include/linux/compiler_types.h:352:45: error: call to '__compiletime_assert_250' declared with attribute error: BUILD_BUG_ON failed: sizeof(*edid) != EDID_LENGTH 352 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
git bisect pointed to f1e4c916f97f ("drm/edid: add EDID block count and size helpers")
And, reverting it on top of mainline branch has fixed the build failure.
-- Regards Sudip
On Fri, May 27, 2022 at 2:07 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
declared with attribute error: BUILD_BUG_ON failed: sizeof(*edid) != EDID_LENGTH
And, reverting it on top of mainline branch has fixed the build failure.
Hmm. That BUILD_BUG_ON() looks entirely correct, and if that sizeof() doesn't work, then the code doesn't work.
I'm not seeing what could go wrong in there, with all the structures I see being marked as __packed__.
I wonder if the union in 'struct detailed_timing' also wants that "__attribute__((packed))" but I also wonder what it is that would make this fail on spear3xx but not elsewhere.
Very strange. It would be interesting to know where that sizeof goes wrong, but it would seem to be something very peculiar to your build environment (either that config, or your compiler).
Linus
On Fri, May 27, 2022 at 7:56 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, May 27, 2022 at 2:07 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
declared with attribute error: BUILD_BUG_ON failed: sizeof(*edid) != EDID_LENGTH
And, reverting it on top of mainline branch has fixed the build failure.
Hmm. That BUILD_BUG_ON() looks entirely correct, and if that sizeof() doesn't work, then the code doesn't work.
<snip>
Very strange. It would be interesting to know where that sizeof goes wrong, but it would seem to be something very peculiar to your build environment (either that config, or your compiler).
I just tested with various values, sizeof(*edid) is 144 bytes at that place.
My last good build was with fdaf9a5840ac ("Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache") And my setup has not changed in anyway since then. Also verified the build failure on my laptop.
On Fri, May 27, 2022 at 4:41 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I just tested with various values, sizeof(*edid) is 144 bytes at that place.
Hmm. What compiler do you have? Because it seems very broken.
You don't actually have to try with various sizes, you could have just done something like
int size_of_edid(const struct edid *edid) { return sizeof(*edid); }
and then "make drivers/gpu/drm/drm_edid.s" to generate assembly and see what it looks like (obviously removing the BUG_ON() in order to build).
That obviously generates code like
movl $128, %eax ret
for me, and looking at the definition of that type I really can't see how it would ever generate anything else. But it's apparently not even close for you.
I suspect some of the structs inside of that 'struct edid' end up getting aligned, despite the '__attribute__((packed))'. For example, 'struct est_timings' is supposed to be just 3 bytes, and it's at an odd offset too (byte offset 35 in the 'struct edid' if I did the math correctly).
But it obviously doesn't happen for me or for most other people, so it's something in your setup. Unusual compiler?
Linus
Hi Linus,
On Fri, May 27, 2022 at 06:04:14PM -0700, Linus Torvalds wrote:
On Fri, May 27, 2022 at 4:41 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I just tested with various values, sizeof(*edid) is 144 bytes at that place.
Hmm. What compiler do you have? Because it seems very broken.
I am using gcc version 11.3.1 20220517 (GCC). And I am not just building spear3xx_defconfig, I am building all the arm configs with the same compiler in the same setup and only spear3xx_defconfig started failing. I am attaching a build summary report generated on 26th May, all arm builds passed, even allmodconfig.
<snip>
But it obviously doesn't happen for me or for most other people, so it's something in your setup. Unusual compiler?
And, just to verify its not my setup or the compiler I use, I took a fresh Debian Bullseye docker, installed 'gcc-arm-linux-gnueabi' from Debian and I see the same build failure with spear3xx_defconfig. This gcc from Debian Bullseye is: gcc version 10.2.1 20210110 (Debian 10.2.1-6).
-- Regards Sudip
On Fri, May 27, 2022 at 06:04:14PM -0700, Linus Torvalds wrote:
On Fri, May 27, 2022 at 4:41 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I just tested with various values, sizeof(*edid) is 144 bytes at that place.
Hmm. What compiler do you have? Because it seems very broken.
You don't actually have to try with various sizes, you could have just done something like
int size_of_edid(const struct edid *edid) { return sizeof(*edid); }
and then "make drivers/gpu/drm/drm_edid.s" to generate assembly and see what it looks like (obviously removing the BUG_ON() in order to build).
just tried this with make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- drivers/gpu/drm/drm_edid.s
That obviously generates code like
movl $128, %eax ret
and for me it looks like:
.L1030: .word .LC40 .word .LC41 .word -1431655765 .word .LC39 .size drm_edid_to_sad, .-drm_edid_to_sad .align 2 .global size_of_edid .syntax unified .arm .type size_of_edid, %function size_of_edid: @ args = 0, pretend = 0, frame = 0 @ frame_needed = 1, uses_anonymous_args = 0 mov ip, sp @, push {fp, ip, lr, pc} @ sub fp, ip, #4 @,, @ drivers/gpu/drm/drm_edid.c:1573: } mov r0, #144 @, ldmfd sp, {fp, sp, pc} @ .size size_of_edid, .-size_of_edid
-- Regards Sudip
On Sat, May 28, 2022 at 5:13 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
just tried this with make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- drivers/gpu/drm/drm_edid.s
size_of_edid: mov r0, #144 @, ldmfd sp, {fp, sp, pc} @
So digging a bit deeper - since I have am arm compiler after all - I note that 'sizeof(detailed_timings)' is 88.
Which is completely wrong. It should be 72 bytes (an array of 4 structures, each 18 bytes in size).
I have not dug deeper, but that is clearly the issue.
Now, why that only happens on that spear3xx_defconfig, I have no idea.
Linus
On Sat, May 28, 2022 at 10:40 AM Linus Torvalds torvalds@linux-foundation.org wrote:
So digging a bit deeper - since I have am arm compiler after all - I note that 'sizeof(detailed_timings)' is 88.
Hmm.
sizeof() both
detailed_timings[0].data.other_data.data.range.formula.gtf2
and
detailed_timings[0].data.other_data.data.range.formula.cvt
is 7.
But the *union* of those things is
detailed_timings[0].data.other_data.data.range.formula
and its size is 8 (despite having an alignment of just 1).
The attached patch would seem to fix it for me.
Not very much tested, and I have no idea what it is that triggers this only on spear3xx_defconfig.
Some ARM ABI issue that is triggered by some very particular ARM compiler flag enabled by that config?
Adding some ARM (and SPEAR, and SoC) people in case they have any idea.
This smells like a compiler bug triggered by "there's a 16-bit member field in that gtf2 structure, and despite it being packed and aligned to 1, we somehow still align the size to 2".
I dunno. But marking those unions packed too doesn't seem wrong, and does seem to fix it.
Linus
On Sat, May 28, 2022 at 8:08 PM Linus Torvalds torvalds@linux-foundation.org wrote:
Not very much tested, and I have no idea what it is that triggers this only on spear3xx_defconfig.
Some ARM ABI issue that is triggered by some very particular ARM compiler flag enabled by that config?
It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this option, you the kernel is built for the old 'OABI' that forces all non-packed struct members to be at least 16-bit aligned.
I think Russell still uses OABI kernels on his oldest machines, but it is incompatible with all modern user space and should probably not be in the defconfig.
Your patch looks like the correct solution to me.
Arnd
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann arnd@arndb.de wrote:
It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this option, you the kernel is built for the old 'OABI' that forces all non-packed struct members to be at least 16-bit aligned.
Looks like forced word (32 bit) alignment to me.
I wonder how many other structures that messes up, but I committed the EDID fix for now.
This has presumably been broken for a long time, but maybe the affected targets don't typically use EDID and kernel modesetting, and only use some fixed display setup instead.
Those structure definitions go back a _loong_ time (from a quick 'git blame' I see November 2008).
But despite that, I did not mark my fix 'cc:stable' because I don't know if any of those machines affected by this bad arm ABI issue could possibly care.
At least my tree hopefully now builds on them, with the BUILD_BUG_ON() that uncovered this.
Linus
On Sat, May 28, 2022 at 10:31 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann arnd@arndb.de wrote:
It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this option, you the kernel is built for the old 'OABI' that forces all non-packed struct members to be at least 16-bit aligned.
Looks like forced word (32 bit) alignment to me.
Ah, of course, I keep mixing it up with the odd structure alignment of m68k, which does the opposite and aligns struct members to no more than 16 bits.
Arnd
On Sat, 28 May 2022, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann arnd@arndb.de wrote:
It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this option, you the kernel is built for the old 'OABI' that forces all non-packed struct members to be at least 16-bit aligned.
Looks like forced word (32 bit) alignment to me.
I wonder how many other structures that messes up, but I committed the EDID fix for now.
Thanks for the fix, and the thorough commit message!
This has presumably been broken for a long time, but maybe the affected targets don't typically use EDID and kernel modesetting, and only use some fixed display setup instead.
Those structure definitions go back a _loong_ time (from a quick 'git blame' I see November 2008).
But despite that, I did not mark my fix 'cc:stable' because I don't know if any of those machines affected by this bad arm ABI issue could possibly care.
At least my tree hopefully now builds on them, with the BUILD_BUG_ON() that uncovered this.
Indeed the bug is ancient. I just threw in the BUILD_BUG_ON() on a whim as an extra sanity check when doing pointer arithmetics on struct edid *.
If there are affected machines, buffer overflows are the real danger due to edid->extensions indicating the number of extensions.
BR, Jani.
On Mon, 30 May 2022, Jani Nikula jani.nikula@intel.com wrote:
On Sat, 28 May 2022, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann arnd@arndb.de wrote:
It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this option, you the kernel is built for the old 'OABI' that forces all non-packed struct members to be at least 16-bit aligned.
Looks like forced word (32 bit) alignment to me.
I wonder how many other structures that messes up, but I committed the EDID fix for now.
Thanks for the fix, and the thorough commit message!
This has presumably been broken for a long time, but maybe the affected targets don't typically use EDID and kernel modesetting, and only use some fixed display setup instead.
Those structure definitions go back a _loong_ time (from a quick 'git blame' I see November 2008).
But despite that, I did not mark my fix 'cc:stable' because I don't know if any of those machines affected by this bad arm ABI issue could possibly care.
At least my tree hopefully now builds on them, with the BUILD_BUG_ON() that uncovered this.
Indeed the bug is ancient. I just threw in the BUILD_BUG_ON() on a whim as an extra sanity check when doing pointer arithmetics on struct edid *.
If there are affected machines, buffer overflows are the real danger due to edid->extensions indicating the number of extensions.
That is, for EDID. Makes you wonder about all the other packed structs with enum members across the kernel.
BR, Jani.
On Mon, May 30, 2022 at 11:33 AM Jani Nikula jani.nikula@intel.com wrote:
That is, for EDID. Makes you wonder about all the other packed structs with enum members across the kernel.
It is not the 'enum' that is special here, it's the 'union' having unpacked members, and the same thing happens when you have nested structs: both the inner and the outer aggregate need to be packed, either with __packed at the end, or on each individual member that is not fully aligned to max(sizeof(member), 4)).
I think in general, most __packed annotations we have in the kernel are completely pointless because they do not change the structure layout on any architecture but instead just make member access slower on architectures that lack unaligned load/store instructions. There have definitely been other cases though where a __packed annotation is not needed on any sane architecture but is needed for OABI ARM.
Overall I'm not that worried because the only machines running OABI kernels would be on really old hardware that runs a limited set of driver code.
A completely different matter are the extraneous __packed annotations that lead to possible problems when accessed through a misaligned pointer. We ignore -Waddress-of-packed-member and -Wcast-align in the kernel, so these never get caught at build time, but we have seen bugs from gcc making incorrect assumptions about alignment even on architectures that have unaligned load/store instructions.
Arnd
On Mon, 30 May 2022, Arnd Bergmann arnd@arndb.de wrote:
On Mon, May 30, 2022 at 11:33 AM Jani Nikula jani.nikula@intel.com wrote:
That is, for EDID. Makes you wonder about all the other packed structs with enum members across the kernel.
It is not the 'enum' that is special here, it's the 'union' having unpacked members,
Obviously meant union not enum, that was just a -ENOCOFFEE on my part.
and the same thing happens when you have nested structs: both the inner and the outer aggregate need to be packed, either with __packed at the end, or on each individual member that is not fully aligned to max(sizeof(member), 4)).
I think in general, most __packed annotations we have in the kernel are completely pointless because they do not change the structure layout on any architecture but instead just make member access slower on
Please explain.
They are used quite a bit for parsing blob data, or serialization/deserialization, like in the EDID case at hand. Try removing __attribute__((packed)) from include/drm/drm_edid.h and see the sizeof(struct edid) on any architecture.
BR, Jani.
architectures that lack unaligned load/store instructions. There have definitely been other cases though where a __packed annotation is not needed on any sane architecture but is needed for OABI ARM.
Overall I'm not that worried because the only machines running OABI kernels would be on really old hardware that runs a limited set of driver code.
A completely different matter are the extraneous __packed annotations that lead to possible problems when accessed through a misaligned pointer. We ignore -Waddress-of-packed-member and -Wcast-align in the kernel, so these never get caught at build time, but we have seen bugs from gcc making incorrect assumptions about alignment even on architectures that have unaligned load/store instructions.
Arnd
On Mon, May 30, 2022 at 02:43:45PM +0200, Arnd Bergmann wrote:
Overall I'm not that worried because the only machines running OABI kernels would be on really old hardware that runs a limited set of driver code.
... and from what I remember, none of them care about EDID anyway.
On Mon, May 30, 2022 at 12:33:17PM +0300, Jani Nikula wrote:
On Mon, 30 May 2022, Jani Nikula jani.nikula@intel.com wrote:
On Sat, 28 May 2022, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sat, May 28, 2022 at 11:59 AM Arnd Bergmann arnd@arndb.de wrote:
It's CONFIG_ARM_AEABI, which is normally set everywhere. Without this option, you the kernel is built for the old 'OABI' that forces all non-packed struct members to be at least 16-bit aligned.
Looks like forced word (32 bit) alignment to me.
I wonder how many other structures that messes up, but I committed the EDID fix for now.
Thanks for the fix, and the thorough commit message!
This has presumably been broken for a long time, but maybe the affected targets don't typically use EDID and kernel modesetting, and only use some fixed display setup instead.
Those structure definitions go back a _loong_ time (from a quick 'git blame' I see November 2008).
But despite that, I did not mark my fix 'cc:stable' because I don't know if any of those machines affected by this bad arm ABI issue could possibly care.
At least my tree hopefully now builds on them, with the BUILD_BUG_ON() that uncovered this.
Indeed the bug is ancient. I just threw in the BUILD_BUG_ON() on a whim as an extra sanity check when doing pointer arithmetics on struct edid *.
If there are affected machines, buffer overflows are the real danger due to edid->extensions indicating the number of extensions.
That is, for EDID. Makes you wonder about all the other packed structs with enum members across the kernel.
enum should not be used in structures if the layout of the struct matters. ISTR there was a proposal for EABI to make enums just about big enough to hold their enumerated constants - so you'd end up with 8-bit, 16-bit etc according to the largest enumerated value that the compiler thinks it could hold.
That's a latent disaster when enums get used in structs where the layout matters, __packed or not.
On Sat, May 28, 2022 at 11:08:48AM -0700, Linus Torvalds wrote:
This smells like a compiler bug triggered by "there's a 16-bit member field in that gtf2 structure, and despite it being packed and aligned to 1, we somehow still align the size to 2".
It's an age old thing, it's no compiler bug, and it's completely compliant with the C standards. Implementations are permitted by the C standard to pad structures and unions how they see fit - and some do if it makes sense for performance.
The mistake is that people forget this detail, and they expect structs and unions to be laid out a certain way - because it doesn't matter to the same extent on x86.
However, as older ARM CPUs could not do unaligned loads, ensuring that things were naturally aligned made complete sense, even if it meant that people who assume the world is x86 got tripped up - the only way around that would be to make every load very expensive.
It's not "align to size of 2" in OABI, it tends to be align to a multiple of 4, because the underlying architecture is 32-bit.
dri-devel@lists.freedesktop.org