During testing we observed that the last cacheline was not being flushed from a
	mb()
	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
		clflushopt();
	mb()
loop (where the initial addr and end were not cacheline aligned).
Changing the loop from addr < end to addr <= end, or replacing the clflushopt() with clflush(), both fixed the testcase, hinting that GCC was miscompiling the assembly within the loop and, specifically, that the alternative within clflushopt() was confusing the loop optimizer.
Adding a barrier() into clflushopt() is enough for GCC to dtrt, but solving why GCC is not seeing the constraints from the alternative_io() would be smarter...
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
Testcase: gem_tiled_partial_pwrite_pread/read
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Ross Zwisler ross.zwisler@linux.intel.com
Cc: H. Peter Anvin hpa@linux.intel.com
Cc: Imre Deak imre.deak@intel.com
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: dri-devel@lists.freedesktop.org
---
 arch/x86/include/asm/special_insns.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2270e41b32fd..0c7aedbf8930 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
 		".byte 0x66; clflush %P0",
 		X86_FEATURE_CLFLUSHOPT,
 		"+m" (*(volatile char __force *)__p));
+	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
+	 * meeting this alternative() and demonstrably miscompiles loops
+	 * iterating over clflushopts.
+	 */
+	barrier();
 }
 
 static inline void clwb(volatile void *__p)
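[Editor's note: the barrier() used in the patch above is the kernel's compiler-only barrier; in essence it is nothing more than the following sketch, which emits no instructions but stops GCC from caching memory values in registers or reordering memory accesses across it. It is not a CPU fence.]

	/* Compiler barrier: empty asm with a "memory" clobber. */
	#define barrier() asm volatile("" : : : "memory")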
On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
During testing we observed that the last cacheline was not being flushed from a
	mb()
	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
		clflushopt();
	mb()
loop (where the initial addr and end were not cacheline aligned).
Changing the loop from addr < end to addr <= end, or replacing the clflushopt() with clflush(), both fixed the testcase, hinting that GCC was miscompiling the assembly within the loop and, specifically, that the alternative within clflushopt() was confusing the loop optimizer.
Adding a barrier() into clflushopt() is enough for GCC to dtrt, but solving why GCC is not seeing the constraints from the alternative_io() would be smarter...
Hmm, would something like adding the memory clobber to the alternative_io() definition work?
---
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 7bfc85bbb8ff..d923e5dacdb1 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -207,7 +207,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 /* Like alternative_input, but with a single output argument */
 #define alternative_io(oldinstr, newinstr, feature, output, input...)	\
 	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
-		: output : "i" (0), ## input)
+		: output : "i" (0), ## input : "memory")
 
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, feature, output, input...)	\
On Mon, Oct 19, 2015 at 12:16:12PM +0200, Borislav Petkov wrote:
On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
Adding a barrier() into clflushopt() is enough for GCC to dtrt, but solving why GCC is not seeing the constraints from the alternative_io() would be smarter...
Hmm, would something like adding the memory clobber to the alternative_io() definition work?
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 7bfc85bbb8ff..d923e5dacdb1 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -207,7 +207,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
 /* Like alternative_input, but with a single output argument */
 #define alternative_io(oldinstr, newinstr, feature, output, input...)	\
 	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
-		: output : "i" (0), ## input)
+		: output : "i" (0), ## input : "memory")
 
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, feature, output, input...)	\
In order to add the clobbers, I had to adjust the macro slightly:
+#define alternative_output(oldinstr, newinstr, feature, output)	\
+	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
+		: output : "i" (0) : "memory")
and that also works. The concern I have here is that the asm does use "+m" as its output already, and the clflush() asm doesn't require the manual clobber. -Chris
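[Editor's note: to illustrate the concern, here is a purely toy sketch (not the kernel macro) of the difference between the two ways of informing GCC. A "+m" constraint on a single dereferenced byte only tells GCC that that one byte may be read and written by the asm, whereas a "memory" clobber makes GCC assume any memory may be touched, which also constrains what the loop optimizer may do.]

	/* Toy only: GCC is told that a single byte at p is an input/output
	 * ("+m"), so it may keep *other* memory cached in registers across
	 * this statement. */
	static inline void flush_line_narrow(volatile void *p)
	{
		asm volatile("clflush %0" : "+m" (*(volatile char *)p));
	}

	/* Toy only: the "memory" clobber forces GCC to assume any memory may
	 * be read or written by the asm. */
	static inline void flush_line_clobber(volatile void *p)
	{
		asm volatile("clflush %0" : "+m" (*(volatile char *)p) : : "memory");
	}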
On Mon, Oct 19, 2015 at 12:05:29PM +0100, Chris Wilson wrote:
In order to add the clobbers, I had to adjust the macro slightly:
+#define alternative_output(oldinstr, newinstr, feature, output) \
asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) \
: output : "i" (0) : "memory")
and that also works. The concern I have here is that the asm does use "+m" as its output already, and the clflush() asm doesn't require the manual clobber.
Yeah, you could make the above
#define alternative_output(oldinstr, newinstr, feature, output, clobber...)
and use it to define clflushopt(). hpa might have a better idea, though...
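[Editor's note: a sketch of the variant Borislav is suggesting, mirroring the existing alternative_io() with the clobber list made a caller-supplied vararg. This is an assumption based on the suggestion above, not a tested patch; handling of an empty clobber list is glossed over.]

	/* Like alternative_io(), but with an explicit clobber list. */
	#define alternative_output(oldinstr, newinstr, feature, output, clobber...)	\
		asm volatile(ALTERNATIVE(oldinstr, newinstr, feature)			\
			: output : "i" (0) : clobber)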
On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
During testing we observed that the last cacheline was not being flushed from a
	mb()
	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
		clflushopt();
	mb()
loop (where the initial addr and end were not cacheline aligned).
Changing the loop from addr < end to addr <= end, or replacing the clflushopt() with clflush(), both fixed the testcase, hinting that GCC was miscompiling the assembly within the loop and, specifically, that the alternative within clflushopt() was confusing the loop optimizer.
Adding a barrier() into clflushopt() is enough for GCC to dtrt, but solving why GCC is not seeing the constraints from the alternative_io() would be smarter...
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
Testcase: gem_tiled_partial_pwrite_pread/read
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Ross Zwisler ross.zwisler@linux.intel.com
Cc: H. Peter Anvin hpa@linux.intel.com
Cc: Imre Deak imre.deak@intel.com
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: dri-devel@lists.freedesktop.org
---
 arch/x86/include/asm/special_insns.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2270e41b32fd..0c7aedbf8930 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
 		".byte 0x66; clflush %P0",
 		X86_FEATURE_CLFLUSHOPT,
 		"+m" (*(volatile char __force *)__p));
+	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
+	 * meeting this alternative() and demonstrably miscompiles loops
+	 * iterating over clflushopts.
+	 */
+	barrier();
 }
 
 static inline void clwb(volatile void *__p)
2.6.1
I'm having trouble recreating this - can you help me get a simple reproducer?
I've been using the patch at the end of this mail as a test case.
If the issue you are seeing is indeed a compiler error, the issue should be visible in the resulting assembly. I've dumped the assembly using objdump for the new loop in pmem_init() using both clflush() and clflushopt() in the loop, and the loops are exactly the same except that the clflushopt case has an extra byte as part of the clflush instruction, which is our NOOP prefix - we need this so that the instruction is the right length so we can swap clflushopt in.
clflush() in loop: 3a: 0f ae 3a clflush (%rdx)
clflushopt() in loop: 3a: 3e 0f ae 3a clflush %ds:(%rdx)
We also get an extra entry in the <.altinstr_replacement> section in the clflushopt case, as expected:
15: 66                   data16
16: 0f ae 3a             clflush (%rdx)
This all looks correct to me. I've tested with both GCC 4.9.2 and GCC 5.1.1, and they behave the same.
I also tried inserting a barrier() in the clflushopt() inline function, and it didn't change the resulting assembly for me.
What am I missing?
- Ross
-- 8< --
From 3e8c9cf1205c4505dcc29e09700c2990a3786664 Mon Sep 17 00:00:00 2001
From: Ross Zwisler ross.zwisler@linux.intel.com
Date: Mon, 19 Oct 2015 11:59:02 -0600
Subject: [PATCH] TESTING: understand clflushopt() compilation error

---
 drivers/nvdimm/pmem.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..3f735a6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -451,6 +451,20 @@ static int __init pmem_init(void)
 {
 	int error;
 
+	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
+	const unsigned int size = PAGE_SIZE;
+	void *addr = kmalloc(size, GFP_KERNEL);
+	void *end = addr + size;
+	void *p;
+
+	mb();
+	for (p = (void *)((unsigned long)addr & -clflush_size); p < end;
+			p += clflush_size)
+		clflush(p);
+	mb();
+
+	kfree(addr);
+
 	pmem_major = register_blkdev(0, "pmem");
 	if (pmem_major < 0)
 		return pmem_major;
On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
During testing we observed that the last cacheline was not being flushed from a
	mb()
	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
		clflushopt();
	mb()
loop (where the initial addr and end were not cacheline aligned).
Changing the loop from addr < end to addr <= end, or replacing the clflushopt() with clflush(), both fixed the testcase, hinting that GCC was miscompiling the assembly within the loop and, specifically, that the alternative within clflushopt() was confusing the loop optimizer.
Adding a barrier() into clflushopt() is enough for GCC to dtrt, but solving why GCC is not seeing the constraints from the alternative_io() would be smarter...
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
Testcase: gem_tiled_partial_pwrite_pread/read
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Ross Zwisler ross.zwisler@linux.intel.com
Cc: H. Peter Anvin hpa@linux.intel.com
Cc: Imre Deak imre.deak@intel.com
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: dri-devel@lists.freedesktop.org
---
 arch/x86/include/asm/special_insns.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2270e41b32fd..0c7aedbf8930 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
 		".byte 0x66; clflush %P0",
 		X86_FEATURE_CLFLUSHOPT,
 		"+m" (*(volatile char __force *)__p));
+	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
+	 * meeting this alternative() and demonstrably miscompiles loops
+	 * iterating over clflushopts.
+	 */
+	barrier();
 }
Or an alternative:
+#define alternative_output(oldinstr, newinstr, feature, output)	\
+	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
+		: output : "i" (0) : "memory")
I would really appreciate some knowledgeable folks taking a look at the asm for clflushopt() as it still affects today's kernel and gcc.
Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range() is similarly affected. -Chris
On Thu, Jan 7, 2016 at 2:16 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
During testing we observed that the last cacheline was not being flushed from a
	mb()
	for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
		clflushopt();
	mb()
loop (where the initial addr and end were not cacheline aligned).
Changing the loop from addr < end to addr <= end, or replacing the clflushopt() with clflush(), both fixed the testcase, hinting that GCC was miscompiling the assembly within the loop and, specifically, that the alternative within clflushopt() was confusing the loop optimizer.
Adding a barrier() into clflushopt() is enough for GCC to dtrt, but solving why GCC is not seeing the constraints from the alternative_io() would be smarter...
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
Testcase: gem_tiled_partial_pwrite_pread/read
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Ross Zwisler ross.zwisler@linux.intel.com
Cc: H. Peter Anvin hpa@linux.intel.com
Cc: Imre Deak imre.deak@intel.com
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: dri-devel@lists.freedesktop.org
---
 arch/x86/include/asm/special_insns.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2270e41b32fd..0c7aedbf8930 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
 		".byte 0x66; clflush %P0",
 		X86_FEATURE_CLFLUSHOPT,
 		"+m" (*(volatile char __force *)__p));
+	/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
+	 * meeting this alternative() and demonstrably miscompiles loops
+	 * iterating over clflushopts.
+	 */
+	barrier();
 }
Or an alternative:
+#define alternative_output(oldinstr, newinstr, feature, output) \
asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) \
: output : "i" (0) : "memory")
I would really appreciate some knowledgeable folks taking a look at the asm for clflushopt() as it still affects today's kernel and gcc.
Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range() is similarly affected.
Unless I'm mis-reading the asm, clflush_cache_range() is compiled correctly for me. (I don't know what the %P is for in the asm, but that shouldn't matter.) The ALTERNATIVE shouldn't even be visible to the optimizer.
Can you attach a bad .s file and let us know what gcc version this is? (You can usually do 'make foo/bar/baz.s' to get a .s file.) I'd also be curious whether changing clflushopt to clwb works around the issue.
--Andy
On Thu, Jan 07, 2016 at 09:55:51AM -0800, Andy Lutomirski wrote:
On Thu, Jan 7, 2016 at 2:16 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
/* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
* meeting this alternative() and demonstrably miscompiles loops
* iterating over clflushopts.
*/
barrier();
}
Or an alternative:
+#define alternative_output(oldinstr, newinstr, feature, output) \
asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) \
: output : "i" (0) : "memory")
I would really appreciate some knowledgeable folks taking a look at the asm for clflushopt() as it still affects today's kernel and gcc.
Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range() is similarly affected.
Unless I'm mis-reading the asm, clflush_cache_range() is compiled correctly for me. (I don't know what the %P is for in the asm, but that shouldn't matter.) The ALTERNATIVE shouldn't even be visible to the optimizer.
Can you attach a bad .s file and let us know what gcc version this is? (You can usually do 'make foo/bar/baz.s' to get a .s file.) I'd also be curious whether changing clflushopt to clwb works around the issue.
Now I feel silly. Looking at the .s, there is no difference with the addition of the barrier to clflush_cache_range(). And sure enough letting the test run for longer, we see a failure. I fell for a placebo.
The failing assertion is always on the last cacheline and is always one value behind. Oh well, back to wondering where we miss the flush. -Chris
On 01/07/16 11:44, Chris Wilson wrote:
Now I feel silly. Looking at the .s, there is no difference with the addition of the barrier to clflush_cache_range(). And sure enough letting the test run for longer, we see a failure. I fell for a placebo.
The failing assertion is always on the last cacheline and is always one value behind. Oh well, back to wondering where we miss the flush. -Chris
Could you include the assembly here?
-hpa
On Thu, Jan 07, 2016 at 01:05:35PM -0800, H. Peter Anvin wrote:
On 01/07/16 11:44, Chris Wilson wrote:
Now I feel silly. Looking at the .s, there is no difference with the addition of the barrier to clflush_cache_range(). And sure enough letting the test run for longer, we see a failure. I fell for a placebo.
The failing assertion is always on the last cacheline and is always one value behind. Oh well, back to wondering where we miss the flush. -Chris
Could you include the assembly here?
Sure, here you go:
.LHOTB18:
	.p2align 4,,15
	.globl	clflush_cache_range
	.type	clflush_cache_range, @function
clflush_cache_range:
.LFB2505:
	.loc 1 131 0
	.cfi_startproc
.LVL194:
1:	call	__fentry__
	.loc 1 132 0
	movzwl	boot_cpu_data+198(%rip), %eax
	.loc 1 131 0
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	.loc 1 133 0
	movl	%esi, %esi
.LVL195:
	addq	%rdi, %rsi
.LVL196:
	.loc 1 131 0
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	.loc 1 132 0
	subl	$1, %eax
	cltq
.LVL197:
	.loc 1 136 0
#APP
# 136 "arch/x86/mm/pageattr.c" 1
	mfence
# 0 "" 2
	.loc 1 138 0
#NO_APP
	notq	%rax
.LVL198:
	andq	%rax, %rdi
.LVL199:
	cmpq	%rdi, %rsi
	jbe	.L216
.L217:
.LBB1741:
.LBB1742:
	.loc 8 198 0
#APP
# 198 "./arch/x86/include/asm/special_insns.h" 1
661:
	.byte 0x3e; clflush (%rdi)
662:
	.skip -(((6651f-6641f)-(662b-661b)) > 0) * ((6651f-6641f)-(662b-661b)),0x90
663:
	.pushsection .altinstructions,"a"
	.long 661b - .
	.long 6641f - .
	.word ( 9*32+23)
	.byte 663b-661b
	.byte 6651f-6641f
	.byte 663b-662b
	.popsection
	.pushsection .altinstr_replacement, "ax"
6641:
	.byte 0x66; clflush (%rdi)
6651:
	.popsection
# 0 "" 2
#NO_APP
.LBE1742:
.LBE1741:
	.loc 1 141 0
	.loc 1 139 0
	movzwl	boot_cpu_data+198(%rip), %eax
	addq	%rax, %rdi
	.loc 1 138 0
	cmpq	%rdi, %rsi
	ja	.L217
.L216:
	.loc 1 144 0
#APP
# 144 "arch/x86/mm/pageattr.c" 1
	mfence
# 0 "" 2
	.loc 1 145 0
#NO_APP
	popq	%rbp
	.cfi_restore 6
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE2505:
	.size	clflush_cache_range, .-clflush_cache_range
	.section	.text.unlikely
Whilst you are looking at this asm, note that we reload boot_cpu_data.x86_clflush_size every time around the loop. That's a small but noticeable extra cost (especially when we are only flushing a single cacheline).
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a3137a4..2cd2b4b 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -129,14 +129,13 @@ within(unsigned long addr, unsigned long start, unsigned long end)
  */
 void clflush_cache_range(void *vaddr, unsigned int size)
 {
-	unsigned long clflush_mask = boot_cpu_data.x86_clflush_size - 1;
+	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
 	void *vend = vaddr + size;
-	void *p;
+	void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
 
 	mb();
 
-	for (p = (void *)((unsigned long)vaddr & ~clflush_mask);
-	     p < vend; p += boot_cpu_data.x86_clflush_size)
+	for (; p < vend; p += clflush_size)
 		clflushopt(p);
 
mb();
-Chris
On 01/07/16 13:54, Chris Wilson wrote:
Whilst you are looking at this asm, note that we reload boot_cpu_data.x86_clflush_size every time around the loop. That's a small but noticeable extra cost (especially when we are only flushing a single cacheline).
I did notice that; I don't know if this is a compiler failure to do an obvious hoisting (which should then be merged with the other load) or a consequence of the volatile. It might be useful to report this to the gcc bugzilla to let them look at it.
Either way, the diff looks good and you can add my:
Acked-by: H. Peter Anvin hpa@linux.intel.com
However, I see absolutely nothing wrong with the assembly other than minor optimization possibilities: since gcc generates an early-out test as well as a late-out test anyway, we could add an explicit:
if (p >= vend) return;
before the first mb() at no additional cost (assuming gcc is smart enough to skip the second test; otherwise that can easily be done manually by replacing the for loop with a do { } while loop).
I would be very interested in knowing if replacing the final clflushopt with a clflush would resolve your problems (in which case the last mb() shouldn't be necessary either.)
Something like:
void clflush_cache_range(void *vaddr, unsigned int size)
{
	unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	char *vend = (char *)vaddr + size;
	char *p = (char *)((unsigned long)vaddr & ~(clflush_size - 1));

	if (p >= vend)
		return;

	mb();

	for (; p < vend - clflush_size; p += clflush_size)
		clflushopt(p);

	clflush(p);	/* Serializing and thus a barrier */
}
On 01/07/16 14:29, H. Peter Anvin wrote:
I would be very interested in knowing if replacing the final clflushopt with a clflush would resolve your problems (in which case the last mb() shouldn't be necessary either.)
Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the same cache line.
Could you add a sync_cpu(); call to the end (can replace the final mb()) and see if that helps your case?
-hpa
On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
On 01/07/16 14:29, H. Peter Anvin wrote:
I would be very interested in knowing if replacing the final clflushopt with a clflush would resolve your problems (in which case the last mb() shouldn't be necessary either.)
Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the same cache line.
Could you add a sync_cpu(); call to the end (can replace the final mb()) and see if that helps your case?
s/sync_cpu()/sync_core()/
No. I still see failures on Baytrail and Braswell (Pineview is not affected) with the final mb() replaced with sync_core(). I can reproduce failures on Pineview by tweaking the clflush_cache_range() parameters, so I am fairly confident that it is validating the current code.
iirc sync_core() is cpuid, a heavy serialising instruction, an alternative to mfence. Is there anything else that I can infer about the nature of my bug from this result? -Chris
On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
On 01/07/16 14:29, H. Peter Anvin wrote:
I would be very interested in knowing if replacing the final clflushopt with a clflush would resolve your problems (in which case the last mb() shouldn't be necessary either.)
Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the same cache line.
Could you add a sync_cpu(); call to the end (can replace the final mb()) and see if that helps your case?
s/sync_cpu()/sync_core()/
No. I still see failures on Baytrail and Braswell (Pineview is not affected) with the final mb() replaced with sync_core(). I can reproduce failures on Pineview by tweaking the clflush_cache_range() parameters, so I am fairly confident that it is validating the current code.
iirc sync_core() is cpuid, a heavy serialising instruction, an alternative to mfence. Is there anything else that I can infer about the nature of my bug from this result?
No clue, but I don't know much about the underlying architecture.
Can you try clflush_cache_range()'ing one cacheline less and then manually doing clflushopt; mb on the last cache line, just to make sure that the helper is really doing the right thing? You could also try clflush instead of clflushopt to see if that makes a difference.
--Andy
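[Editor's note: a rough sketch of the experiment Andy is describing, under the assumption that vaddr and size are already cacheline-aligned. This is hypothetical test code, not a proposed fix.]

	/* Flush all but the last cache line via the helper, then handle the
	 * last line by hand, to rule out an off-by-one inside
	 * clflush_cache_range(). */
	static void flush_split_test(void *vaddr, unsigned int size)
	{
		unsigned int line = boot_cpu_data.x86_clflush_size;

		clflush_cache_range(vaddr, size - line);	/* everything but the tail */
		clflushopt(vaddr + size - line);		/* last cache line by hand */
		mb();
	}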
On Sat, Jan 09, 2016 at 02:36:03PM -0800, Andy Lutomirski wrote:
On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
On 01/07/16 14:29, H. Peter Anvin wrote:
I would be very interested in knowing if replacing the final clflushopt with a clflush would resolve your problems (in which case the last mb() shouldn't be necessary either.)
Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the same cache line.
Could you add a sync_cpu(); call to the end (can replace the final mb()) and see if that helps your case?
s/sync_cpu()/sync_core()/
No. I still see failures on Baytrail and Braswell (Pineview is not affected) with the final mb() replaced with sync_core(). I can reproduce failures on Pineview by tweaking the clflush_cache_range() parameters, so I am fairly confident that it is validating the current code.
iirc sync_core() is cpuid, a heavy serialising instruction, an alternative to mfence. Is there anything that else I can infer about the nature of my bug from this result?
No clue, but I don't know much about the underlying architecture.
Can you try clflush_cache_range()'ing one cacheline less and then manually doing clflushopt; mb on the last cache line, just to make sure that the helper is really doing the right thing? You could also try clflush instead of clflushopt to see if that makes a difference.
I had looked at increasing the range over which clflush_cache_range() runs (using roundup/rounddown by cache lines), but it took something like +/- 256 bytes to pass all the tests. And also did s/clflushopt/clflush/ to confirm that made no difference.
Bizarrely,
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6000ad7..cf074400 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
 	for (; p < vend; p += clflush_size)
 		clflushopt(p);
 
+	clflushopt(vend-1);
 	mb();
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);
works like a charm. -Chris
On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
Bizarrely,
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6000ad7..cf074400 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
 	for (; p < vend; p += clflush_size)
 		clflushopt(p);
 
+	clflushopt(vend-1);
 	mb();
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);
works like a charm.
Have you checked all your callers? If the above makes a difference, it really sounds like the caller has passed in a size of zero, resulting in no cache flush, because the caller had incorrect ranges. The additional clflushopt now flushes the previous cacheline that wasn't flushed correctly before.
That "size was zero" thing would explain why changing the loop to "p <= vend" also fixes things for you.
IOW, just how sure are you that all the ranges are correct?
Linus
On Mon, Jan 11, 2016 at 12:11:05PM -0800, Linus Torvalds wrote:
On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
Bizarrely,
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6000ad7..cf074400 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
 	for (; p < vend; p += clflush_size)
 		clflushopt(p);
 
+	clflushopt(vend-1);
 	mb();
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);
works like a charm.
Have you checked all your callers? If the above makes a difference, it really sounds like the caller has passed in a size of zero, resulting in no cache flush, because the caller had incorrect ranges. The additional clflushopt now flushes the previous cacheline that wasn't flushed correctly before.
That "size was zero" thing would explain why changing the loop to "p <= vend" also fixes things for you.
This is on top of HPA's suggestion to do the size==0 check up front.
IOW, just how sure are you that all the ranges are correct?
All our callers are of the pattern:
	memcpy(dst, vaddr, size)
	clflush_cache_range(dst, size)
or
	clflush_cache_range(vaddr, size)
	memcpy(dst, vaddr, size)
I am reasonably confident that the ranges are sane. I've tried to verify that we do the clflushes by forcing them. However, if I clflush the whole object instead of the cachelines around the copies, the tests pass. (Since that flushes up to a couple of megabytes instead of a few hundred bytes, it is hard to draw any conclusions about what the bug might be.)
I can narrow down the principal buggy path by doing the clflush(vend-1) in the callers at least.
The problem is that the tests that fail are those looking for bugs in the coherency code, which may just as well be caused by the GPU writing into those ranges at the same time as the CPU trying to read them. I've looked into timing and tried adding udelay()s or uncached mmio along the suspect paths, but that didn't change the presentation - having a udelay fix the issue is usually a good indicator of a GPU write that hasn't landed before the CPU read.
The bug only affects a couple of recent non-coherent platforms; earlier Atoms and older Core seem unaffected. That may also mean that it is the GPU flush instruction that changed between platforms and isn't working (as we intended, at least).
Thanks for everyone's help and suggestions, -Chris
On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
I can narrow down the principal buggy path by doing the clflush(vend-1) in the callers at least.
That leads to the suspect path being a read back of a cache line from main memory that was just written to by the GPU. Writes to memory before using them on the GPU do not seem to be affected (or at least we have sufficient flushing in sending the commands to the GPU that we don't notice anything wrong).
And back to the oddity.
Instead of doing:
	clflush_cache_range(vaddr + offset, size);
	clflush(vaddr+offset+size-1);
	mb();
	memcpy(user, vaddr+offset, size);
what also worked was:
	clflush_cache_range(vaddr + offset, size);
	clflush(vaddr);
	mb();
	memcpy(user, vaddr+offset, size);
(size is definitely non-zero, offset is offset_in_page(), vaddr is from kmap_atomic()).
i.e.
void clflush_cache_range(void *vaddr, unsigned int size)
{
	const unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
	void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
	void *vend = vaddr + size;

	if (p >= vend)
		return;

	mb();

	for (; p < vend; p += clflush_size)
		clflushopt(p);

	clflushopt(vaddr);

	mb();
}
I have also confirmed that this doesn't just happen for single cachelines (i.e. where the earlier clflush(vend-1) and this clflush(vaddr) would be equivalent).
At the moment I am more inclined to think this is serialising the clflush() (since a clflush to the same cacheline is regarded as ordered with respect to the earlier clflush, iirc) rather than the writes not landing in time from the GPU.
Am I completely going mad? -Chris
On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
I can narrow down the principal buggy path by doing the clflush(vend-1) in the callers at least.
That leads to the suspect path being a read back of a cache line from main memory that was just written to by the GPU.
How do you know it was written by the GPU?
Maybe it's a memory ordering issue on the GPU. Say it writes something to memory, then sets the "I'm done" flag (or whatever you check), but because of ordering on the GPU the "I'm done" flag becomes visible before the data.
So the reason you see the old content may just be that the GPU writes are still buffered on the GPU. And you adding a clflushopt on the same address just changes the timing enough that you don't see the memory ordering any more (or it's just much harder to see, it might still be there).
Maybe the reason you only see the problem with the last cacheline is simply that the "last" cacheline is also the last that was written by the GPU, and it's still in the GPU write buffers.
Also, did you ever print out the value of clflush_size? Maybe we just got it wrong and it's bogus data.
Linus
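[Editor's note: a trivial sketch of such a check, placed anywhere the value is in scope, e.g. at the top of clflush_cache_range():]

	/* Sanity-check the cacheline size the flush loop is using. */
	pr_info("x86_clflush_size = %u\n", boot_cpu_data.x86_clflush_size);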
On Tue, Jan 12, 2016 at 09:05:19AM -0800, Linus Torvalds wrote:
On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
I can narrow down the principal buggy path by doing the clflush(vend-1) in the callers at least.
That leads to the suspect path being a read back of a cache line from main memory that was just written to by the GPU.
How do you know it was written by the GPU?
Test construction: write some data, copy it on the GPU, read it back. Repeat for various data, sequences of copy (differing GPU instructions, intermediates etc), and accessors.
Maybe it's a memory ordering issue on the GPU. Say it writes something to memory, then sets the "I'm done" flag (or whatever you check), but because of ordering on the GPU the "I'm done" flag becomes visible before the data.
That is a continual worry. To try and assuage that fear, I sent 8x flush gpu writes between the end of the copy and setting the "I'm done" flag. The definition of the GPU flush is that it both flushes all previous writes before it completes and only after it completes does it do the post-sync write (before moving onto the next command). The spec is always a bit hazy on what order the memory writes will be visible on the CPU though.
Sending the 8x GPU flushes before marking "I'm done" did not fix the corruption.
So the reason you see the old content may just be that the GPU writes are still buffered on the GPU. And you adding a clflushopt on the same address just changes the timing enough that you don't see the memory ordering any more (or it's just much harder to see, it might still be there).
Indeed. So I replaced the post-clflush_cache_range() clflush() with a udelay(10) instead, and the corruption vanished. Putting the udelay(10) before the clflush_cache_range() does not fix the corruption.
Maybe the reason you only see the problem with the last cacheline is simply that the "last" cacheline is also the last that was written by the GPU, and it's still in the GPU write buffers.
Exactly the fear.
Also, did you ever print out the value of clflush_size? Maybe we just got it wrong and it's bogus data.
It's 64 bytes as expected. And fudging it to any other value quickly explodes :)
Since:
	/* lots of GPU flushes + GPU/CPU sync point */
	udelay(10);
	clflush_cache_range(vaddr, size);
	memcpy(user, vaddr, size);
fails, but
	/* lots of GPU flushes + GPU/CPU sync point */
	clflush_cache_range(vaddr, size);
	udelay(10);
	memcpy(user, vaddr, size);
passes, I'm inclined to point the finger at the mb() following the clflush_cache_range(). -Chris
On Tue, Jan 12, 2016 at 1:13 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
That is a continual worry. To try and assuage that fear, I sent 8x flush gpu writes between the end of the copy and setting the "I'm done" flag. The definition of the GPU flush is that it both flushes all previous writes before it completes and only after it completes does it do the post-sync write (before moving onto the next command). The spec is always a bit hazy on what order the memory writes will be visible on the CPU though.
Sending the 8x GPU flushes before marking "I'm done" did not fix the corruption.
Ok. So assuming the GPU flushes are supposed to work, it should be all good.
So the reason you see the old content may just be that the GPU writes are still buffered on the GPU. And you adding a clflushopt on the same address just changes the timing enough that you don't see the memory ordering any more (or it's just much harder to see, it might still be there).
Indeed. So I replaced the post-clflush_cache_range() clflush() with a udelay(10) instead, and the corruption vanished. Putting the udelay(10) before the clflush_cache_range() does not fix the corruption.
Odd.
passes, I'm inclined to point the finger at the mb() following the clflush_cache_range().
We have an entirely unrelated discussion about the value of "mfence" as a memory barrier.
Mind trying to just make the memory barrier (in arch/x86/include/asm/barrier.h) be a locked op instead?
The docs say "Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; ..", so the mfence should be plenty good enough. But nobody sane uses mfence for memory ordering (that's the other discussion we're having), since a locked rmw instruction is faster.
So maybe it's a CPU bug. I'd still consider a GPU memory ordering bug *way* more likely (the CPU core tends to be better validated in my experience), but since you're trying odd things anyway, try changing the "mfence" to "lock; addl $0,0(%%rsp)" instead.
I doubt it makes any difference, but ..
Linus
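[Editor's note: a sketch of the experiment being suggested, assuming the change is made wherever mb() is defined or as a local helper in the test:]

	/* Locked RMW of the top-of-stack word: a full memory barrier on x86,
	 * and typically cheaper than mfence. CLFLUSHOPT is architecturally
	 * ordered against locked read-modify-write instructions. */
	#define mb_lock() asm volatile("lock; addl $0,0(%%rsp)" : : : "memory", "cc")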
On Tue, Jan 12, 2016 at 02:07:35PM -0800, Linus Torvalds wrote:
On Tue, Jan 12, 2016 at 1:13 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
Indeed. So I replaced the post-clflush_cache_range() clflush() with a udelay(10) instead, and the corruption vanished. Putting the udelay(10) before the clflush_cache_range() does not fix the corruption.
Odd.
passes, I'm inclined to point the finger at the mb() following the clflush_cache_range().
We have an entirely unrelated discussion about the value of "mfence" as a memory barrier.
Mind trying to just make the memory barrier (in arch/x86/include/asm/barrier.h) be a locked op instead?
Replacing the following mb() in clflush_cache_range() with "lock; addl $0,0(%%rsp)" gave no improvement. Neither did replacing all mb(). Undoubtedly the memory stream as seen by the CPU is serialised. The concern then is back to the visibility of the stream from the external device. Adding an uncached mmio read after the clflush_cache_range() also works, but I'm not clear if this achieves anything other than inserting a delay, or if it is flushing some other write buffer in the chipset. It is odd that it is required after the clflush_cache_range() and ineffective before. It is also odd that such a flush would be needed multiple times after the GPU write.
The double clflush() remains a mystery. -Chris
On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
The double clflush() remains a mystery.
Actually, I think it's explainable.
It's wrong to do the clflush *after* the GPU has done the write, which seems to be what you are doing.
Why?
If the GPU really isn't cache coherent, what can happen is:
- the CPU has the line cached
- the GPU writes the data
- you do the clflushopt to invalidate the cacheline
- you expect to see the GPU data.
Right?
Wrong. The above is complete crap.
Why?
Very simple reason: the CPU may have had the cacheline dirty at some level in its caches, so when you did the clflushopt, it didn't just invalidate the CPU cacheline, it wrote it back to memory. And in the process over-wrote the data that the GPU had written.
Now you can say "but the CPU never wrote to the cacheline, so it's not dirty in the CPU caches". That may or may not be true. The CPU may have written to it quite a long time ago.
So if you are doing a GPU write, and you want to see the data that the GPU wrote, you had better do the clflushopt long *before* the GPU ever writes to memory.
Your pattern of doing "flush and read" is simply fundamentally buggy. There are only two valid CPU flushing patterns:
- write and flush (to make the writes visible to the GPU)
- flush before starting GPU accesses, and then read
At no point can "flush and read" be right.
Now, I haven't actually seen your code, so I'm just going by your high-level description of where the CPU flush and CPU read were done, but it *sounds* like you did that invalid "flush and read" behavior.
Linus
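[Editor's note: expressed as code, the two valid orderings look roughly like the sketches below. The helper names are hypothetical and the flush must have completed before the GPU access begins in the second case.]

	/* Pattern 1: CPU produces data for the GPU -- write, then flush. */
	static void cpu_writes_for_gpu(void *vaddr, const void *src, size_t len)
	{
		memcpy(vaddr, src, len);
		clflush_cache_range(vaddr, len);	/* push dirty lines to memory */
		/* ...now tell the GPU to read... */
	}

	/* Pattern 2: GPU produces data for the CPU -- flush first, then read. */
	static void cpu_reads_from_gpu(void *dst, void *vaddr, size_t len)
	{
		clflush_cache_range(vaddr, len);	/* drop any CPU copies up front */
		/* ...start the GPU write and wait for it to finish... */
		memcpy(dst, vaddr, len);
	}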
On Tue, Jan 12, 2016 at 6:06 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
The double clflush() remains a mystery.
Actually, I think it's explainable.
It's wrong to do the clflush *after* the GPU has done the write, which seems to be what you are doing.
Why?
If the GPU really isn't cache coherent, what can happen is:
the CPU has the line cached
the GPU writes the data
you do the clflushopt to invalidate the cacheline
you expect to see the GPU data.
Right?
Wrong. The above is complete crap.
Why?
Very simple reason: the CPU may have had the cacheline dirty at some level in its caches, so when you did the clflushopt, it didn't just invalidate the CPU cacheline, it wrote it back to memory. And in the process over-wrote the data that the GPU had written.
Now you can say "but the CPU never wrote to the cacheline, so it's not dirty in the CPU caches". That may or may not be true. The CPU may have written to it quite a long time ago.
So if you are doing a GPU write, and you want to see the data that the GPU wrote, you had better do the clflushopt long *before* the GPU ever writes to memory.
Your pattern of doing "flush and read" is simply fundamentally buggy. There are only two valid CPU flushing patterns:
write and flush (to make the writes visible to the GPU)
flush before starting GPU accesses, and then read
At no point can "flush and read" be right.
Now, I haven't actually seen your code, so I'm just going by your high-level description of where the CPU flush and CPU read were done, but it *sounds* like you did that invalid "flush and read" behavior.
Since barriers are on my mind: how strong a barrier is needed to prevent cache fills from being speculated across the barrier?
Concretely, if I do:
	clflush A
	clflush B
	load A
	load B
the architecture guarantees that (unless store forwarding happens) the value I see for B is at least as new as the value I see for A *with respect to other access within the coherency domain*. But the GPU isn't in the coherency domain at all.
Is:
	clflush A
	clflush B
	load A
	MFENCE
	load B
good enough? If it is, and if
	clflush A
	clflush B
	load A
	LOCK whatever
	load B
is *not*, then this might account for the performance difference.
In any event, it seems to me that what i915 is trying to do isn't really intended to be supported for WB memory. i915 really wants to force a read from main memory and to simultaneously prevent the CPU from writing back to main memory. Ick. I'd assume that:
	clflush A
	clflush B
	load A
	serializing instruction here
	load B
is good enough, as long as you make sure that the GPU does its writes after the clflushes make it all the way out to main memory (which might require a second serializing instruction in the case of clflushopt), but this still relies on the hardware prefetcher not prefetching B too early, which it's permitted to do even in the absence of any explicit access at all.
Presumably this is good enough on any implementation:
	clflush A
	clflush B
	load A
	clflush B
	load B
But that will be really, really slow. And you're still screwed if the hardware is permitted to arbitrarily change cache lines from S to M.
In other words, I'm not really convinced that x86 was ever intended to have well-defined behavior if something outside the coherency domain writes to a page of memory while that page is mapped WB. Of course, I'm also not sure how to reliably switch a page from WB to any other memory type short of remapping it and doing CLFLUSH after remapping.
SDM Volume 3 11.12.4 seems to agree with me.
Could the driver be changed to use WC or UC and to use MOVNTDQA on supported CPUs to get the performance back? It sounds like i915 is effectively doing PIO here, and reasonably modern CPUs have a nice set of fast PIO instructions.
--Andy
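[Editor's note: a very rough sketch of the kind of streaming load Andy is describing. The helper is hypothetical; a real version would need an SSE4.1 check, kernel_fpu_begin()/kernel_fpu_end() around the SSE use, a WC-mapped and 16-byte-aligned source, and wider copies per iteration.]

	/* Copy 16 bytes from (aligned) WC memory using a non-temporal load,
	 * avoiding a cacheable read of the source. Illustration only. */
	static void copy16_from_wc(void *dst, const void *src)
	{
		asm volatile("movntdqa (%1), %%xmm0\n\t"
			     "movdqu %%xmm0, (%0)"
			     : : "r" (dst), "r" (src)
			     : "memory", "xmm0");
	}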
On Tue, Jan 12, 2016 at 6:42 PM, Andy Lutomirski luto@amacapital.net wrote:
Since barriers are on my mind: how strong a barrier is needed to prevent cache fills from being speculated across the barrier?
I don't think there are *any* architectural guarantees.
I suspect that a real serializing instruction should do it. But I don't think even that is guaranteed.
Non-coherent IO is crazy. I really thought Intel had learnt their lesson, and finally made all the GPU's coherent. I'm afraid to even ask why Chris is actually working on some sh*t that requires clflush.
In general, you should probably do something nasty like
- flush before starting IO that generates data (to make sure you have no dirty cachelines that will write back and mess up)
- start the IO, wait for it to complete
- flush after finishing IO that generates the data (to make sure you have no speculative clean cachelines with stale data)
- read the data now.
Of course, what people actually end up doing to avoid all this is to mark the memory noncacheable.
And finally, the *correct* thing is to not have crap hardware, and have IO be cache coherent. Things that don't do that are shit. Really.
Linus
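[Editor's note: as a sketch, that "nasty" sequence for reading back non-coherent device writes would look something like the following; start_device_write() and wait_for_device() are hypothetical stand-ins for the device-side operations.]

	/* Reading data produced by a non-coherent device, per the steps above. */
	static void read_noncoherent(void *dst, void *vaddr, size_t len)
	{
		clflush_cache_range(vaddr, len);	/* 1: no dirty lines left to write back */
		start_device_write(vaddr, len);		/* 2: kick the device... */
		wait_for_device(vaddr, len);		/*    ...and wait for completion */
		clflush_cache_range(vaddr, len);	/* 3: drop speculative clean lines */
		memcpy(dst, vaddr, len);		/* 4: now read the data */
	}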
On Tue, Jan 12, 2016 at 06:06:34PM -0800, Linus Torvalds wrote:
On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
The double clflush() remains a mystery.
Actually, I think it's explainable.
It's wrong to do the clflush *after* the GPU has done the write, which seems to be what you are doing.
Why?
If the GPU really isn't cache coherent, what can happen is:
the CPU has the line cached
the GPU writes the data
you do the clflushopt to invalidate the cacheline
you expect to see the GPU data.
Right?
Wrong. The above is complete crap.
Why?
Very simple reason: the CPU may have had the cacheline dirty at some level in its caches, so when you did the clflushopt, it didn't just invalidate the CPU cacheline, it wrote it back to memory. And in the process over-wrote the data that the GPU had written.
Forgive me for being dense, but if we overwrite the GPU data in the backing struct page with the cacheline from the CPU, how do we see the results from the GPU afterwards?
Now you can say "but the CPU never wrote to the cacheline, so it's not dirty in the CPU caches". That may or may not be true. The CPU may have written to it quite a long time ago.
What we do is we clflush the entire object after it is written to by the CPU (including newly allocated objects from shmemfs, or pages being returned to us by shmemfs) before any operation on that object by the GPU. We have to so that the GPU sees the correct page contents. (If I change the clflush of the written objects to a wbinvd, that's not sufficient for the tests to pass.)
We do not clflush the object after we read the backing pages on the CPU before the next GPU operation, even if it is a GPU write. This should leave us with clean but stale cachelines. That's why we then clflush_cache_range() prior to the next read on the object; it is intended to be a pure cache line invalidation.
If we clflush the entire object between every CPU read back and the *next* GPU operation, it fails. If we clflush the object before every GPU write to it, it passes. And to refresh, we always clflush after a CPU write.
I am reasonably confident that any cachelines we dirty (or inherited) are flushed. What you are suggesting is that there are dirty cachelines regardless. I am also reasonably confident that even if we clflush the entire object after touching it before the GPU write, and clflush the individual cachelines again after the GPU write, we see the errors.
I haven't found the hole yet, or been convincingly able to explain the differences between gen. -Chris
On Wed, Jan 13, 2016 at 4:34 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
Forgive me for being dense, but if we overwrite the GPU data in the backing struct page with the cacheline from the CPU, how do we see the results from the GPU afterwards?
Hmm. Good point.
Ok, all the symptoms just say "writes from GPU are delayed and out of order".
Do you have access to the GPU hardware people?
I thought that all the modern Intel GPUs are cache-coherent. If this is some castrated chip where coherence is removed (perhaps because it is not working? perhaps a config setting?) maybe it needs some extra hardware setting to make the GPU "flush" operation actually do something. In a cache-coherent model, a flush could/should be a noop, so maybe the hardware is set for that kind of "flush does nothing" behavior.
Or maybe the GPU is just a buggy pile of crap.
Linus
On January 11, 2016 3:28:01 AM PST, Chris Wilson chris@chris-wilson.co.uk wrote:
On Sat, Jan 09, 2016 at 02:36:03PM -0800, Andy Lutomirski wrote:
On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
On 01/07/16 14:29, H. Peter Anvin wrote:
I would be very interested in knowing if replacing the final clflushopt with a clflush would resolve your problems (in which case the last mb() shouldn't be necessary either.)
Nevermind. CLFLUSH is not ordered with regards to CLFLUSHOPT to the same cache line.
Could you add a sync_cpu(); call to the end (can replace the final mb()) and see if that helps your case?
s/sync_cpu()/sync_core()/
No. I still see failures on Baytrail and Braswell (Pineview is not affected) with the final mb() replaced with sync_core(). I can reproduce failures on Pineview by tweaking the clflush_cache_range() parameters, so I am fairly confident that it is validating the current code.
iirc sync_core() is cpuid, a heavy serialising instruction, an alternative to mfence. Is there anything else that I can infer about the nature of my bug from this result?
No clue, but I don't know much about the underlying architecture.
Can you try clflush_cache_range()'ing one cacheline less and then manually doing clflushopt; mb on the last cache line, just to make sure that the helper is really doing the right thing? You could also try clflush instead of clflushopt to see if that makes a difference.
I had looked at increasing the range over which clflush_cache_range() runs (using roundup/rounddown by cache lines), but it took something like +/- 256 bytes to pass all the tests. And also did s/clflushopt/clflush/ to confirm that made no difference.
Bizarrely,
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6000ad7..cf074400 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
 	for (; p < vend; p += clflush_size)
 		clflushopt(p);
 
+	clflushopt(vend-1);
 	mb();
 }
 EXPORT_SYMBOL_GPL(clflush_cache_range);
works like a charm. -Chris
That clflushopt touches a cache line already touched and therefore serializes with it.