This patch set adds support for two new persistent memory instructions, pcommit and clwb. These instructions were announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
These patches apply cleanly to v3.18-rc4.
Here are some things of note:
- As with the clflushopt patches before this, I'm assuming that the addressing mode gcc generates for the original clflush instruction remains valid when clflush is given a 0x66 prefix to form clflushopt, and when xsaveopt is given a 0x66 prefix to form clwb. This has held true on my test machine for every test case I've come up with and for the new clwb code generated by this patch series.
- According to the SDM, xsaveopt also has a REX.W-prefixed form. I don't believe gcc will generate that prefix in x86_64 kernel code, so I don't think I need to account for it in the assembly generated for clwb. Please correct me if I'm wrong.
- The last three patches in this series update existing uses of clflushopt to use clwb instead. The assertion is that clwb is preferable in these cases because, like clflushopt, it leaves the cache line clean and ready for eviction, but it also leaves the line valid in the CPU cache; if the line is referenced again before it is evicted, we get a performance boost.
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
Ross Zwisler (6):
  x86: Add support for the pcommit instruction
  x86/alternative: Add alternative_io_2
  x86: Add support for the clwb instruction
  x86: Use clwb in clflush_cache_range
  x86: Use clwb in drm_clflush_page
  x86: Use clwb in drm_clflush_virt_range
 arch/x86/include/asm/alternative.h   | 14 ++++++++++++++
 arch/x86/include/asm/cpufeature.h    |  2 ++
 arch/x86/include/asm/special_insns.h | 16 ++++++++++++++++
 arch/x86/mm/pageattr.c               |  8 ++++----
 drivers/gpu/drm/drm_cache.c          | 12 ++++++------
 5 files changed, 42 insertions(+), 10 deletions(-)
Add support for the new pcommit instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
---
 arch/x86/include/asm/cpufeature.h    | 1 +
 arch/x86/include/asm/special_insns.h | 6 ++++++
 2 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 0bb1335..b3e6b89 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -225,6 +225,7 @@
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
 #define X86_FEATURE_ADX		( 9*32+19) /* The ADCX and ADOX instructions */
 #define X86_FEATURE_SMAP	( 9*32+20) /* Supervisor Mode Access Prevention */
+#define X86_FEATURE_PCOMMIT	( 9*32+22) /* PCOMMIT instruction */
 #define X86_FEATURE_CLFLUSHOPT	( 9*32+23) /* CLFLUSHOPT instruction */
 #define X86_FEATURE_AVX512PF	( 9*32+26) /* AVX-512 Prefetch */
 #define X86_FEATURE_AVX512ER	( 9*32+27) /* AVX-512 Exponential and Reciprocal */
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index e820c08..1709a2e 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,12 @@ static inline void clflushopt(volatile void *__p)
 			       "+m" (*(volatile char __force *)__p));
 }

+static inline void pcommit(void)
+{
+	alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8",
+		    X86_FEATURE_PCOMMIT);
+}
+
 #define nop() asm volatile ("nop")
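For context, the usage model I have in mind (per my reading of the 319433-022 document; this sketch is illustrative only and uses clwb() from a later patch in this series): pcommit is weakly ordered, so the affected cachelines must be written back and fenced before it, and pcommit itself must be fenced before any later stores that depend on the data being persistent.

	*pmem_addr = value;	/* store to persistent memory */
	clwb(pmem_addr);	/* write the dirty cacheline back */
	wmb();			/* sfence: order the writeback before pcommit */
	pcommit();		/* commit accepted stores to the persistent domain */
	wmb();			/* sfence: order pcommit before later stores */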
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
Add support for the new pcommit instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Signed-off-by: Ross Zwisler ross.zwisler@linux.intel.com Cc: H Peter Anvin h.peter.anvin@intel.com Cc: Ingo Molnar mingo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: David Airlie airlied@linux.ie Cc: dri-devel@lists.freedesktop.org Cc: x86@kernel.org
arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/special_insns.h | 6 ++++++ 2 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 0bb1335..b3e6b89 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -225,6 +225,7 @@ #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */ #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */ #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention */ +#define X86_FEATURE_PCOMMIT ( 9*32+22) /* PCOMMIT instruction */ #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and Reciprocal */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index e820c08..1709a2e 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -199,6 +199,12 @@ static inline void clflushopt(volatile void *__p) "+m" (*(volatile char __force *)__p)); }
+static inline void pcommit(void) +{
+	alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8",
+		    X86_FEATURE_PCOMMIT);
+}
Should this patch add the feature bit and cpuinfo entry to go with it?
--Andy
On Wed, 2014-11-12 at 19:25 -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
Add support for the new pcommit instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Signed-off-by: Ross Zwisler ross.zwisler@linux.intel.com Cc: H Peter Anvin h.peter.anvin@intel.com Cc: Ingo Molnar mingo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: David Airlie airlied@linux.ie Cc: dri-devel@lists.freedesktop.org Cc: x86@kernel.org
arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/special_insns.h | 6 ++++++ 2 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 0bb1335..b3e6b89 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -225,6 +225,7 @@ #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */ #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */ #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention */ +#define X86_FEATURE_PCOMMIT ( 9*32+22) /* PCOMMIT instruction */ #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and Reciprocal */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index e820c08..1709a2e 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -199,6 +199,12 @@ static inline void clflushopt(volatile void *__p) "+m" (*(volatile char __force *)__p)); }
+static inline void pcommit(void) +{
+	alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8",
+		    X86_FEATURE_PCOMMIT);
+}
Should this patch add the feature bit and cpuinfo entry to go with it?
--Andy
I think this patch does everything we need? The text for cpuinfo is auto-generated in arch/x86/kernel/cpu/capflags.c from the flags defined in arch/x86/include/asm/cpufeature.h, I think. Here's what I get in cpuinfo on my system with a faked-out CPUID saying that clwb and pcommit are present:
$ grep 'flags' /proc/cpuinfo
flags		: fpu <snip> erms pcommit clflushopt clwb xsaveopt
The X86_FEATURE_CLWB and X86_FEATURE_PCOMMIT flags are being set up according to what's in CPUID, and the proper alternatives are being triggered. I stuck some debug code in the alternatives code to see what was being patched in the presence and absence of each of the flags.
Is there something else I'm missing?
Thanks, - Ross
On Fri, Nov 14, 2014 at 1:07 PM, Ross Zwisler ross.zwisler@linux.intel.com wrote:
On Wed, 2014-11-12 at 19:25 -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
Add support for the new pcommit instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Signed-off-by: Ross Zwisler ross.zwisler@linux.intel.com Cc: H Peter Anvin h.peter.anvin@intel.com Cc: Ingo Molnar mingo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: David Airlie airlied@linux.ie Cc: dri-devel@lists.freedesktop.org Cc: x86@kernel.org
arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/special_insns.h | 6 ++++++ 2 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index 0bb1335..b3e6b89 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -225,6 +225,7 @@ #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */ #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */ #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention */ +#define X86_FEATURE_PCOMMIT ( 9*32+22) /* PCOMMIT instruction */ #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and Reciprocal */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index e820c08..1709a2e 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -199,6 +199,12 @@ static inline void clflushopt(volatile void *__p) "+m" (*(volatile char __force *)__p)); }
+static inline void pcommit(void) +{
+	alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8",
+		    X86_FEATURE_PCOMMIT);
+}
Should this patch add the feature bit and cpuinfo entry to go with it?
--Andy
I think this patch does everything we need? The text for cpuinfo is auto-generated in arch/x86/kernel/cpu/capflags.c from the flags defined in arch/x86/include/asm/cpufeature.h, I think. Here's what I get in cpuinfo on my system with a faked-out CPUID saying that clwb and pcommit are present:
$ grep 'flags' /proc/cpuinfo
flags		: fpu <snip> erms pcommit clflushopt clwb xsaveopt
The X86_FEATURE_CLWB and X86_FEATURE_PCOMMIT flags are being set up according to what's in CPUID, and the proper alternatives are being triggered. I stuck some debug code in the alternatives code to see what was being patched in the presence and absence of each of the flags.
Is there something else I'm missing?
No. I just missed the magical auto-generation part.
--Andy
Thanks,
- Ross
Add alternative_io_2 in the spirit of alternative_input_2 and alternative_io. This will allow us to have instructions with an output parameter that vary based on two CPU features.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
---
 arch/x86/include/asm/alternative.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 473bdbe..7d9ead9 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -180,6 +180,20 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
 		: output : "i" (0), ## input)

+/*
+ * This is similar to alternative_io. But it has two features and
+ * respective instructions.
+ *
+ * If CPU has feature2, newinstr2 is used.
+ * Otherwise, if CPU has feature1, newinstr1 is used.
+ * Otherwise, oldinstr is used.
+ */
+#define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2,	\
+			 feature2, output, input...)			\
+	asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1,	\
+				   newinstr2, feature2)			\
+		: output : "i" (0), ## input)
+
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, feature, output, input...)	\
 	asm volatile (ALTERNATIVE("call %P[old]", "call %P[new]", feature) \
Add support for the new clwb instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Here are some things of note:
- As with the clflushopt patches before this, I'm assuming that the addressing mode gcc generates for the original clflush instruction remains valid when clflush is given a 0x66 prefix to form clflushopt, and when xsaveopt is given a 0x66 prefix to form clwb. This has held true on my test machine for every test case I've come up with and for the new clwb code generated by this patch series.
- According to the SDM, xsaveopt also has a REX.W-prefixed form. I don't believe gcc will generate that prefix in x86_64 kernel code, so I don't think I need to account for it in the assembly generated for clwb. Please correct me if I'm wrong.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
---
 arch/x86/include/asm/cpufeature.h    |  1 +
 arch/x86/include/asm/special_insns.h | 10 ++++++++++
 2 files changed, 11 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index b3e6b89..fbbed34 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -227,6 +227,7 @@
 #define X86_FEATURE_SMAP	( 9*32+20) /* Supervisor Mode Access Prevention */
 #define X86_FEATURE_PCOMMIT	( 9*32+22) /* PCOMMIT instruction */
 #define X86_FEATURE_CLFLUSHOPT	( 9*32+23) /* CLFLUSHOPT instruction */
+#define X86_FEATURE_CLWB	( 9*32+24) /* CLWB instruction */
 #define X86_FEATURE_AVX512PF	( 9*32+26) /* AVX-512 Prefetch */
 #define X86_FEATURE_AVX512ER	( 9*32+27) /* AVX-512 Exponential and Reciprocal */
 #define X86_FEATURE_AVX512CD	( 9*32+28) /* AVX-512 Conflict Detection */
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 1709a2e..a328460 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -199,6 +199,16 @@ static inline void clflushopt(volatile void *__p)
 			       "+m" (*(volatile char __force *)__p));
 }

+static inline void clwb(volatile void *__p)
+{
+	alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
+			 ".byte 0x66; clflush %P0",
+			 X86_FEATURE_CLFLUSHOPT,
+			 ".byte 0x66; xsaveopt %P0",
+			 X86_FEATURE_CLWB,
+			 "+m" (*(volatile char __force *)__p));
+}
+
 static inline void pcommit(void)
 {
 	alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8",
On Tue, Nov 11, 2014 at 11:43:13AM -0700, Ross Zwisler wrote:
Add support for the new clwb instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Here are some things of note:
As with the clflushopt patches before this, I'm assuming that the addressing mode generated by the original clflush instruction will match the new clflush instruction with the 0x66 prefix for clflushopt, and for the xsaveopt instruction with the 0x66 prefix for clwb. For all the test cases that I've come up with and for the new clwb code generated by this patch series, this has proven to be true on my test machine.
According to the SDM, xsaveopt has a form where it has a REX.W prefix. I believe that this prefix will not be generated by gcc in x86_64 kernel code. Based on this, I don't believe I need to account for this extra prefix when dealing with the assembly language created for clwb. Please correct me if I'm wrong.
Signed-off-by: Ross Zwisler ross.zwisler@linux.intel.com Cc: H Peter Anvin h.peter.anvin@intel.com Cc: Ingo Molnar mingo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: David Airlie airlied@linux.ie Cc: dri-devel@lists.freedesktop.org Cc: x86@kernel.org
arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/special_insns.h | 10 ++++++++++ 2 files changed, 11 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index b3e6b89..fbbed34 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -227,6 +227,7 @@ #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention */ #define X86_FEATURE_PCOMMIT ( 9*32+22) /* PCOMMIT instruction */ #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ +#define X86_FEATURE_CLWB ( 9*32+24) /* CLWB instruction */ #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and Reciprocal */ #define X86_FEATURE_AVX512CD ( 9*32+28) /* AVX-512 Conflict Detection */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 1709a2e..a328460 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -199,6 +199,16 @@ static inline void clflushopt(volatile void *__p) "+m" (*(volatile char __force *)__p)); }
+static inline void clwb(volatile void *__p) +{
+	alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
Any particular reason for using 0x3e as a prefix to have the insns be the same size or is it simply because CLFLUSH can stomach it?
:-)
".byte 0x66; clflush %P0",
X86_FEATURE_CLFLUSHOPT,
".byte 0x66; xsaveopt %P0",
Huh, XSAVEOPT?!? Shouldn't that be CLWB??
X86_FEATURE_CLWB,
"+m" (*(volatile char __force *)__p));
+}
static inline void pcommit(void) { alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8", -- 1.9.3
On Tue, Nov 11, 2014 at 08:12:39PM +0100, Borislav Petkov wrote:
".byte 0x66; xsaveopt %P0",
Huh, XSAVEOPT?!? Shouldn't that be CLWB??
Bah, the same opcodes, only 0x66 prefix makes it into CLWB. Could use a comment I guess.
On Tue, 2014-11-11 at 20:19 +0100, Borislav Petkov wrote:
On Tue, Nov 11, 2014 at 08:12:39PM +0100, Borislav Petkov wrote:
".byte 0x66; xsaveopt %P0",
Huh, XSAVEOPT?!? Shouldn't that be CLWB??
Bah, the same opcodes, only 0x66 prefix makes it into CLWB. Could use a comment I guess.
Yep, it's weird, I know. :) I'll add a comment.
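Probably something along these lines (exact wording still to be decided):

	".byte 0x66; xsaveopt %P0",	/* xsaveopt + 0x66 prefix == clwb */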
Thanks, - Ross
On Tue, Nov 11, 2014 at 12:40:00PM -0700, Ross Zwisler wrote:
Yep, it's weird, I know. :)
But sure, saving opcode space, makes sense to me.
Btw, I'd still be interested about this:
+static inline void clwb(volatile void *__p) +{
alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
Any particular reason for using 0x3e as a prefix to have the insns be the same size or is it simply because CLFLUSH can stomach it?
Thanks.
On Tue, 2014-11-11 at 20:46 +0100, Borislav Petkov wrote:
On Tue, Nov 11, 2014 at 12:40:00PM -0700, Ross Zwisler wrote:
Yep, it's weird, I know. :)
But sure, saving opcode space, makes sense to me.
Btw, I'd still be interested about this:
+static inline void clwb(volatile void *__p) +{
alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
Any particular reason for using 0x3e as a prefix to have the insns be the same size or is it simply because CLFLUSH can stomach it?
Ah, sorry, I was still responding to your first mail. :) Response copied here to save searching:
Essentially we need one additional byte at the beginning of the clflush so that we can flip it into a clflushopt by changing that byte into a 0x66 prefix. Two options are to either insert a 1 byte ASM_NOP1, or to add a 1 byte NOP_DS_PREFIX. Both have no functional effect with the plain clflush, but I've been told that executing a clflush + prefix should be faster than executing a clflush + NOP.
I agree, this is useful info - I'll add it to the patch comments for v2.
Thank you for the feedback.
- Ross
On Tue, 2014-11-11 at 20:12 +0100, Borislav Petkov wrote:
On Tue, Nov 11, 2014 at 11:43:13AM -0700, Ross Zwisler wrote:
Add support for the new clwb instruction. This instruction was announced in the document "Intel Architecture Instruction Set Extensions Programming Reference" with reference number 319433-022.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Here are some things of note:
As with the clflushopt patches before this, I'm assuming that the addressing mode generated by the original clflush instruction will match the new clflush instruction with the 0x66 prefix for clflushopt, and for the xsaveopt instruction with the 0x66 prefix for clwb. For all the test cases that I've come up with and for the new clwb code generated by this patch series, this has proven to be true on my test machine.
According to the SDM, xsaveopt has a form where it has a REX.W prefix. I believe that this prefix will not be generated by gcc in x86_64 kernel code. Based on this, I don't believe I need to account for this extra prefix when dealing with the assembly language created for clwb. Please correct me if I'm wrong.
Signed-off-by: Ross Zwisler ross.zwisler@linux.intel.com Cc: H Peter Anvin h.peter.anvin@intel.com Cc: Ingo Molnar mingo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: David Airlie airlied@linux.ie Cc: dri-devel@lists.freedesktop.org Cc: x86@kernel.org
arch/x86/include/asm/cpufeature.h | 1 + arch/x86/include/asm/special_insns.h | 10 ++++++++++ 2 files changed, 11 insertions(+)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index b3e6b89..fbbed34 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -227,6 +227,7 @@ #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention */ #define X86_FEATURE_PCOMMIT ( 9*32+22) /* PCOMMIT instruction */ #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ +#define X86_FEATURE_CLWB ( 9*32+24) /* CLWB instruction */ #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and Reciprocal */ #define X86_FEATURE_AVX512CD ( 9*32+28) /* AVX-512 Conflict Detection */ diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h index 1709a2e..a328460 100644 --- a/arch/x86/include/asm/special_insns.h +++ b/arch/x86/include/asm/special_insns.h @@ -199,6 +199,16 @@ static inline void clflushopt(volatile void *__p) "+m" (*(volatile char __force *)__p)); }
+static inline void clwb(volatile void *__p) +{
+	alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
Any particular reason for using 0x3e as a prefix to have the insns be the same size or is it simply because CLFLUSH can stomach it?
:-)
Essentially we need one additional byte at the beginning of the clflush so that we can flip it into a clflushopt by changing that byte into a 0x66 prefix. Two options are to either insert a 1 byte ASM_NOP1, or to add a 1 byte NOP_DS_PREFIX. Both have no functional effect with the plain clflush, but I've been told that executing a clflush + prefix should be faster than executing a clflush + NOP.
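For anyone else following along, my understanding of the encodings involved (worth double-checking against the SDM and 319433-022):

	clflush    m8 :        0f ae /7
	clflushopt m8 :     66 0f ae /7   (clflush plus a 0x66 prefix)
	xsaveopt  mem :        0f ae /6
	clwb       m8 :     66 0f ae /6   (xsaveopt plus a 0x66 prefix)
	pcommit       :     66 0f ae f8

	3e 0f ae /7 is clflush with a DS segment-override prefix, so all three
	variants patched by the alternative end up the same length.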
On Tue, Nov 11, 2014 at 12:48:52PM -0700, Ross Zwisler wrote:
Essentially we need one additional byte at the beginning of the clflush so that we can flip it into a clflushopt by changing that byte into a 0x66 prefix. Two options are to either insert a 1 byte ASM_NOP1, or to add a 1 byte NOP_DS_PREFIX. Both have no functional effect with the plain clflush, but I've been told that executing a clflush + prefix should be faster than executing a clflush + NOP.
I see. :)
Thanks.
On Tue, Nov 11, 2014 at 11:43:13AM -0700, Ross Zwisler wrote:
+static inline void clwb(volatile void *__p) +{
+	alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
".byte 0x66; clflush %P0",
X86_FEATURE_CLFLUSHOPT,
".byte 0x66; xsaveopt %P0",
Btw, I'm afraid you're going to have to spell out those new instruction mnemonics as they aren't supported by older compilers:
  CC      drivers/gpu/drm/drm_cache.o
{standard input}: Assembler messages:
{standard input}:63: Error: no such instruction: `xsaveopt (%rax)'
{standard input}:249: Error: no such instruction: `xsaveopt (%rdi)'
{standard input}:283: Error: no such instruction: `xsaveopt -1(%rdx)'
make[1]: *** [drivers/gpu/drm/drm_cache.o] Error 1
make: *** [drivers/gpu/drm/drm_cache.o] Error 2

[boris@etch:14:35:40:k:14)-> gcc --version
gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
This is on a very old guest I have which has this gcc in it and the kernel supports everything gcc >= 3.2.
No, it doesn't. x86 requires 3.4+ at a minimum.
________________________________
From: Borislav Petkov
Sent: Wednesday, November 12, 2014 4:39:09 AM
To: Ross Zwisler
Cc: linux-kernel@vger.kernel.org; Anvin, H Peter; Ingo Molnar; Thomas Gleixner; David Airlie; dri-devel@lists.freedesktop.org; x86@kernel.org
Subject: Re: [PATCH 3/6] x86: Add support for the clwb instruction
On Tue, Nov 11, 2014 at 11:43:13AM -0700, Ross Zwisler wrote:
+static inline void clwb(volatile void *__p) +{
alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %P0",
".byte 0x66; clflush %P0",
X86_FEATURE_CLFLUSHOPT,
".byte 0x66; xsaveopt %P0",
Btw, I'm afraid you're going to have to spell out those new instruction mnemonics as they aren't supported by older compilers:
  CC      drivers/gpu/drm/drm_cache.o
{standard input}: Assembler messages:
{standard input}:63: Error: no such instruction: `xsaveopt (%rax)'
{standard input}:249: Error: no such instruction: `xsaveopt (%rdi)'
{standard input}:283: Error: no such instruction: `xsaveopt -1(%rdx)'
make[1]: *** [drivers/gpu/drm/drm_cache.o] Error 1
make: *** [drivers/gpu/drm/drm_cache.o] Error 2

[boris@etch:14:35:40:k:14)-> gcc --version
gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
This is on a very old guest I have which has this gcc in it and the kernel supports everything gcc >= 3.2.
-- Regards/Gruss, Boris.
Sent from a fat crate under my desk. Formatting is fine. --
On Wed, Nov 12, 2014 at 01:38:45PM +0000, Anvin, H Peter wrote:
No, it doesn't. x86 requires 3.4+ at a minimum.
The only test I see is:
#if GCC_VERSION < 30200
# error Sorry, your compiler is too old - please upgrade it.
#endif
And even if we do require 3.4, the build fails with 4.1+ so...
On Wed, 2014-11-12 at 15:12 +0100, Borislav Petkov wrote:
On Wed, Nov 12, 2014 at 01:38:45PM +0000, Anvin, H Peter wrote:
No, it doesn't. x86 requires 3.4+ at a minimum.
The only test I see is:
#if GCC_VERSION < 30200
# error Sorry, your compiler is too old - please upgrade it.
#endif
And even if we do require 3.4, the build fails with 4.1+ so...
Ah, dang, you're right. Okay, I'll figure out how to do this without using xsaveopt.
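One option (just a sketch, nothing settled yet): pin the address in %rax and emit the clwb encoding as raw bytes, so the assembler never sees the new mnemonic. The "a" constraint and the hard-coded ModRM byte below are my assumptions about the encoding and need double-checking:

	static inline void clwb(volatile void *__p)
	{
		/*
		 * Sketch only: the "a" constraint forces __p into %rax so that
		 * the hard-coded ModRM byte 0x30 (== (%rax)) below is valid.
		 */
		alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush (%%rax)",
				 ".byte 0x66; clflush (%%rax)",		/* clflushopt (%rax) */
				 X86_FEATURE_CLFLUSHOPT,
				 ".byte 0x66, 0x0f, 0xae, 0x30",	/* clwb (%rax) */
				 X86_FEATURE_CLWB,
				 "+m" (*(volatile char __force *)__p),
				 "a" (__p));
	}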
Thank you for pointing this out.
- Ross
If clwb is available on the system, use it in clflush_cache_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
---
 arch/x86/mm/pageattr.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 36de293..5229d45 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -126,8 +126,8 @@ within(unsigned long addr, unsigned long start, unsigned long end)
  * @vaddr:	virtual start address
  * @size:	number of bytes to flush
  *
- * clflushopt is an unordered instruction which needs fencing with mfence or
- * sfence to avoid ordering issues.
+ * clflushopt and clwb are unordered instructions which need fencing with
+ * mfence or sfence to avoid ordering issues.
  */
 void clflush_cache_range(void *vaddr, unsigned int size)
 {
@@ -136,11 +136,11 @@ void clflush_cache_range(void *vaddr, unsigned int size)
 	mb();

 	for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
-		clflushopt(vaddr);
+		clwb(vaddr);
 	/*
 	 * Flush any possible final partial cacheline:
 	 */
-	clflushopt(vend);
+	clwb(vend);

 	mb();
 }
If clwb is available on the system, use it in drm_clflush_page. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
---
 drivers/gpu/drm/drm_cache.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index a6b6906..aad9d82 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -34,9 +34,9 @@
 #if defined(CONFIG_X86)

 /*
- * clflushopt is an unordered instruction which needs fencing with mfence or
- * sfence to avoid ordering issues.  For drm_clflush_page this fencing happens
- * in the caller.
+ * clwb and clflushopt are unordered instructions which need fencing with
+ * mfence or sfence to avoid ordering issues.  For drm_clflush_page this
+ * fencing happens in the caller.
  */
 static void
 drm_clflush_page(struct page *page)
@@ -50,7 +50,7 @@ drm_clflush_page(struct page *page)

 	page_virtual = kmap_atomic(page);
 	for (i = 0; i < PAGE_SIZE; i += size)
-		clflushopt(page_virtual + i);
+		clwb(page_virtual + i);
 	kunmap_atomic(page_virtual);
 }
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: H Peter Anvin <h.peter.anvin@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Airlie <airlied@linux.ie>
Cc: dri-devel@lists.freedesktop.org
Cc: x86@kernel.org
---
 drivers/gpu/drm/drm_cache.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index aad9d82..84e9a04 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -138,8 +138,8 @@ drm_clflush_virt_range(void *addr, unsigned long length)
 		void *end = addr + length;
 		mb();
 		for (; addr < end; addr += boot_cpu_data.x86_clflush_size)
-			clflushopt(addr);
-		clflushopt(end - 1);
+			clwb(addr);
+		clwb(end - 1);
 		mb();
 		return;
 	}
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
I don't know exactly what drm_clflush_virt_range (and the other functions you're modifying similarly) are for, but it seems plausible to me that they're used before reads to make sure that non-coherent memory sees updated data. If that's true, then this will break it.
But maybe all the users are write to coherent memory that just need to ensure that whatever's backing the memory knows about the write.
FWIW, it may make sense to rename this function to drm_clwb_virt_range if you make this change.
--Andy
Signed-off-by: Ross Zwisler ross.zwisler@linux.intel.com Cc: H Peter Anvin h.peter.anvin@intel.com Cc: Ingo Molnar mingo@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: David Airlie airlied@linux.ie Cc: dri-devel@lists.freedesktop.org Cc: x86@kernel.org
drivers/gpu/drm/drm_cache.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index aad9d82..84e9a04 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -138,8 +138,8 @@ drm_clflush_virt_range(void *addr, unsigned long length)
 		void *end = addr + length;
 		mb();
 		for (; addr < end; addr += boot_cpu_data.x86_clflush_size)
-			clflushopt(addr);
-		clflushopt(end - 1);
+			clwb(addr);
+		clwb(end - 1);
 		mb();
 		return;
 	}
On Wed, Nov 12, 2014 at 07:14:21PM -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
I don't know exactly what drm_clflush_virt_range (and the other functions you're modifying similarly) are for, but it seems plausible to me that they're used before reads to make sure that non-coherent memory sees updated data. If that's true, then this will break it.
Why would it break it? The updated cachelines will be in memory, and subsequent reads will be serviced from the cache instead of going to memory, since the line is not invalidated as it would be by CLFLUSH.
/me is puzzled.
On Nov 13, 2014 3:20 AM, "Borislav Petkov" bp@alien8.de wrote:
On Wed, Nov 12, 2014 at 07:14:21PM -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
I don't know exactly what drm_clflush_virt_range (and the other functions you're modifying similarly) are for, but it seems plausible to me that they're used before reads to make sure that non-coherent memory sees updated data. If that's true, then this will break it.
Why would it break it? The updated cachelines will be in memory and subsequent reads will be serviced from the cache instead from going to memory as it is not invalidated as it would be by CLFLUSH.
/me is puzzled.
Suppose you map some device memory WB, and the device then updates it non-coherently. If you want the CPU to see the update, you need clflush or clflushopt. Some architectures might do this in dma_sync_single_for_cpu with DMA_FROM_DEVICE.
I'm not sure that such a thing exists on x86.
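To make the distinction concrete, a rough sketch (the struct and helpers here are illustrative, not from any real driver):

	struct shared_buf {
		u32 status;	/* touched by both the CPU and a non-coherent device */
	};

	/* CPU -> device: writing the dirty line back is enough, and the line
	 * may stay resident in the cache afterwards, so clwb (or clflushopt)
	 * works here. */
	static void publish_to_device(struct shared_buf *buf)
	{
		buf->status = 1;
		clwb(&buf->status);
		wmb();
	}

	/* Device -> CPU: the CPU's stale cached copy has to be *invalidated*
	 * before the read, otherwise the read is serviced from the cache.
	 * clwb leaves the line valid, so it cannot replace clflush here. */
	static u32 read_from_device(struct shared_buf *buf)
	{
		clflush(&buf->status);
		mb();
		return buf->status;
	}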
--Andy
On Thu, Nov 13, 2014 at 08:38:23AM -0800, Andy Lutomirski wrote:
On Nov 13, 2014 3:20 AM, "Borislav Petkov" bp@alien8.de wrote:
On Wed, Nov 12, 2014 at 07:14:21PM -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
I don't know exactly what drm_clflush_virt_range (and the other functions you're modifying similarly) are for, but it seems plausible to me that they're used before reads to make sure that non-coherent memory sees updated data. If that's true, then this will break it.
Why would it break it? The updated cachelines will be in memory and subsequent reads will be serviced from the cache instead from going to memory as it is not invalidated as it would be by CLFLUSH.
/me is puzzled.
Suppose you map some device memory WB, and then the device non-coherently updates. If you want the CPU to see it, you need clflush or clflushopt. Some architectures might do this for dma_sync_single_for_cpu with DMA_FROM_DEVICE.
Ah, you're talking about the other way around - the device does the writes. Well, the usage sites are all in i915*, maybe we should ask them - it looks to me like this is only the CPU making stuff visible in the shared buffer but I don't know that code... intel-gfx CCed although dri-devel is already on CC.
On Thu, Nov 13, 2014 at 06:11:33PM +0100, Borislav Petkov wrote:
On Thu, Nov 13, 2014 at 08:38:23AM -0800, Andy Lutomirski wrote:
On Nov 13, 2014 3:20 AM, "Borislav Petkov" bp@alien8.de wrote:
On Wed, Nov 12, 2014 at 07:14:21PM -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
I don't know exactly what drm_clflush_virt_range (and the other functions you're modifying similarly) are for, but it seems plausible to me that they're used before reads to make sure that non-coherent memory sees updated data. If that's true, then this will break it.
Why would it break it? The updated cachelines will be in memory and subsequent reads will be serviced from the cache instead from going to memory as it is not invalidated as it would be by CLFLUSH.
/me is puzzled.
Suppose you map some device memory WB, and then the device non-coherently updates. If you want the CPU to see it, you need clflush or clflushopt. Some architectures might do this for dma_sync_single_for_cpu with DMA_FROM_DEVICE.
Ah, you're talking about the other way around - the device does the writes. Well, the usage sites are all in i915*, maybe we should ask them - it looks to me like this is only the CPU making stuff visible in the shared buffer but I don't know that code... intel-gfx CCed although dri-devel is already on CC.
We use it both ways in i915. So please don't break it.
On Thu, Nov 13, 2014 at 07:33:54PM +0200, Ville Syrjälä wrote:
We use it both ways in i915. So please don't break it.
Haha, we started from Intel with Ross' patch and made a full circle back. Maybe you guys should talk about it.
LOL. :-)
On Thu, Nov 13, 2014 at 06:47:34PM +0100, Borislav Petkov wrote:
On Thu, Nov 13, 2014 at 07:33:54PM +0200, Ville Syrjälä wrote:
We use it both ways in i915. So please don't break it.
Haha, we started from Intel with Ross' patch and made a full circle back. Maybe you guys should talk about it.
LOL. :-)
Indeed. The problem I see with these patches is that they don't actually tell what the new instruction does, so a casual glance doesn't really raise any red flags. Another excuse I can use is that I just got used to the fact that the x86 hasn't historically bothered separating invalidate and writeback and just does both. In my previous life on the dark^Warm side I did actually know the difference :)
But there's plenty of blame to go around to the other side of the fence too. We should have documented what we expect from these functions. Currently you just have to know/guess, and that's just not good enough.
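Even a short kernel-doc stating the contract would help; something like this (wording is only a rough suggestion):

	/**
	 * drm_clflush_virt_range - flush a virtual address range out of the CPU caches
	 * @addr: start of the range
	 * @length: length of the range in bytes
	 *
	 * Writes back *and invalidates* the cachelines covering the range, so it
	 * works both before a non-coherent device reads the memory and before the
	 * CPU reads data that a non-coherent device has written.  Callers rely on
	 * the invalidate half, so a writeback-only instruction (clwb) must not be
	 * substituted for the flush.
	 */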
On Thu, Nov 13, 2014 at 07:33:54PM +0200, Ville Syrjälä wrote:
On Thu, Nov 13, 2014 at 06:11:33PM +0100, Borislav Petkov wrote:
On Thu, Nov 13, 2014 at 08:38:23AM -0800, Andy Lutomirski wrote:
On Nov 13, 2014 3:20 AM, "Borislav Petkov" bp@alien8.de wrote:
On Wed, Nov 12, 2014 at 07:14:21PM -0800, Andy Lutomirski wrote:
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
If clwb is available on the system, use it in drm_clflush_virt_range. If clwb is not available, fall back to clflushopt if you can. If clflushopt is not supported, fall all the way back to clflush.
I don't know exactly what drm_clflush_virt_range (and the other functions you're modifying similarly) are for, but it seems plausible to me that they're used before reads to make sure that non-coherent memory sees updated data. If that's true, then this will break it.
Why would it break it? The updated cachelines will be in memory and subsequent reads will be serviced from the cache instead from going to memory as it is not invalidated as it would be by CLFLUSH.
/me is puzzled.
Suppose you map some device memory WB, and then the device non-coherently updates. If you want the CPU to see it, you need clflush or clflushopt. Some architectures might do this for dma_sync_single_for_cpu with DMA_FROM_DEVICE.
Ah, you're talking about the other way around - the device does the writes. Well, the usage sites are all in i915*, maybe we should ask them - it looks to me like this is only the CPU making stuff visible in the shared buffer but I don't know that code... intel-gfx CCed although dri-devel is already on CC.
We use it both ways in i915. So please don't break it.
Actually to clarify a bit, I think we shouldn't actually need the invalidate part for any modern platform with shared LLC. Apart from the display those tend to be fully coherent, so we only have to care about the display controller not seeing stale data in memory.
I have no idea which platforms support this instruction. If Baytrail/Braswell/Cherrytrail are on the list, then we have a problem. Otherwise we should probably be fine with it. But I'm not 100% sure about any future platforms that are still under wraps.
Also ttm seems to use some of this stuff. Not sure what they expect from it.