Folks!
While working on various preempt count related things, I stumbled (again) over the inconsistency of our preempt count handling.
The handling of preempt_count() is inconsistent across kernel configurations. On kernels with PREEMPT_COUNT=n, preempt_disable/enable() and the lock/unlock functions do not affect the preempt count; only local_bh_disable/enable() and the _bh variants of locking, soft interrupt delivery, hard interrupt and NMI context affect it.
It's therefore impossible to have a consistent set of checks which provide information about the context in which a function is called. In many cases it makes sense to have separate functions for separate contexts, but there are valid reasons to avoid that and handle different calling contexts conditionally.
The lack of such indicators which work on all kernel configurations is a constant source of trouble because developers either do not understand the implications or try to work around this inconsistency in weird ways. Nor do these issues seem to be caught by reviewers and testing.
Recently merged code does:
gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;
Looks obviously correct, except for the fact that preemptible() is unconditionally false for CONFIG_PREEMPT_COUNT=n, i.e. all allocations in that code use GFP_ATOMIC on such kernels.
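For reference, with CONFIG_PREEMPT_COUNT=n the relevant macros in include/linux/preempt.h are stubbed out (the very stubs removed later in this series), so the check above can never select GFP_KERNEL:

  /* CONFIG_PREEMPT_COUNT=y */
  #define preemptible()	(preempt_count() == 0 && !irqs_disabled())

  /* CONFIG_PREEMPT_COUNT=n */
  #define preempt_disable()	barrier()
  #define preempt_enable()	barrier()
  #define preemptible()		0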
Attempts to make preempt count unconditional and consistent have been rejected in the past with handwaving performance arguments.
Freshly conducted benchmarks did not reveal any measurable impact from enabling preempt count unconditionally. On kernels with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY the preempt count is only incremented and decremented, but the result of the decrement is not tested. Contrary to that, enabling CONFIG_PREEMPT, which tests the result, has a small but measurable impact due to the conditional branch/call.
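Roughly paraphrased from include/linux/preempt.h, the difference boils down to:

  /* CONFIG_PREEMPT_COUNT=y, but no full preemption: plain counting */
  #define preempt_enable() \
  do { \
	barrier(); \
	preempt_count_dec(); \
  } while (0)

  /* CONFIG_PREEMPT(ION): the result of the decrement is tested */
  #define preempt_enable() \
  do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
  } while (0)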
It's about time to make essential functionality of the kernel consistent across the various preemption models.
The series is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git preempt
That's the first part of a larger effort related to preempt count:
1) The analysis of the usage sites of in_interrupt(), in_atomic(), in_softirq() is still ongoing, but so far buggy users clearly make up the vast majority. There will be separate patch series (currently 46 and counting) to address these issues once the analysis is complete in the next days.

2) The long discussed tracking of the local irq disabled state in the preempt count, which accounts interrupt disabled sections as atomic and avoids issuing costly instructions (sti, cli, popf or their non-X86 counterparts) when the state does not change, i.e. for nested irq_save() or irq_restore(). I have this working on X86 already, and contrary to my earlier attempts this was reasonably straightforward due to the recent entry/exit code consolidation. A purely hypothetical sketch of the idea follows after this list.

What I've not done yet is to optimize the preempt count handling of the [un]lock_irq* operations so they handle the interrupt disabled state and the preempt count modification in one go. That's an obvious add-on, but correctness first ...

3) Lazy interrupt disabling as a straightforward extension of #2. This avoids the actual disabling at the CPU level completely, catches an incoming interrupt in the low level entry code, modifies the interrupt disabled state on the return stack, notes the interrupt as pending in software and raises it again when interrupts are reenabled. This still has a few issues which I'm hunting down (cpuidle is unhappy ...)
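For illustration only, a hypothetical sketch of the nesting idea in #2 - the names (HARDIRQ_DIS_OFFSET/MASK, sketch_*) are made up and this is not the actual implementation:

  /* Hypothetical sketch: track "irqs disabled" as an atomic section in preempt_count */
  static inline unsigned long sketch_irq_save(void)
  {
	unsigned long nested = preempt_count() & HARDIRQ_DIS_MASK;

	if (!nested) {
		raw_local_irq_disable();		/* cli only on the outermost level */
		preempt_count_add(HARDIRQ_DIS_OFFSET);
	}
	return nested;
  }

  static inline void sketch_irq_restore(unsigned long nested)
  {
	if (!nested) {
		preempt_count_sub(HARDIRQ_DIS_OFFSET);
		raw_local_irq_enable();			/* sti only on the outermost level */
	}
  }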
Thanks,
tglx
---
 arch/arm/include/asm/assembler.h                                 |   11 --
 arch/arm/kernel/iwmmxt.S                                         |    2
 arch/arm/mach-ep93xx/crunch-bits.S                               |    2
 arch/xtensa/kernel/entry.S                                       |    2
 drivers/gpu/drm/i915/Kconfig.debug                               |    1
 drivers/gpu/drm/i915/i915_utils.h                                |    3
 include/linux/bit_spinlock.h                                     |    4 -
 include/linux/lockdep.h                                          |    6 -
 include/linux/pagemap.h                                          |    4 -
 include/linux/preempt.h                                          |   37 +---------
 include/linux/uaccess.h                                          |    6 -
 kernel/Kconfig.preempt                                           |    4 -
 kernel/sched/core.c                                              |    6 -
 lib/Kconfig.debug                                                |    3
 lib/Kconfig.debug.rej                                            |   14 +--
 tools/testing/selftests/rcutorture/configs/rcu/SRCU-t            |    1
 tools/testing/selftests/rcutorture/configs/rcu/SRCU-u            |    1
 tools/testing/selftests/rcutorture/configs/rcu/TINY01            |    1
 tools/testing/selftests/rcutorture/doc/TINY_RCU.txt              |    5 -
 tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt      |    1
 tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h |    1
 21 files changed, 23 insertions(+), 92 deletions(-)
ARCH_NO_PREEMPT disables the selection of CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT, but architectures which set this config option still support preempt count for hard and softirq accounting.
There is absolutely no reason to prevent lockdep from using the preempt counter, nor is there a reason to prevent the enablement of CONFIG_DEBUG_ATOMIC_SLEEP on such architectures.
Remove the dependencies, which affects ALPHA, HEXAGON, M68K and UM.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Richard Henderson rth@twiddle.net Cc: Ivan Kokshaysky ink@jurassic.park.msu.ru Cc: Matt Turner mattst88@gmail.com Cc: linux-alpha@vger.kernel.org Cc: Jeff Dike jdike@addtoit.com Cc: Richard Weinberger richard@nod.at Cc: Anton Ivanov anton.ivanov@cambridgegreys.com Cc: linux-um@lists.infradead.org Cc: Brian Cain bcain@codeaurora.org Cc: linux-hexagon@vger.kernel.org Cc: Geert Uytterhoeven geert@linux-m68k.org Cc: linux-m68k@lists.linux-m68k.org --- lib/Kconfig.debug | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1161,7 +1161,7 @@ config PROVE_LOCKING select DEBUG_RWSEMS select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC - select PREEMPT_COUNT if !ARCH_NO_PREEMPT + select PREEMPT_COUNT select TRACE_IRQFLAGS default n help @@ -1323,7 +1323,6 @@ config DEBUG_ATOMIC_SLEEP bool "Sleep inside atomic section checking" select PREEMPT_COUNT depends on DEBUG_KERNEL - depends on !ARCH_NO_PREEMPT help If you say Y here, various routines which may sleep will become very noisy if they are called inside atomic sections: when a spinlock is
The handling of preempt_count() is inconsistent across kernel configurations. On kernels with PREEMPT_COUNT=n, preempt_disable/enable() and the lock/unlock functions do not affect the preempt count; only local_bh_disable/enable() and the _bh variants of locking, soft interrupt delivery, hard interrupt and NMI context affect it.
It's therefore impossible to have a consistent set of checks which provide information about the context in which a function is called. In many cases it makes sense to have separate functions for separate contexts, but there are valid reasons to avoid that and handle different calling contexts conditionally.
The lack of such indicators which work on all kernel configurations is a constant source of trouble because developers either do not understand the implications or try to work around this inconsistency in weird ways. Nor do these issues seem to be caught by reviewers and testing.
Recently merged code does:
gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;
Looks obviously correct, except for the fact that preemptible() is unconditionally false for CONFIG_PREEMPT_COUNT=n, i.e. all allocations in that code use GFP_ATOMIC on such kernels.
Attempts to make preempt count unconditional and consistent have been rejected in the past with handwaving performance arguments.
Freshly conducted benchmarks did not reveal any measurable impact from enabling preempt count unconditionally. On kernels with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY the preempt count is only incremented and decremented, but the result of the decrement is not tested. Contrary to that, enabling CONFIG_PREEMPT, which tests the result, has a small but measurable impact due to the conditional branch/call.
It's about time to make essential functionality of the kernel consistent across the various preemption models.
Enable CONFIG_PREEMPT_COUNT unconditionally. Follow up changes will remove the #ifdeffery and remove the config option at the end.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- kernel/Kconfig.preempt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -75,8 +75,7 @@ config PREEMPT_RT endchoice
config PREEMPT_COUNT - bool + def_bool y
config PREEMPTION bool - select PREEMPT_COUNT
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Juri Lelli juri.lelli@redhat.com Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Dietmar Eggemann dietmar.eggemann@arm.com Cc: Steven Rostedt rostedt@goodmis.org Cc: Ben Segall bsegall@google.com Cc: Mel Gorman mgorman@suse.de Cc: Daniel Bristot de Oliveira bristot@redhat.com --- include/linux/preempt.h | 37 ++++--------------------------------- 1 file changed, 4 insertions(+), 33 deletions(-)
--- a/include/linux/preempt.h +++ b/include/linux/preempt.h @@ -56,8 +56,7 @@ #define PREEMPT_DISABLED (PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED)
/* - * Disable preemption until the scheduler is running -- use an unconditional - * value so that it also works on !PREEMPT_COUNT kernels. + * Disable preemption until the scheduler is running. * * Reset by start_kernel()->sched_init()->init_idle()->init_idle_preempt_count(). */ @@ -69,7 +68,6 @@ * * preempt_count() == 2*PREEMPT_DISABLE_OFFSET * - * Note: PREEMPT_DISABLE_OFFSET is 0 for !PREEMPT_COUNT kernels. * Note: See finish_task_switch(). */ #define FORK_PREEMPT_COUNT (2*PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED) @@ -106,11 +104,7 @@ /* * The preempt_count offset after preempt_disable(); */ -#if defined(CONFIG_PREEMPT_COUNT) -# define PREEMPT_DISABLE_OFFSET PREEMPT_OFFSET -#else -# define PREEMPT_DISABLE_OFFSET 0 -#endif +#define PREEMPT_DISABLE_OFFSET PREEMPT_OFFSET
/* * The preempt_count offset after spin_lock() @@ -122,8 +116,8 @@ * * spin_lock_bh() * - * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and - * softirqs, such that unlock sequences of: + * Which need to disable both preemption and softirqs, such that unlock + * sequences of: * * spin_unlock(); * local_bh_enable(); @@ -164,8 +158,6 @@ extern void preempt_count_sub(int val); #define preempt_count_inc() preempt_count_add(1) #define preempt_count_dec() preempt_count_sub(1)
-#ifdef CONFIG_PREEMPT_COUNT - #define preempt_disable() \ do { \ preempt_count_inc(); \ @@ -231,27 +223,6 @@ do { \ __preempt_count_dec(); \ } while (0)
-#else /* !CONFIG_PREEMPT_COUNT */ - -/* - * Even if we don't have any preemption, we need preempt disable/enable - * to be barriers, so that we don't have things like get_user/put_user - * that can cause faults and scheduling migrate into our preempt-protected - * region. - */ -#define preempt_disable() barrier() -#define sched_preempt_enable_no_resched() barrier() -#define preempt_enable_no_resched() barrier() -#define preempt_enable() barrier() -#define preempt_check_resched() do { } while (0) - -#define preempt_disable_notrace() barrier() -#define preempt_enable_no_resched_notrace() barrier() -#define preempt_enable_notrace() barrier() -#define preemptible() 0 - -#endif /* CONFIG_PREEMPT_COUNT */ - #ifdef MODULE /* * Modules have no business playing preemption tricks.
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Peter Zijlstra peterz@infradead.org Cc: Ingo Molnar mingo@kernel.org Cc: Will Deacon will@kernel.org --- include/linux/lockdep.h | 6 ++---- lib/Kconfig.debug | 1 - 2 files changed, 2 insertions(+), 5 deletions(-)
--- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -585,16 +585,14 @@ do { \
#define lockdep_assert_preemption_enabled() \ do { \ - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ - debug_locks && \ + WARN_ON_ONCE(debug_locks && \ (preempt_count() != 0 || \ !raw_cpu_read(hardirqs_enabled))); \ } while (0)
#define lockdep_assert_preemption_disabled() \ do { \ - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ - debug_locks && \ + WARN_ON_ONCE(debug_locks && \ (preempt_count() == 0 && \ raw_cpu_read(hardirqs_enabled))); \ } while (0) --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1161,7 +1161,6 @@ config PROVE_LOCKING select DEBUG_RWSEMS select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC - select PREEMPT_COUNT select TRACE_IRQFLAGS default n help
On Mon, Sep 14, 2020 at 10:42:13PM +0200, Thomas Gleixner wrote:
Acked-by: Will Deacon will@kernel.org
Will
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Andrew Morton akpm@linux-foundation.org Cc: linux-mm@kvack.org --- include/linux/pagemap.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
--- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -168,9 +168,7 @@ void release_pages(struct page **pages, static inline int __page_cache_add_speculative(struct page *page, int count) { #ifdef CONFIG_TINY_RCU -# ifdef CONFIG_PREEMPT_COUNT - VM_BUG_ON(!in_atomic() && !irqs_disabled()); -# endif + VM_BUG_ON(preemptible()) /* * Preempt must be disabled here - we rely on rcu_read_lock doing * this for us.
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- include/linux/bit_spinlock.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
--- a/include/linux/bit_spinlock.h +++ b/include/linux/bit_spinlock.h @@ -90,10 +90,8 @@ static inline int bit_spin_is_locked(int { #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) return test_bit(bitnum, addr); -#elif defined CONFIG_PREEMPT_COUNT - return preempt_count(); #else - return 1; + return preempt_count(); #endif }
On Mon, Sep 14, 2020 at 10:42:15PM +0200, Thomas Gleixner wrote:
Acked-by: Will Deacon will@kernel.org
Will
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- include/linux/uaccess.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
--- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -230,9 +230,9 @@ static inline bool pagefault_disabled(vo * * This function should only be used by the fault handlers. Other users should * stick to pagefault_disabled(). - * Please NEVER use preempt_disable() to disable the fault handler. With - * !CONFIG_PREEMPT_COUNT, this is like a NOP. So the handler won't be disabled. - * in_atomic() will report different values based on !CONFIG_PREEMPT_COUNT. + * + * Please NEVER use preempt_disable() or local_irq_disable() to disable the + * fault handler. */ #define faulthandler_disabled() (pagefault_disabled() || in_atomic())
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Juri Lelli juri.lelli@redhat.com Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Dietmar Eggemann dietmar.eggemann@arm.com Cc: Steven Rostedt rostedt@goodmis.org Cc: Ben Segall bsegall@google.com Cc: Mel Gorman mgorman@suse.de Cc: Daniel Bristot de Oliveira bristot@redhat.com --- kernel/sched/core.c | 6 +----- lib/Kconfig.debug | 1 - 2 files changed, 1 insertion(+), 6 deletions(-)
--- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3706,8 +3706,7 @@ asmlinkage __visible void schedule_tail( * finish_task_switch() for details. * * finish_task_switch() will drop rq->lock() and lower preempt_count - * and the preempt_enable() will end up enabling preemption (on - * PREEMPT_COUNT kernels). + * and the preempt_enable() will end up enabling preemption. */
rq = finish_task_switch(prev); @@ -7311,9 +7310,6 @@ void __cant_sleep(const char *file, int if (irqs_disabled()) return;
- if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) - return; - if (preempt_count() > preempt_offset) return;
--- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1320,7 +1320,6 @@ config DEBUG_LOCKDEP
config DEBUG_ATOMIC_SLEEP bool "Sleep inside atomic section checking" - select PREEMPT_COUNT depends on DEBUG_KERNEL help If you say Y here, various routines which may sleep will become very
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Russell King linux@armlinux.org.uk Cc: linux-arm-kernel@lists.infradead.org --- arch/arm/include/asm/assembler.h | 11 ----------- arch/arm/kernel/iwmmxt.S | 2 -- arch/arm/mach-ep93xx/crunch-bits.S | 2 -- 3 files changed, 15 deletions(-)
--- a/arch/arm/include/asm/assembler.h +++ b/arch/arm/include/asm/assembler.h @@ -212,7 +212,6 @@ /* * Increment/decrement the preempt count. */ -#ifdef CONFIG_PREEMPT_COUNT .macro inc_preempt_count, ti, tmp ldr \tmp, [\ti, #TI_PREEMPT] @ get preempt count add \tmp, \tmp, #1 @ increment it @@ -229,16 +228,6 @@ get_thread_info \ti dec_preempt_count \ti, \tmp .endm -#else - .macro inc_preempt_count, ti, tmp - .endm - - .macro dec_preempt_count, ti, tmp - .endm - - .macro dec_preempt_count_ti, ti, tmp - .endm -#endif
#define USERL(l, x...) \ 9999: x; \ --- a/arch/arm/kernel/iwmmxt.S +++ b/arch/arm/kernel/iwmmxt.S @@ -94,9 +94,7 @@ ENTRY(iwmmxt_task_enable) mov r2, r2 @ cpwait bl concan_save
-#ifdef CONFIG_PREEMPT_COUNT get_thread_info r10 -#endif 4: dec_preempt_count r10, r3 ret r9 @ normal exit from exception
--- a/arch/arm/mach-ep93xx/crunch-bits.S +++ b/arch/arm/mach-ep93xx/crunch-bits.S @@ -191,9 +191,7 @@ ENTRY(crunch_task_enable) cfldr64 mvdx15, [r0, #CRUNCH_MVDX15]
1: -#ifdef CONFIG_PREEMPT_COUNT get_thread_info r10 -#endif 2: dec_preempt_count r10, r3 ret lr
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Chris Zankel chris@zankel.net Cc: Max Filippov jcmvbkbc@gmail.com Cc: linux-xtensa@linux-xtensa.org --- arch/xtensa/kernel/entry.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/xtensa/kernel/entry.S +++ b/arch/xtensa/kernel/entry.S @@ -819,7 +819,7 @@ ENTRY(debug_exception) * preemption if we have HW breakpoints to preserve DEBUGCAUSE.DBNUM * meaning. */ -#if defined(CONFIG_PREEMPT_COUNT) && defined(CONFIG_HAVE_HW_BREAKPOINT) +#ifdef CONFIG_HAVE_HW_BREAKPOINT GET_THREAD_INFO(a2, a1) l32i a3, a2, TI_PRE_COUNT addi a3, a3, 1
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Jani Nikula jani.nikula@linux.intel.com Cc: Joonas Lahtinen joonas.lahtinen@linux.intel.com Cc: Rodrigo Vivi rodrigo.vivi@intel.com Cc: David Airlie airlied@linux.ie Cc: Daniel Vetter daniel@ffwll.ch Cc: intel-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org --- drivers/gpu/drm/i915/Kconfig.debug | 1 - drivers/gpu/drm/i915/i915_utils.h | 3 +-- 2 files changed, 1 insertion(+), 3 deletions(-)
--- a/drivers/gpu/drm/i915/Kconfig.debug +++ b/drivers/gpu/drm/i915/Kconfig.debug @@ -20,7 +20,6 @@ config DRM_I915_DEBUG bool "Enable additional driver debugging" depends on DRM_I915 select DEBUG_FS - select PREEMPT_COUNT select I2C_CHARDEV select STACKDEPOT select DRM_DP_AUX_CHARDEV --- a/drivers/gpu/drm/i915/i915_utils.h +++ b/drivers/gpu/drm/i915/i915_utils.h @@ -337,8 +337,7 @@ wait_remaining_ms_from_jiffies(unsigned (Wmax)) #define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000)
-/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */ -#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT) +#ifdef CONFIG_DRM_I915_DEBUG # define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) WARN_ON_ONCE((ATOMIC) && !in_atomic()) #else # define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) do { } while (0)
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Clean up the leftovers before doing so.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Josh Triplett josh@joshtriplett.org Cc: Steven Rostedt rostedt@goodmis.org Cc: Mathieu Desnoyers mathieu.desnoyers@efficios.com Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Shuah Khan shuah@kernel.org Cc: rcu@vger.kernel.org Cc: linux-kselftest@vger.kernel.org --- tools/testing/selftests/rcutorture/configs/rcu/SRCU-t | 1 - tools/testing/selftests/rcutorture/configs/rcu/SRCU-u | 1 - tools/testing/selftests/rcutorture/configs/rcu/TINY01 | 1 - tools/testing/selftests/rcutorture/doc/TINY_RCU.txt | 5 ++--- tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 1 - tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h | 1 - 6 files changed, 2 insertions(+), 8 deletions(-)
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t +++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t @@ -7,4 +7,3 @@ CONFIG_RCU_TRACE=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_DEBUG_ATOMIC_SLEEP=y -#CHECK#CONFIG_PREEMPT_COUNT=y --- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u +++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u @@ -7,4 +7,3 @@ CONFIG_RCU_TRACE=n CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_OBJECTS_RCU_HEAD=n -CONFIG_PREEMPT_COUNT=n --- a/tools/testing/selftests/rcutorture/configs/rcu/TINY01 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY01 @@ -10,4 +10,3 @@ CONFIG_RCU_TRACE=n #CHECK#CONFIG_RCU_STALL_COMMON=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n -CONFIG_PREEMPT_COUNT=n --- a/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt +++ b/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt @@ -3,11 +3,10 @@ This document gives a brief rationale fo
Kconfig Parameters:
-CONFIG_DEBUG_LOCK_ALLOC -- Do all three and none of the three. -CONFIG_PREEMPT_COUNT +CONFIG_DEBUG_LOCK_ALLOC -- Do both and none of the two. CONFIG_RCU_TRACE
-The theory here is that randconfig testing will hit the other six possible +The theory here is that randconfig testing will hit the other two possible combinations of these parameters.
--- a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt +++ b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt @@ -43,7 +43,6 @@ CONFIG_64BIT
Used only to check CONFIG_RCU_FANOUT value, inspection suffices.
-CONFIG_PREEMPT_COUNT CONFIG_PREEMPT_RCU
Redundant with CONFIG_PREEMPT, ignore. --- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h +++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h @@ -8,7 +8,6 @@ #undef CONFIG_HOTPLUG_CPU #undef CONFIG_MODULES #undef CONFIG_NO_HZ_FULL_SYSIDLE -#undef CONFIG_PREEMPT_COUNT #undef CONFIG_PREEMPT_RCU #undef CONFIG_PROVE_RCU #undef CONFIG_RCU_NOCB_CPU
All conditionals and irritations are gone.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- kernel/Kconfig.preempt | 3 --- 1 file changed, 3 deletions(-)
--- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -74,8 +74,5 @@ config PREEMPT_RT
endchoice
-config PREEMPT_COUNT - def_bool y - config PREEMPTION bool
On Mon, 14 Sep 2020 22:42:09 +0200 Thomas Gleixner tglx@linutronix.de wrote:
21 files changed, 23 insertions(+), 92 deletions(-)
This alone makes it look promising, and hopefully acceptable by Linus :-)
-- Steve
On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner tglx@linutronix.de wrote:
I don't think this is a good reason to entirely get rid of the no-preempt thing.
The above is just garbage. It's bogus. You can't do it.
Blaming the no-preempt code for this bug is extremely unfair, imho.
And the no-preempt code does help make for much better code generation for simple spinlocks.
Where is that horribly buggy recent code? It's not in that exact format, certainly, since 'grep' doesn't find it.
Linus
On Mon, Sep 14 2020 at 13:59, Linus Torvalds wrote:
I did not say that this is a good reason. It just illustrates the problem.
The above is just garbage. It's bogus. You can't do it.
Blaming the no-preempt code for this bug is extremely unfair, imho.
I'm not blaming the no-preempt code. I'm blaming inconsistency and there is no real good argument for inconsistent behaviour, TBH.
And the no-preempt code does help make for much better code generation for simple spinlocks.
Yes, it does generate better code, but I tried hard to spot a difference in various metrics exposed by perf. It's all in the noise, and I can only spot a difference when the actual preemption check after the decrement, which still depends on CONFIG_PREEMPT, is in place. That's not the case for PREEMPT_NONE or PREEMPT_VOLUNTARY kernels, where the decrement is just a decrement w/o any conditional behind it.
Where is that horribly buggy recent code? It's not in that exact format, certainly, since 'grep' doesn't find it.
Bah, that was stuff in next which got dropped again.
But just look at any check which uses preemptible(), especially those which check !preemptible():
In the X86 #GP handler we have:
	/*
	 * To be potentially processing a kprobe fault and to trust the result
	 * from kprobe_running(), we have to be non-preemptible.
	 */
	if (!preemptible() && kprobe_running() &&
	    kprobe_fault_handler(regs, X86_TRAP_GP))
		goto exit;
and a similar check in the S390 code in kprobe_exceptions_notify(). That all magically 'works' because that code might actually have been tested with lockdep enabled, which enforces PREEMPT_COUNT...
The SG code has some interesting usage as well:
	if (miter->__flags & SG_MITER_ATOMIC) {
		WARN_ON_ONCE(preemptible());
		kunmap_atomic(miter->addr);
How is that WARN_ON_ONCE() supposed to catch anything? Especially as the calling code does:
	flags = SG_MITER_TO_SG;
	if (!preemptible())
		flags |= SG_MITER_ATOMIC;
which is equally useless on kernels which have PREEMPT_COUNT=n.
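Spelled out for PREEMPT_COUNT=n, where preemptible() is hard-wired to 0, both sides collapse:

	flags = SG_MITER_TO_SG;
	if (!0)				/* !preemptible() is always true ... */
		flags |= SG_MITER_ATOMIC;	/* ... so ATOMIC is always set */

	WARN_ON_ONCE(0);		/* ... and the warning can never fire */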
There are bugs which are related to in_atomic() or other in_***() usage all over the place as well.
Inconsistency at the core level is a clear recipe for disaster, and at some point we have to bite the bullet and accept that consistency is more important than those non-measurable 3 cycles.
Thanks,
tglx
On Mon, Sep 14, 2020 at 2:55 PM Thomas Gleixner tglx@linutronix.de wrote:
I'm somewhat more worried about the small-device case.
That said, the diffstat certainly has its very clear charm, and I do agree that it makes things simpler.
I'm just not convinced people should ever EVER do things like that "if (preemptible())" garbage. It sounds like somebody is doing seriously bad things.
The chacha20poly1305 code does look fundamentally broken. But no, the fix is not to make "preemptible" work with spinlocks, the fix is to not *do* that kind of broken things.
Those things would be broken even if you changed the semantics of preemptible. There's no way that it's valid to say "use this debug flag to decide if we should do atomic allocations or not".
It smells like "I got a debug failure, so I'm papering it over by checking the thing the debug code checks for".
The debug check is to catch the obvious bad cases. It's not the _only_ bad cases, so copying the debug check test is just completely wrong.
Ard and Herbert added to participants: see chacha20poly1305_crypt_sg_inplace(), which does
	flags = SG_MITER_TO_SG;
	if (!preemptible())
		flags |= SG_MITER_ATOMIC;
introduced in commit d95312a3ccc0 ("crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine").
You *fundamentally* cannot do that. Seriously. It's completely wrong. Pick one or the other, or have the caller *tell* you what the context is.
Linus
On Mon, Sep 14, 2020 at 3:24 PM Linus Torvalds torvalds@linux-foundation.org wrote:
As far as I can tell, the only reason for this all is to try to use "kmap()" rather than "kmap_atomic()".
And kmap() actually has the much more complex "might_sleep()" tests, and apparently the "preemptible()" check wasn't even the proper full debug check, it was just a complete hack to catch the one that triggered.
From a quick look, that code should probably just get rid of
SG_MITER_ATOMIC entirely, and always use kmap_atomic().
kmap_atomic() is actually the faster and proper interface to use anyway (never mind that any of this matters on any sane hardware). The old kmap() and kunmap() interfaces should generally be avoided like the plague - yes, they allow sleeping in the middle and that is sometimes required, but if you don't need that, you should never ever use them.
We used to have a very nasty kmap_atomic() that required people to be very careful and know exactly which atomic entry to use, and that was admittedly quite nasty.
So it _looks_ like this code started using kmap() - probably back when kmap_atomic() was so cumbersome to use - and was then converted (conditionally) to kmap_atomic() rather than just changed wholesale. Is there actually something that wants to use those sg_miter functions and sleep?
Because if there is, that choice should come from the outside, not from inside lib/scatterlist.c trying to make some bad guess based on the wrong thing entirely.
Linus
On Mon, Sep 14, 2020 at 03:37:49PM -0700, Linus Torvalds wrote:
I dug up the old zinc patch submissions and this wasn't present at all in the original. The original zinc code used blkcipher_walk, which unconditionally does kmap_atomic.
So it's only the SG miter conversion that introduced this change, which appears to be a simple oversight (I think Ard was working on latency issues at that time, perhaps he was worried about keeping preemption off unnecessarily).
---8<--- There is no reason for the chacha20poly1305 SG miter code to use kmap instead of kmap_atomic as the critical section doesn't sleep anyway. So we can simply get rid of the preemptible check and set SG_MITER_ATOMIC unconditionally.
Even if we need to reenable preemption to lower latency we should be doing that by interrupting the SG miter walk rather than using kmap.
Reported-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Herbert Xu herbert@gondor.apana.org.au
diff --git a/lib/crypto/chacha20poly1305.c b/lib/crypto/chacha20poly1305.c
index 431e04280332..5850f3b87359 100644
--- a/lib/crypto/chacha20poly1305.c
+++ b/lib/crypto/chacha20poly1305.c
@@ -251,9 +251,7 @@ bool chacha20poly1305_crypt_sg_inplace(struct scatterlist *src,
 		poly1305_update(&poly1305_state, pad0, 0x10 - (ad_len & 0xf));
 	}

-	flags = SG_MITER_TO_SG;
-	if (!preemptible())
-		flags |= SG_MITER_ATOMIC;
+	flags = SG_MITER_TO_SG | SG_MITER_ATOMIC;
sg_miter_start(&miter, src, sg_nents(src), flags);
On Tue, 15 Sep 2020 at 01:43, Linus Torvalds torvalds@linux-foundation.org wrote:
This was not driven by a failing check.
The documentation of kmap_atomic() states the following:
* The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap * gives a more generic (and caching) interface. But kmap_atomic can * be used in IRQ contexts, so in some (very limited) cases we need * it.
so if this is no longer accurate, perhaps we should fix it?
But another reason I tried to avoid kmap_atomic() is that it disables preemption unconditionally, even on 64-bit architectures where HIGHMEM is irrelevant. So using kmap_atomic() here means that the bulk of WireGuard packet encryption runs with preemption disabled, essentially for legacy reasons.
On Tue, Sep 15, 2020 at 09:20:59AM +0300, Ard Biesheuvel wrote:
This hasn't been accurate for at least ten years :)
Agreed. We should definitely fix that.
Thanks,
On Mon, Sep 14, 2020 at 11:24 PM Herbert Xu herbert@gondor.apana.org.au wrote:
Yeah, that used to be true a long long time ago, but the comment is very stale.
Well, honestly, one big reason for that is debugging.
The *semantics* of the kmap_atomic() is in the name - you can't sleep in between it and the kunmap_atomic().
On any sane architecture, kmap_atomic() ends up being a no-op from an implementation standpoint, and sleeping would work just fine.
But we very much want to make sure that people don't then write code that doesn't work on the bad old 32-bit machines where it really needs that sequence to be safe from preemption.
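The sequence in question is just the classic pattern (schematic, placeholder names):

	void *addr = kmap_atomic(page);

	memcpy(dst, addr + offset, len);	/* no sleeping between map and unmap */
	kunmap_atomic(addr);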
So it's mostly a debug thing.
I say "mostly", because there might be small other details too, like shared code, and perhaps even a couple of users out in the wild that depend on the pagefault_disable() inherent in the current kmap_atomic(), who knows..
So no, the preemption disabling isn't inherent in the operation itself. But it does have some argument for it.
Linus
On Mon, Sep 14, 2020 at 11:55:24PM +0200, Thomas Gleixner wrote:
But just look at any check which uses preemptible(), especially those which check !preemptible():
hmm.
+++ b/include/linux/preempt.h
@@ -180,7 +180,9 @@ do { \

 #define preempt_enable_no_resched() sched_preempt_enable_no_resched()

+#ifndef MODULE
 #define preemptible()	(preempt_count() == 0 && !irqs_disabled())
+#endif

 #ifdef CONFIG_PREEMPTION
 #define preempt_enable() \
$ git grep -w preemptible drivers
(slightly trimmed by hand to remove, eg, comments)
drivers/firmware/arm_sdei.c:	WARN_ON_ONCE(preemptible());
drivers/firmware/arm_sdei.c:	WARN_ON_ONCE(preemptible());
drivers/firmware/arm_sdei.c:	WARN_ON_ONCE(preemptible());
drivers/firmware/arm_sdei.c:	WARN_ON_ONCE(preemptible());
drivers/firmware/arm_sdei.c:	WARN_ON(preemptible());
drivers/firmware/efi/efi-pstore.c:	preemptible(), record->size, record->psi->buf);
drivers/irqchip/irq-gic-v4.c:	WARN_ON(preemptible());
drivers/irqchip/irq-gic-v4.c:	WARN_ON(preemptible());
drivers/scsi/hisi_sas/hisi_sas_main.c:	if (!preemptible())
drivers/xen/time.c:	BUG_ON(preemptible());
That only looks like two drivers that need more than WARNectomies.
Although maybe rcu_read_load_sched_held() or rcu_read_lock_any_held() might get called from a module ...
On Wed, Sep 16, 2020 at 08:23:52PM +0100, Matthew Wilcox wrote:
I could easily imagine someone thinking that these did something in CONFIG_PREEMPT_NONE=y kernels. In fact, I could easily imagine myself making that mistake. :-/
Although maybe rcu_read_load_sched_held() or rcu_read_lock_any_held() might get called from a module ...
But yes, from the rcutorture module for certain and also from any other RCU-using module that includes the usual RCU debug checks.
Thanx, Paul
On Mon, Sep 14, 2020 at 01:59:15PM -0700, Linus Torvalds wrote:
It would be convenient for that "gfp =" code to work, as this would allow better cache locality while invoking RCU callbacks, and would further provide better robustness to callback floods. The full story is quite long, but here are some alternatives that have not yet been proven to be abject failures:
1. Use workqueues to do the allocations in a clean context. While waiting for the allocations, the callbacks are queued in the old cache-busting manner. This functions correctly, but in the meantime (which on busy systems can be some time) the cache locality and robustness are lost.
2. Provide the ability to allocate memory in raw atomic context. This is extremely effective, especially when used in combination with #1 above, but as you might suspect, the MM guys don't like it much.
In contrast, with Thomas's patch series, call_rcu() and kvfree_rcu() could just look at preemptible() to see whether or not it was safe to allocate memory, even in !PREEMPT kernels -- and in the common case, it almost always would be safe. It is quite possible that this approach would work in isolation, or failing that, that adding #1 above would do the trick.
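For example, here is a purely hypothetical sketch of the kind of check this would enable (names made up, not actual RCU code):

	/* Hypothetical sketch, not kernel/rcu code */
	gfp_t gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC;
	struct kfree_bulk_node *bnode;

	bnode = kmalloc(sizeof(*bnode), gfp | __GFP_NOWARN);
	if (!bnode)
		queue_callback_the_old_way(head);	/* cache-busting fallback */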
I understand that this is all very hand-wavy, and I do apologize for that. If you really want the full sad story with performance numbers and the works, let me know!
Thanx, Paul