First of all, sorry for the horribly big Cc list!
Following up to the discussion in:
https://lore.kernel.org/r/20200914204209.256266093@linutronix.de
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
- Consolidating all kmap atomic implementations in generic code
- Switching from per CPU storage of the kmap index to a per task storage
- Adding a pteval array to the per task storage which contains the ptevals of the currently active temporary kmaps
- Adding context switch code which checks whether the outgoing or the incoming task has active temporary kmaps. If so, the outgoing task's kmaps are removed and the incoming task's kmaps are restored.
- Adding new interfaces k[un]map_temporary*() which are not disabling preemption and can be called from any context (except NMI).
Contrary to kmap() which provides preemptible and "persistant" mappings, these interfaces are meant to replace the temporary mappings provided by kmap_atomic*() today.
This allows to get rid of conditional mapping choices and allows to have preemptible short term mappings on 64bit which are today enforced to be non-preemptible due to the highmem constraints. It clearly puts overhead on the highmem users, but highmem is slow anyway.
This is not a wholesale conversion which makes kmap_atomic magically preemptible because there might be usage sites which rely on the implicit preempt disable. So this needs to be done on a case by case basis and the call sites converted to kmap_temporary.
Note, that this is only lightly tested on X86 and completely untested on all other architectures.
The lot is also available from
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem
Thanks,
tglx --- a/arch/arm/mm/highmem.c | 121 --------------------- a/arch/microblaze/mm/highmem.c | 78 ------------- a/arch/nds32/mm/highmem.c | 48 -------- a/arch/powerpc/mm/highmem.c | 67 ----------- a/arch/sparc/mm/highmem.c | 115 -------------------- arch/arc/Kconfig | 1 arch/arc/include/asm/highmem.h | 8 + arch/arc/mm/highmem.c | 44 ------- arch/arm/Kconfig | 1 arch/arm/include/asm/highmem.h | 30 +++-- arch/arm/mm/Makefile | 1 arch/csky/Kconfig | 1 arch/csky/include/asm/highmem.h | 4 arch/csky/mm/highmem.c | 75 ------------- arch/microblaze/Kconfig | 1 arch/microblaze/include/asm/highmem.h | 6 - arch/microblaze/mm/Makefile | 1 arch/microblaze/mm/init.c | 6 - arch/mips/Kconfig | 1 arch/mips/include/asm/highmem.h | 4 arch/mips/mm/highmem.c | 77 ------------- arch/mips/mm/init.c | 3 arch/nds32/Kconfig.cpu | 1 arch/nds32/include/asm/highmem.h | 21 ++- arch/nds32/mm/Makefile | 1 arch/powerpc/Kconfig | 1 arch/powerpc/include/asm/highmem.h | 6 - arch/powerpc/mm/Makefile | 1 arch/powerpc/mm/mem.c | 7 - arch/sparc/Kconfig | 1 arch/sparc/include/asm/highmem.h | 7 - arch/sparc/mm/Makefile | 3 arch/sparc/mm/srmmu.c | 2 arch/x86/include/asm/fixmap.h | 1 arch/x86/include/asm/highmem.h | 12 +- arch/x86/include/asm/iomap.h | 29 +++-- arch/x86/mm/highmem_32.c | 59 ---------- arch/x86/mm/init_32.c | 15 -- arch/x86/mm/iomap_32.c | 57 ---------- arch/xtensa/Kconfig | 1 arch/xtensa/include/asm/highmem.h | 9 + arch/xtensa/mm/highmem.c | 44 ------- b/arch/x86/Kconfig | 3 include/linux/highmem.h | 141 +++++++++++++++--------- include/linux/io-mapping.h | 2 include/linux/sched.h | 9 + kernel/sched/core.c | 10 + mm/Kconfig | 3 mm/highmem.c | 192 ++++++++++++++++++++++++++++++++-- 49 files changed, 422 insertions(+), 909 deletions(-)
Nothing in modules can use that.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- mm/highmem.c | 2 -- 1 file changed, 2 deletions(-)
--- a/mm/highmem.c +++ b/mm/highmem.c @@ -108,8 +108,6 @@ static inline wait_queue_head_t *get_pkm atomic_long_t _totalhigh_pages __read_mostly; EXPORT_SYMBOL(_totalhigh_pages);
-EXPORT_PER_CPU_SYMBOL(__kmap_atomic_idx); - unsigned int nr_free_highpages (void) { struct zone *zone;
The kmap_atomic* interfaces in all architectures are pretty much the same except for post map operations (flush) and pre- and post unmap operations.
Provide a generic variant for that.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- include/linux/highmem.h | 87 ++++++++++++++++++++++++++++------- mm/Kconfig | 3 + mm/highmem.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 192 insertions(+), 17 deletions(-)
--- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -31,9 +31,22 @@ static inline void invalidate_kernel_vma
#include <asm/kmap_types.h>
+/* + * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft. + */ +#ifdef CONFIG_KMAP_ATOMIC_GENERIC +void *kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot); +void *kmap_atomic_page_prot(struct page *page, pgprot_t prot); +void kunmap_atomic_indexed(void *vaddr); +# ifndef ARCH_NEEDS_KMAP_HIGH_GET +static inline void *arch_kmap_temporary_high_get(struct page *page) +{ + return NULL; +} +# endif +#endif + #ifdef CONFIG_HIGHMEM -extern void *kmap_atomic_high_prot(struct page *page, pgprot_t prot); -extern void kunmap_atomic_high(void *kvaddr); #include <asm/highmem.h>
#ifndef ARCH_HAS_KMAP_FLUSH_TLB @@ -81,6 +94,11 @@ static inline void kunmap(struct page *p * be used in IRQ contexts, so in some (very limited) cases we need * it. */ + +#ifndef CONFIG_KMAP_ATOMIC_GENERIC +void *kmap_atomic_high_prot(struct page *page, pgprot_t prot); +void kunmap_atomic_high(void *kvaddr); + static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) { preempt_disable(); @@ -89,7 +107,38 @@ static inline void *kmap_atomic_prot(str return page_address(page); return kmap_atomic_high_prot(page, prot); } -#define kmap_atomic(page) kmap_atomic_prot(page, kmap_prot) + +static inline void __kunmap_atomic(void *vaddr) +{ + kunmap_atomic_high(vaddr); + pagefault_enable(); +} +#else /* !CONFIG_KMAP_ATOMIC_GENERIC */ + +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) +{ + preempt_disable(); + return kmap_atomic_page_prot(page, prot); +} + +static inline void *kmap_atomic_pfn(unsigned long pfn) +{ + preempt_disable(); + return kmap_atomic_pfn_prot(pfn, kmap_prot); +} + +static inline void __kunmap_atomic(void *addr) +{ + kumap_atomic_indexed(addr); +} + + +#endif /* CONFIG_KMAP_ATOMIC_GENERIC */ + +static inline void *kmap_atomic(struct page *page) +{ + return kmap_atomic_prot(page, kmap_prot); +}
/* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); @@ -157,21 +206,29 @@ static inline void *kmap_atomic(struct p pagefault_disable(); return page_address(page); } -#define kmap_atomic_prot(page, prot) kmap_atomic(page)
-static inline void kunmap_atomic_high(void *addr) +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) +{ + return kmap_atomic(page); +} + +static inline void *kmap_atomic_pfn(unsigned long pfn) +{ + return kmap_atomic(pfn_to_page(pfn)); +} + +static inline void __kunmap_atomic(void *addr) { /* * Mostly nothing to do in the CONFIG_HIGHMEM=n case as kunmap_atomic() - * handles re-enabling faults + preemption + * handles preemption */ #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(addr); #endif + pagefault_enable(); }
-#define kmap_atomic_pfn(pfn) kmap_atomic(pfn_to_page(pfn)) - #define kmap_flush_unused() do {} while(0)
#endif /* CONFIG_HIGHMEM */ @@ -213,14 +270,12 @@ static inline void kmap_atomic_idx_pop(v * Prevent people trying to call kunmap_atomic() as if it were kunmap() * kunmap_atomic() should get the return value of kmap_atomic, not the page. */ -#define kunmap_atomic(addr) \ -do { \ - BUILD_BUG_ON(__same_type((addr), struct page *)); \ - kunmap_atomic_high(addr); \ - pagefault_enable(); \ - preempt_enable(); \ -} while (0) - +#define kunmap_atomic(addr) \ + do { \ + BUILD_BUG_ON(__same_type((addr), struct page *)); \ + __kunmap_atomic(addr); \ + preempt_enable(); \ + } while (0)
/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ #ifndef clear_user_highpage --- a/mm/Kconfig +++ b/mm/Kconfig @@ -868,4 +868,7 @@ config ARCH_HAS_HUGEPD config MAPPING_DIRTY_HELPERS bool
+config KMAP_ATOMIC_GENERIC + bool + endmenu --- a/mm/highmem.c +++ b/mm/highmem.c @@ -314,6 +314,15 @@ void *kmap_high_get(struct page *page) unlock_kmap_any(flags); return (void*) vaddr; } + +/* Unmap a temporary mapping which was obtained by kmap_high_get() */ +static void kmap_high_unmap_temporary(unsigned long vaddr) +{ + if (vaddr >= PKMAP_ADDR(0) && vaddr < PKMAP_ADDR(LAST_PKMAP)) + kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)])); +} +#else +static inline void kmap_high_unmap_temporary(unsigned long vaddr) { } #endif
/** @@ -365,8 +374,116 @@ void kunmap_high(struct page *page) if (need_wakeup) wake_up(pkmap_map_wait); } - EXPORT_SYMBOL(kunmap_high); +#else +static inline void kmap_high_unmap_temporary(unsigned long vaddr) { } +#endif /* CONFIG_HIGHMEM */ + +#ifdef CONFIG_KMAP_ATOMIC_GENERIC +#ifndef arch_kmap_temp_post_map +# define arch_kmap_temp_post_map(vaddr, pteval) do { } while (0) +#endif + +#ifndef arch_kmap_temp_pre_unmap +# define arch_kmap_temp_pre_unmap(vaddr) do { } while (0) +#endif + +#ifndef arch_kmap_temp_post_unmap +# define arch_kmap_temp_post_unmap(vaddr) do { } while (0) +#endif + +#ifndef arch_kmap_temp_map_idx +#define arch_kmap_temp_map_idx(type, pfn) kmap_temp_idx(type) +#endif + +#ifndef arch_kmap_temp_unmap_idx +#define arch_kmap_temp_unmap_idx(type, vaddr) kmap_temp_idx(type) +#endif + +static inline int kmap_temp_idx(int type) +{ + return type + KM_TYPE_NR * smp_processor_id(); +} + +static pte_t *__kmap_pte; + +static pte_t *kmap_get_pte(void) +{ + if (!__kmap_pte) + __kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN)); + return __kmap_pte; +} + +static void *__kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot) +{ + pte_t pteval, *kmap_pte = kmap_get_pte(); + unsigned long vaddr; + int idx; + + preempt_disable(); + idx = arch_kmap_temp_map_idx(kmap_atomic_idx_push(), pfn); + vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); + BUG_ON(!pte_none(*(kmap_pte - idx))); + pteval = pfn_pte(pfn, prot); + set_pte(kmap_pte - idx, pteval); + arch_kmap_temp_post_map(vaddr, pteval); + preempt_enable(); + + return (void *)vaddr; +} + +void *kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot) +{ + pagefault_disable(); + return __kmap_atomic_pfn_prot(pfn, prot); +} +EXPORT_SYMBOL(kmap_atomic_pfn_prot); + +void *kmap_atomic_page_prot(struct page *page, pgprot_t prot) +{ + void *kmap; + + pagefault_disable(); + if (!PageHighMem(page)) + return page_address(page); + + /* Try kmap_high_get() if architecture has it enabled */ + kmap = arch_kmap_temporary_high_get(page); + if (kmap) + return kmap; + + return __kmap_atomic_pfn_prot(page_to_pfn(page), prot); +} +EXPORT_SYMBOL(kmap_atomic_page_prot); + +void kunmap_atomic_indexed(void *vaddr) +{ + unsigned long addr = (unsigned long) vaddr & PAGE_MASK; + pte_t *kmap_pte = kmap_get_pte(); + int idx; + + if (addr < __fix_to_virt(FIX_KMAP_END) || + addr > __fix_to_virt(FIX_KMAP_BEGIN)) { + WARN_ON_ONCE(addr < PAGE_OFFSET); + + /* Handle mappings which were obtained by kmap_high_get() */ + kmap_high_unmap_temporary(addr); + pagefault_enable(); + return; + } + + preempt_disable(); + idx = arch_kmap_temp_unmap_idx(kmap_atomic_idx(), addr); + WARN_ON_ONCE(addr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); + + arch_kmap_temp_pre_unmap(addr); + pte_clear(&init_mm, addr, kmap_pte - idx); + arch_kmap_temp_post_unmap(addr); + kmap_atomic_idx_pop(); + preempt_enable(); + pagefault_enable(); +} +EXPORT_SYMBOL(kunmap_atomic_indexed); #endif
#if defined(HASHED_PAGE_VIRTUAL)
Convert X86 to the generic kmap atomic implementation.
Make the iomap_atomic() naming convention consistent while at it.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- arch/x86/Kconfig | 3 +- arch/x86/include/asm/fixmap.h | 1 arch/x86/include/asm/highmem.h | 12 ++++++-- arch/x86/include/asm/iomap.h | 17 ++++++----- arch/x86/mm/highmem_32.c | 59 ----------------------------------------- arch/x86/mm/init_32.c | 15 ---------- arch/x86/mm/iomap_32.c | 58 ++-------------------------------------- include/linux/io-mapping.h | 2 - 8 files changed, 25 insertions(+), 142 deletions(-)
--- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -14,10 +14,11 @@ config X86_32 select ARCH_WANT_IPC_PARSE_VERSION select CLKSRC_I8253 select CLONE_BACKWARDS + select GENERIC_VDSO_32 select HAVE_DEBUG_STACKOVERFLOW + select KMAP_ATOMIC_GENERIC select MODULES_USE_ELF_REL select OLD_SIGACTION - select GENERIC_VDSO_32
config X86_64 def_bool y --- a/arch/x86/include/asm/fixmap.h +++ b/arch/x86/include/asm/fixmap.h @@ -151,7 +151,6 @@ extern void reserve_top_address(unsigned
extern int fixmaps_set;
-extern pte_t *kmap_pte; extern pte_t *pkmap_page_table;
void __native_set_fixmap(enum fixed_addresses idx, pte_t pte); --- a/arch/x86/include/asm/highmem.h +++ b/arch/x86/include/asm/highmem.h @@ -58,11 +58,17 @@ extern unsigned long highstart_pfn, high #define PKMAP_NR(virt) ((virt-PKMAP_BASE) >> PAGE_SHIFT) #define PKMAP_ADDR(nr) (PKMAP_BASE + ((nr) << PAGE_SHIFT))
-void *kmap_atomic_pfn(unsigned long pfn); -void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot); - #define flush_cache_kmaps() do { } while (0)
+#define arch_kmap_temp_post_map(vaddr, pteval) \ + arch_flush_lazy_mmu_mode() + +#define arch_kmap_temp_post_unmap(vaddr) \ + do { \ + flush_tlb_one_kernel((vaddr)); \ + arch_flush_lazy_mmu_mode(); \ + } while (0) + extern void add_highpages_with_active_regions(int nid, unsigned long start_pfn, unsigned long end_pfn);
--- a/arch/x86/include/asm/iomap.h +++ b/arch/x86/include/asm/iomap.h @@ -9,19 +9,20 @@ #include <linux/fs.h> #include <linux/mm.h> #include <linux/uaccess.h> +#include <linux/highmem.h> #include <asm/cacheflush.h> #include <asm/tlbflush.h>
-void __iomem * -iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot); +void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot);
-void -iounmap_atomic(void __iomem *kvaddr); +static inline void iounmap_atomic(void __iomem *vaddr) +{ + kunmap_atomic_indexed((void __force *)vaddr); + preempt_enable(); +}
-int -iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot); +int iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
-void -iomap_free(resource_size_t base, unsigned long size); +void iomap_free(resource_size_t base, unsigned long size);
#endif /* _ASM_X86_IOMAP_H */ --- a/arch/x86/mm/highmem_32.c +++ b/arch/x86/mm/highmem_32.c @@ -4,65 +4,6 @@ #include <linux/swap.h> /* for totalram_pages */ #include <linux/memblock.h>
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned long vaddr; - int idx, type; - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - BUG_ON(!pte_none(*(kmap_pte-idx))); - set_pte(kmap_pte-idx, mk_pte(page, prot)); - arch_flush_lazy_mmu_mode(); - - return (void *)vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -/* - * This is the same as kmap_atomic() but can map memory that doesn't - * have a struct page associated with it. - */ -void *kmap_atomic_pfn(unsigned long pfn) -{ - return kmap_atomic_prot_pfn(pfn, kmap_prot); -} -EXPORT_SYMBOL_GPL(kmap_atomic_pfn); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - - if (vaddr >= __fix_to_virt(FIX_KMAP_END) && - vaddr <= __fix_to_virt(FIX_KMAP_BEGIN)) { - int idx, type; - - type = kmap_atomic_idx(); - idx = type + KM_TYPE_NR * smp_processor_id(); - -#ifdef CONFIG_DEBUG_HIGHMEM - WARN_ON_ONCE(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); -#endif - /* - * Force other mappings to Oops if they'll try to access this - * pte without first remap it. Keeping stale mappings around - * is a bad idea also, in case the page changes cacheability - * attributes or becomes a protected page in a hypervisor. - */ - kpte_clear_flush(kmap_pte-idx, vaddr); - kmap_atomic_idx_pop(); - arch_flush_lazy_mmu_mode(); - } -#ifdef CONFIG_DEBUG_HIGHMEM - else { - BUG_ON(vaddr < PAGE_OFFSET); - BUG_ON(vaddr >= (unsigned long)high_memory); - } -#endif -} -EXPORT_SYMBOL(kunmap_atomic_high); - void __init set_highmem_pages_init(void) { struct zone *zone; --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -394,19 +394,6 @@ kernel_physical_mapping_init(unsigned lo return last_map_addr; }
-pte_t *kmap_pte; - -static void __init kmap_init(void) -{ - unsigned long kmap_vstart; - - /* - * Cache the first kmap pte: - */ - kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN); - kmap_pte = virt_to_kpte(kmap_vstart); -} - #ifdef CONFIG_HIGHMEM static void __init permanent_kmaps_init(pgd_t *pgd_base) { @@ -712,8 +699,6 @@ void __init paging_init(void)
__flush_tlb_all();
- kmap_init(); - /* * NOTE: at this point the bootmem allocator is fully available. */ --- a/arch/x86/mm/iomap_32.c +++ b/arch/x86/mm/iomap_32.c @@ -44,28 +44,7 @@ void iomap_free(resource_size_t base, un } EXPORT_SYMBOL_GPL(iomap_free);
-void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) -{ - unsigned long vaddr; - int idx, type; - - preempt_disable(); - pagefault_disable(); - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR * smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - set_pte(kmap_pte - idx, pfn_pte(pfn, prot)); - arch_flush_lazy_mmu_mode(); - - return (void *)vaddr; -} - -/* - * Map 'pfn' using protections 'prot' - */ -void __iomem * -iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) +void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot) { /* * For non-PAT systems, translate non-WB request to UC- just in @@ -81,36 +60,7 @@ iomap_atomic_prot_pfn(unsigned long pfn, /* Filter out unsupported __PAGE_KERNEL* bits: */ pgprot_val(prot) &= __default_kernel_pte_mask;
- return (void __force __iomem *) kmap_atomic_prot_pfn(pfn, prot); -} -EXPORT_SYMBOL_GPL(iomap_atomic_prot_pfn); - -void -iounmap_atomic(void __iomem *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - - if (vaddr >= __fix_to_virt(FIX_KMAP_END) && - vaddr <= __fix_to_virt(FIX_KMAP_BEGIN)) { - int idx, type; - - type = kmap_atomic_idx(); - idx = type + KM_TYPE_NR * smp_processor_id(); - -#ifdef CONFIG_DEBUG_HIGHMEM - WARN_ON_ONCE(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); -#endif - /* - * Force other mappings to Oops if they'll try to access this - * pte without first remap it. Keeping stale mappings around - * is a bad idea also, in case the page changes cacheability - * attributes or becomes a protected page in a hypervisor. - */ - kpte_clear_flush(kmap_pte-idx, vaddr); - kmap_atomic_idx_pop(); - } - - pagefault_enable(); - preempt_enable(); + preempt_disable(); + return (void __force __iomem *)kmap_atomic_pfn_prot(pfn, prot); } -EXPORT_SYMBOL_GPL(iounmap_atomic); +EXPORT_SYMBOL_GPL(iomap_atomic_pfn_prot); --- a/include/linux/io-mapping.h +++ b/include/linux/io-mapping.h @@ -69,7 +69,7 @@ io_mapping_map_atomic_wc(struct io_mappi
BUG_ON(offset >= mapping->size); phys_addr = mapping->base + offset; - return iomap_atomic_prot_pfn(PHYS_PFN(phys_addr), mapping->prot); + return iomap_atomic_pfn_prot(PHYS_PFN(phys_addr), mapping->prot); }
static inline void
Adopt the map ordering to match the other architectures and the generic code.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Vineet Gupta vgupta@synopsys.com Cc: linux-snps-arc@lists.infradead.org --- Note: Completely untested --- arch/arc/Kconfig | 1 arch/arc/include/asm/highmem.h | 8 ++++++- arch/arc/mm/highmem.c | 44 ----------------------------------------- 3 files changed, 9 insertions(+), 44 deletions(-)
--- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -508,6 +508,7 @@ config LINUX_RAM_BASE config HIGHMEM bool "High Memory Support" select ARCH_DISCONTIGMEM_ENABLE + select KMAP_ATOMIC_GENERIC help With ARC 2G:2G address split, only upper 2G is directly addressable by kernel. Enable this to potentially allow access to rest of 2G and PAE --- a/arch/arc/include/asm/highmem.h +++ b/arch/arc/include/asm/highmem.h @@ -15,7 +15,10 @@ #define FIXMAP_BASE (PAGE_OFFSET - FIXMAP_SIZE - PKMAP_SIZE) #define FIXMAP_SIZE PGDIR_SIZE /* only 1 PGD worth */ #define KM_TYPE_NR ((FIXMAP_SIZE >> PAGE_SHIFT)/NR_CPUS) -#define FIXMAP_ADDR(nr) (FIXMAP_BASE + ((nr) << PAGE_SHIFT)) + +#define FIX_KMAP_BEGIN (0) +#define FIX_KMAP_END ((FIXMAP_SIZE >> PAGE_SHIFT) - 1) +#define FIXADDR_TOP (FIXMAP_BASE + FIXMAP_SIZE - PAGE_SIZE)
/* start after fixmap area */ #define PKMAP_BASE (FIXMAP_BASE + FIXMAP_SIZE) @@ -29,6 +32,9 @@
extern void kmap_init(void);
+#define arch_kmap_temp_post_unmap(vaddr) \ + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE) + static inline void flush_cache_kmaps(void) { flush_cache_all(); --- a/arch/arc/mm/highmem.c +++ b/arch/arc/mm/highmem.c @@ -47,48 +47,6 @@ */
extern pte_t * pkmap_page_table; -static pte_t * fixmap_page_table; - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - int idx, cpu_idx; - unsigned long vaddr; - - cpu_idx = kmap_atomic_idx_push(); - idx = cpu_idx + KM_TYPE_NR * smp_processor_id(); - vaddr = FIXMAP_ADDR(idx); - - set_pte_at(&init_mm, vaddr, fixmap_page_table + idx, - mk_pte(page, prot)); - - return (void *)vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kv) -{ - unsigned long kvaddr = (unsigned long)kv; - - if (kvaddr >= FIXMAP_BASE && kvaddr < (FIXMAP_BASE + FIXMAP_SIZE)) { - - /* - * Because preemption is disabled, this vaddr can be associated - * with the current allocated index. - * But in case of multiple live kmap_atomic(), it still relies on - * callers to unmap in right order. - */ - int cpu_idx = kmap_atomic_idx(); - int idx = cpu_idx + KM_TYPE_NR * smp_processor_id(); - - WARN_ON(kvaddr != FIXMAP_ADDR(idx)); - - pte_clear(&init_mm, kvaddr, fixmap_page_table + idx); - local_flush_tlb_kernel_range(kvaddr, kvaddr + PAGE_SIZE); - - kmap_atomic_idx_pop(); - } -} -EXPORT_SYMBOL(kunmap_atomic_high);
static noinline pte_t * __init alloc_kmap_pgtable(unsigned long kvaddr) { @@ -113,5 +71,5 @@ void __init kmap_init(void) pkmap_page_table = alloc_kmap_pgtable(PKMAP_BASE);
BUILD_BUG_ON(LAST_PKMAP > PTRS_PER_PTE); - fixmap_page_table = alloc_kmap_pgtable(FIXMAP_BASE); + alloc_kmap_pgtable(FIXMAP_BASE); }
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Russell King linux@armlinux.org.uk Cc: Arnd Bergmann arnd@arndb.de Cc: linux-arm-kernel@lists.infradead.org --- Note: Completely untested --- arch/arm/Kconfig | 1 arch/arm/include/asm/highmem.h | 30 +++++++--- arch/arm/mm/Makefile | 1 arch/arm/mm/highmem.c | 121 ----------------------------------------- 4 files changed, 23 insertions(+), 130 deletions(-)
--- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1499,6 +1499,7 @@ config HAVE_ARCH_PFN_VALID config HIGHMEM bool "High Memory Support" depends on MMU + select KMAP_ATOMIC_GENERIC help The address space of ARM processors is only 4 Gigabytes large and it has to accommodate user address space, kernel address --- a/arch/arm/include/asm/highmem.h +++ b/arch/arm/include/asm/highmem.h @@ -46,19 +46,33 @@ extern pte_t *pkmap_page_table;
#ifdef ARCH_NEEDS_KMAP_HIGH_GET extern void *kmap_high_get(struct page *page); + +#ifdef CONFIG_DEBUG_HIGHMEM +extern void *arch_kmap_temporary_high_get(struct page *page); #else +static inline void *arch_kmap_temporary_high_get(struct page *page) +{ + return kmap_high_get(page); +} +#endif /* !CONFIG_DEBUG_HIGHMEM */ + +#else /* ARCH_NEEDS_KMAP_HIGH_GET */ static inline void *kmap_high_get(struct page *page) { return NULL; } -#endif +#endif /* !ARCH_NEEDS_KMAP_HIGH_GET */
-/* - * The following functions are already defined by <linux/highmem.h> - * when CONFIG_HIGHMEM is not set. - */ -#ifdef CONFIG_HIGHMEM -extern void *kmap_atomic_pfn(unsigned long pfn); -#endif +#define arch_kmap_temp_post_map(vaddr, pteval) \ + local_flush_tlb_kernel_page(vaddr) + +#define arch_kmap_temp_pre_unmap(vaddr) \ +do { \ + if (cache_is_vivt()) \ + __cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE); \ +} while (0) + +#define arch_kmap_temp_post_unmap(vaddr) \ + local_flush_tlb_kernel_page(vaddr)
#endif --- a/arch/arm/mm/Makefile +++ b/arch/arm/mm/Makefile @@ -19,7 +19,6 @@ obj-$(CONFIG_MODULES) += proc-syms.o obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_ALIGNMENT_TRAP) += alignment.o -obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o obj-$(CONFIG_ARM_PV_FIXUP) += pv-fixup-asm.o
--- a/arch/arm/mm/highmem.c +++ /dev/null @@ -1,121 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * arch/arm/mm/highmem.c -- ARM highmem support - * - * Author: Nicolas Pitre - * Created: september 8, 2008 - * Copyright: Marvell Semiconductors Inc. - */ - -#include <linux/module.h> -#include <linux/highmem.h> -#include <linux/interrupt.h> -#include <asm/fixmap.h> -#include <asm/cacheflush.h> -#include <asm/tlbflush.h> -#include "mm.h" - -static inline void set_fixmap_pte(int idx, pte_t pte) -{ - unsigned long vaddr = __fix_to_virt(idx); - pte_t *ptep = virt_to_kpte(vaddr); - - set_pte_ext(ptep, pte, 0); - local_flush_tlb_kernel_page(vaddr); -} - -static inline pte_t get_fixmap_pte(unsigned long vaddr) -{ - pte_t *ptep = virt_to_kpte(vaddr); - - return *ptep; -} - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned int idx; - unsigned long vaddr; - void *kmap; - int type; - -#ifdef CONFIG_DEBUG_HIGHMEM - /* - * There is no cache coherency issue when non VIVT, so force the - * dedicated kmap usage for better debugging purposes in that case. - */ - if (!cache_is_vivt()) - kmap = NULL; - else -#endif - kmap = kmap_high_get(page); - if (kmap) - return kmap; - - type = kmap_atomic_idx_push(); - - idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); - vaddr = __fix_to_virt(idx); -#ifdef CONFIG_DEBUG_HIGHMEM - /* - * With debugging enabled, kunmap_atomic forces that entry to 0. - * Make sure it was indeed properly unmapped. - */ - BUG_ON(!pte_none(get_fixmap_pte(vaddr))); -#endif - /* - * When debugging is off, kunmap_atomic leaves the previous mapping - * in place, so the contained TLB flush ensures the TLB is updated - * with the new mapping. - */ - set_fixmap_pte(idx, mk_pte(page, prot)); - - return (void *)vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - int idx, type; - - if (kvaddr >= (void *)FIXADDR_START) { - type = kmap_atomic_idx(); - idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); - - if (cache_is_vivt()) - __cpuc_flush_dcache_area((void *)vaddr, PAGE_SIZE); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(vaddr != __fix_to_virt(idx)); - set_fixmap_pte(idx, __pte(0)); -#else - (void) idx; /* to kill a warning */ -#endif - kmap_atomic_idx_pop(); - } else if (vaddr >= PKMAP_ADDR(0) && vaddr < PKMAP_ADDR(LAST_PKMAP)) { - /* this address was obtained through kmap_high_get() */ - kunmap_high(pte_page(pkmap_page_table[PKMAP_NR(vaddr)])); - } -} -EXPORT_SYMBOL(kunmap_atomic_high); - -void *kmap_atomic_pfn(unsigned long pfn) -{ - unsigned long vaddr; - int idx, type; - struct page *page = pfn_to_page(pfn); - - preempt_disable(); - pagefault_disable(); - if (!PageHighMem(page)) - return page_address(page); - - type = kmap_atomic_idx_push(); - idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id(); - vaddr = __fix_to_virt(idx); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(!pte_none(get_fixmap_pte(vaddr))); -#endif - set_fixmap_pte(idx, pfn_pte(pfn, kmap_prot)); - - return (void *)vaddr; -}
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Guo Ren guoren@kernel.org Cc: linux-csky@vger.kernel.org --- Note: Completely untested --- arch/csky/Kconfig | 1 arch/csky/include/asm/highmem.h | 4 +- arch/csky/mm/highmem.c | 75 ---------------------------------------- 3 files changed, 5 insertions(+), 75 deletions(-)
--- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -285,6 +285,7 @@ config NR_CPUS config HIGHMEM bool "High Memory Support" depends on !CPU_CK610 + select KMAP_ATOMIC_GENERIC default y
config FORCE_MAX_ZONEORDER --- a/arch/csky/include/asm/highmem.h +++ b/arch/csky/include/asm/highmem.h @@ -32,10 +32,12 @@ extern pte_t *pkmap_page_table;
#define ARCH_HAS_KMAP_FLUSH_TLB extern void kmap_flush_tlb(unsigned long addr); -extern void *kmap_atomic_pfn(unsigned long pfn);
#define flush_cache_kmaps() do {} while (0)
+#define arch_kmap_temp_post_map(vaddr, pteval) kmap_flush_tlb(vaddr) +#define arch_kmap_temp_post_unmap(vaddr) kmap_flush_tlb(vaddr) + extern void kmap_init(void);
#endif /* __KERNEL__ */ --- a/arch/csky/mm/highmem.c +++ b/arch/csky/mm/highmem.c @@ -9,8 +9,6 @@ #include <asm/tlbflush.h> #include <asm/cacheflush.h>
-static pte_t *kmap_pte; - unsigned long highstart_pfn, highend_pfn;
void kmap_flush_tlb(unsigned long addr) @@ -19,67 +17,7 @@ void kmap_flush_tlb(unsigned long addr) } EXPORT_SYMBOL(kmap_flush_tlb);
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned long vaddr; - int idx, type; - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(!pte_none(*(kmap_pte - idx))); -#endif - set_pte(kmap_pte-idx, mk_pte(page, prot)); - flush_tlb_one((unsigned long)vaddr); - - return (void *)vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - int idx; - - if (vaddr < FIXADDR_START) - return; - -#ifdef CONFIG_DEBUG_HIGHMEM - idx = KM_TYPE_NR*smp_processor_id() + kmap_atomic_idx(); - - BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); - - pte_clear(&init_mm, vaddr, kmap_pte - idx); - flush_tlb_one(vaddr); -#else - (void) idx; /* to kill a warning */ -#endif - kmap_atomic_idx_pop(); -} -EXPORT_SYMBOL(kunmap_atomic_high); - -/* - * This is the same as kmap_atomic() but can map memory that doesn't - * have a struct page associated with it. - */ -void *kmap_atomic_pfn(unsigned long pfn) -{ - unsigned long vaddr; - int idx, type; - - pagefault_disable(); - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL)); - flush_tlb_one(vaddr); - - return (void *) vaddr; -} - -static void __init kmap_pages_init(void) +void __init kmap_init(void) { unsigned long vaddr; pgd_t *pgd; @@ -96,14 +34,3 @@ static void __init kmap_pages_init(void) pte = pte_offset_kernel(pmd, vaddr); pkmap_page_table = pte; } - -void __init kmap_init(void) -{ - unsigned long vaddr; - - kmap_pages_init(); - - vaddr = __fix_to_virt(FIX_KMAP_BEGIN); - - kmap_pte = pte_offset_kernel((pmd_t *)pgd_offset_k(vaddr), vaddr); -}
Acked-by: Guo Ren guoren@kernel.org
On Sat, Sep 19, 2020 at 5:50 PM Thomas Gleixner tglx@linutronix.de wrote:
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Guo Ren guoren@kernel.org Cc: linux-csky@vger.kernel.org
Note: Completely untested
arch/csky/Kconfig | 1 arch/csky/include/asm/highmem.h | 4 +- arch/csky/mm/highmem.c | 75 ---------------------------------------- 3 files changed, 5 insertions(+), 75 deletions(-)
--- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -285,6 +285,7 @@ config NR_CPUS config HIGHMEM bool "High Memory Support" depends on !CPU_CK610
select KMAP_ATOMIC_GENERIC default y
config FORCE_MAX_ZONEORDER --- a/arch/csky/include/asm/highmem.h +++ b/arch/csky/include/asm/highmem.h @@ -32,10 +32,12 @@ extern pte_t *pkmap_page_table;
#define ARCH_HAS_KMAP_FLUSH_TLB extern void kmap_flush_tlb(unsigned long addr); -extern void *kmap_atomic_pfn(unsigned long pfn);
#define flush_cache_kmaps() do {} while (0)
+#define arch_kmap_temp_post_map(vaddr, pteval) kmap_flush_tlb(vaddr) +#define arch_kmap_temp_post_unmap(vaddr) kmap_flush_tlb(vaddr)
extern void kmap_init(void);
#endif /* __KERNEL__ */ --- a/arch/csky/mm/highmem.c +++ b/arch/csky/mm/highmem.c @@ -9,8 +9,6 @@ #include <asm/tlbflush.h> #include <asm/cacheflush.h>
-static pte_t *kmap_pte;
unsigned long highstart_pfn, highend_pfn;
void kmap_flush_tlb(unsigned long addr) @@ -19,67 +17,7 @@ void kmap_flush_tlb(unsigned long addr) } EXPORT_SYMBOL(kmap_flush_tlb);
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{
unsigned long vaddr;
int idx, type;
type = kmap_atomic_idx_push();
idx = type + KM_TYPE_NR*smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-#ifdef CONFIG_DEBUG_HIGHMEM
BUG_ON(!pte_none(*(kmap_pte - idx)));
-#endif
set_pte(kmap_pte-idx, mk_pte(page, prot));
flush_tlb_one((unsigned long)vaddr);
return (void *)vaddr;
-} -EXPORT_SYMBOL(kmap_atomic_high_prot);
-void kunmap_atomic_high(void *kvaddr) -{
unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
int idx;
if (vaddr < FIXADDR_START)
return;
-#ifdef CONFIG_DEBUG_HIGHMEM
idx = KM_TYPE_NR*smp_processor_id() + kmap_atomic_idx();
BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
pte_clear(&init_mm, vaddr, kmap_pte - idx);
flush_tlb_one(vaddr);
-#else
(void) idx; /* to kill a warning */
-#endif
kmap_atomic_idx_pop();
-} -EXPORT_SYMBOL(kunmap_atomic_high);
-/*
- This is the same as kmap_atomic() but can map memory that doesn't
- have a struct page associated with it.
- */
-void *kmap_atomic_pfn(unsigned long pfn) -{
unsigned long vaddr;
int idx, type;
pagefault_disable();
type = kmap_atomic_idx_push();
idx = type + KM_TYPE_NR*smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL));
flush_tlb_one(vaddr);
return (void *) vaddr;
-}
-static void __init kmap_pages_init(void) +void __init kmap_init(void) { unsigned long vaddr; pgd_t *pgd; @@ -96,14 +34,3 @@ static void __init kmap_pages_init(void) pte = pte_offset_kernel(pmd, vaddr); pkmap_page_table = pte; }
-void __init kmap_init(void) -{
unsigned long vaddr;
kmap_pages_init();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN);
kmap_pte = pte_offset_kernel((pmd_t *)pgd_offset_k(vaddr), vaddr);
-}
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Michal Simek monstr@monstr.eu --- Note: Completely untested --- arch/microblaze/Kconfig | 1 arch/microblaze/include/asm/highmem.h | 6 ++ arch/microblaze/mm/Makefile | 1 arch/microblaze/mm/highmem.c | 78 ---------------------------------- arch/microblaze/mm/init.c | 6 -- 5 files changed, 6 insertions(+), 86 deletions(-)
--- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -170,6 +170,7 @@ config XILINX_UNCACHED_SHADOW config HIGHMEM bool "High memory support" depends on MMU + select KMAP_ATOMIC_GENERIC help The address space of Microblaze processors is only 4 Gigabytes large and it has to accommodate user address space, kernel address --- a/arch/microblaze/include/asm/highmem.h +++ b/arch/microblaze/include/asm/highmem.h @@ -25,7 +25,6 @@ #include <linux/uaccess.h> #include <asm/fixmap.h>
-extern pte_t *kmap_pte; extern pte_t *pkmap_page_table;
/* @@ -52,6 +51,11 @@ extern pte_t *pkmap_page_table;
#define flush_cache_kmaps() { flush_icache(); flush_dcache(); }
+#define arch_kmap_temp_post_map(vaddr, pteval) \ + local_flush_tlb_page(NULL, vaddr); +#define arch_kmap_temp_post_unmap(vaddr) \ + local_flush_tlb_page(NULL, vaddr); + #endif /* __KERNEL__ */
#endif /* _ASM_HIGHMEM_H */ --- a/arch/microblaze/mm/Makefile +++ b/arch/microblaze/mm/Makefile @@ -6,4 +6,3 @@ obj-y := consistent.o init.o
obj-$(CONFIG_MMU) += pgtable.o mmu_context.o fault.o -obj-$(CONFIG_HIGHMEM) += highmem.o --- a/arch/microblaze/mm/highmem.c +++ /dev/null @@ -1,78 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * highmem.c: virtual kernel memory mappings for high memory - * - * PowerPC version, stolen from the i386 version. - * - * Used in CONFIG_HIGHMEM systems for memory pages which - * are not addressable by direct kernel virtual addresses. - * - * Copyright (C) 1999 Gerhard Wichert, Siemens AG - * Gerhard.Wichert@pdb.siemens.de - * - * - * Redesigned the x86 32-bit VM architecture to deal with - * up to 16 Terrabyte physical memory. With current x86 CPUs - * we now support up to 64 Gigabytes physical RAM. - * - * Copyright (C) 1999 Ingo Molnar mingo@redhat.com - * - * Reworked for PowerPC by various contributors. Moved from - * highmem.h by Benjamin Herrenschmidt (c) 2009 IBM Corp. - */ - -#include <linux/export.h> -#include <linux/highmem.h> - -/* - * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap - * gives a more generic (and caching) interface. But kmap_atomic can - * be used in IRQ contexts, so in some (very limited) cases we need - * it. - */ -#include <asm/tlbflush.h> - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - - unsigned long vaddr; - int idx, type; - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(!pte_none(*(kmap_pte-idx))); -#endif - set_pte_at(&init_mm, vaddr, kmap_pte-idx, mk_pte(page, prot)); - local_flush_tlb_page(NULL, vaddr); - - return (void *) vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - int type; - unsigned int idx; - - if (vaddr < __fix_to_virt(FIX_KMAP_END)) - return; - - type = kmap_atomic_idx(); - - idx = type + KM_TYPE_NR * smp_processor_id(); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); -#endif - /* - * force other mappings to Oops if they'll try to access - * this pte without first remap it - */ - pte_clear(&init_mm, vaddr, kmap_pte-idx); - local_flush_tlb_page(NULL, vaddr); - - kmap_atomic_idx_pop(); -} -EXPORT_SYMBOL(kunmap_atomic_high); --- a/arch/microblaze/mm/init.c +++ b/arch/microblaze/mm/init.c @@ -49,17 +49,11 @@ unsigned long lowmem_size; EXPORT_SYMBOL(min_low_pfn); EXPORT_SYMBOL(max_low_pfn);
-#ifdef CONFIG_HIGHMEM -pte_t *kmap_pte; -EXPORT_SYMBOL(kmap_pte); - static void __init highmem_init(void) { pr_debug("%x\n", (u32)PKMAP_BASE); map_page(PKMAP_BASE, 0, 0); /* XXX gross */ pkmap_page_table = virt_to_kpte(PKMAP_BASE); - - kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN)); }
static void highmem_setup(void)
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Thomas Bogendoerfer tsbogend@alpha.franken.de Cc: linux-mips@vger.kernel.org --- Note: Completely untested --- arch/mips/Kconfig | 1 arch/mips/include/asm/highmem.h | 4 +- arch/mips/mm/highmem.c | 77 ---------------------------------------- arch/mips/mm/init.c | 3 - 4 files changed, 3 insertions(+), 82 deletions(-)
--- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -2654,6 +2654,7 @@ config MIPS_CRC_SUPPORT config HIGHMEM bool "High Memory Support" depends on 32BIT && CPU_SUPPORTS_HIGHMEM && SYS_SUPPORTS_HIGHMEM && !CPU_MIPS32_3_5_EVA + select KMAP_ATOMIC_GENERIC
config CPU_SUPPORTS_HIGHMEM bool --- a/arch/mips/include/asm/highmem.h +++ b/arch/mips/include/asm/highmem.h @@ -48,11 +48,11 @@ extern pte_t *pkmap_page_table;
#define ARCH_HAS_KMAP_FLUSH_TLB extern void kmap_flush_tlb(unsigned long addr); -extern void *kmap_atomic_pfn(unsigned long pfn);
#define flush_cache_kmaps() BUG_ON(cpu_has_dc_aliases)
-extern void kmap_init(void); +#define arch_kmap_temp_post_map(vaddr, pteval) local_flush_tlb_one(vaddr) +#define arch_kmap_temp_post_unmap(vaddr) local_flush_tlb_one(vaddr)
#endif /* __KERNEL__ */
--- a/arch/mips/mm/highmem.c +++ b/arch/mips/mm/highmem.c @@ -8,8 +8,6 @@ #include <asm/fixmap.h> #include <asm/tlbflush.h>
-static pte_t *kmap_pte; - unsigned long highstart_pfn, highend_pfn;
void kmap_flush_tlb(unsigned long addr) @@ -17,78 +15,3 @@ void kmap_flush_tlb(unsigned long addr) flush_tlb_one(addr); } EXPORT_SYMBOL(kmap_flush_tlb); - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned long vaddr; - int idx, type; - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(!pte_none(*(kmap_pte - idx))); -#endif - set_pte(kmap_pte-idx, mk_pte(page, prot)); - local_flush_tlb_one((unsigned long)vaddr); - - return (void*) vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - int type __maybe_unused; - - if (vaddr < FIXADDR_START) - return; - - type = kmap_atomic_idx(); -#ifdef CONFIG_DEBUG_HIGHMEM - { - int idx = type + KM_TYPE_NR * smp_processor_id(); - - BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); - - /* - * force other mappings to Oops if they'll try to access - * this pte without first remap it - */ - pte_clear(&init_mm, vaddr, kmap_pte-idx); - local_flush_tlb_one(vaddr); - } -#endif - kmap_atomic_idx_pop(); -} -EXPORT_SYMBOL(kunmap_atomic_high); - -/* - * This is the same as kmap_atomic() but can map memory that doesn't - * have a struct page associated with it. - */ -void *kmap_atomic_pfn(unsigned long pfn) -{ - unsigned long vaddr; - int idx, type; - - preempt_disable(); - pagefault_disable(); - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - set_pte(kmap_pte-idx, pfn_pte(pfn, PAGE_KERNEL)); - flush_tlb_one(vaddr); - - return (void*) vaddr; -} - -void __init kmap_init(void) -{ - unsigned long kmap_vstart; - - /* cache the first kmap pte */ - kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN); - kmap_pte = virt_to_kpte(kmap_vstart); -} --- a/arch/mips/mm/init.c +++ b/arch/mips/mm/init.c @@ -402,9 +402,6 @@ void __init paging_init(void)
pagetable_init();
-#ifdef CONFIG_HIGHMEM - kmap_init(); -#endif #ifdef CONFIG_ZONE_DMA max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN; #endif
The mapping code is odd and looks broken. See FIXME in the comment.
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Nick Hu nickhu@andestech.com Cc: Greentime Hu green.hu@gmail.com Cc: Vincent Chen deanbo422@gmail.com --- Note: Completely untested --- arch/nds32/Kconfig.cpu | 1 arch/nds32/include/asm/highmem.h | 21 +++++++++++++---- arch/nds32/mm/Makefile | 1 arch/nds32/mm/highmem.c | 48 --------------------------------------- 4 files changed, 17 insertions(+), 54 deletions(-)
--- a/arch/nds32/Kconfig.cpu +++ b/arch/nds32/Kconfig.cpu @@ -157,6 +157,7 @@ config HW_SUPPORT_UNALIGNMENT_ACCESS config HIGHMEM bool "High Memory Support" depends on MMU && !CPU_CACHE_ALIASING + select KMAP_ATOMIC_GENERIC help The address space of Andes processors is only 4 Gigabytes large and it has to accommodate user address space, kernel address --- a/arch/nds32/include/asm/highmem.h +++ b/arch/nds32/include/asm/highmem.h @@ -45,11 +45,22 @@ extern pte_t *pkmap_page_table; extern void kmap_init(void);
/* - * The following functions are already defined by <linux/highmem.h> - * when CONFIG_HIGHMEM is not set. + * FIXME: The below looks broken vs. a kmap_atomic() in task context which + * is interupted and another kmap_atomic() happens in interrupt context. + * But what do I know about nds32. -- tglx */ -#ifdef CONFIG_HIGHMEM -extern void *kmap_atomic_pfn(unsigned long pfn); -#endif +#define arch_kmap_temp_post_map(vaddr, pteval) \ + do { \ + __nds32__tlbop_inv(vaddr); \ + __nds32__mtsr_dsb(vaddr, NDS32_SR_TLB_VPN); \ + __nds32__tlbop_rwr(pteval); \ + __nds32__isb(); \ + } while (0) + +#define arch_kmap_temp_pre_unmap(vaddr, pte) \ + do { \ + __nds32__tlbop_inv(vaddr); \ + __nds32__isb(); \ + } while (0)
#endif --- a/arch/nds32/mm/Makefile +++ b/arch/nds32/mm/Makefile @@ -3,7 +3,6 @@ obj-y := extable.o tlb.o fault.o init mm-nds32.o cacheflush.o proc.o
obj-$(CONFIG_ALIGNMENT_TRAP) += alignment.o -obj-$(CONFIG_HIGHMEM) += highmem.o
ifdef CONFIG_FUNCTION_TRACER CFLAGS_REMOVE_proc.o = $(CC_FLAGS_FTRACE) --- a/arch/nds32/mm/highmem.c +++ /dev/null @@ -1,48 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -// Copyright (C) 2005-2017 Andes Technology Corporation - -#include <linux/export.h> -#include <linux/highmem.h> -#include <linux/sched.h> -#include <linux/smp.h> -#include <linux/interrupt.h> -#include <linux/memblock.h> -#include <asm/fixmap.h> -#include <asm/tlbflush.h> - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned int idx; - unsigned long vaddr, pte; - int type; - pte_t *ptep; - - type = kmap_atomic_idx_push(); - - idx = type + KM_TYPE_NR * smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - pte = (page_to_pfn(page) << PAGE_SHIFT) | prot; - ptep = pte_offset_kernel(pmd_off_k(vaddr), vaddr); - set_pte(ptep, pte); - - __nds32__tlbop_inv(vaddr); - __nds32__mtsr_dsb(vaddr, NDS32_SR_TLB_VPN); - __nds32__tlbop_rwr(pte); - __nds32__isb(); - return (void *)vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - if (kvaddr >= (void *)FIXADDR_START) { - unsigned long vaddr = (unsigned long)kvaddr; - pte_t *ptep; - kmap_atomic_idx_pop(); - __nds32__tlbop_inv(vaddr); - __nds32__isb(); - ptep = pte_offset_kernel(pmd_off_k(vaddr), vaddr); - set_pte(ptep, 0); - } -} -EXPORT_SYMBOL(kunmap_atomic_high);
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Michael Ellerman mpe@ellerman.id.au Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Paul Mackerras paulus@samba.org Cc: linuxppc-dev@lists.ozlabs.org --- Note: Completely untested --- arch/powerpc/Kconfig | 1 arch/powerpc/include/asm/highmem.h | 6 ++- arch/powerpc/mm/Makefile | 1 arch/powerpc/mm/highmem.c | 67 ------------------------------------- arch/powerpc/mm/mem.c | 7 --- 5 files changed, 6 insertions(+), 76 deletions(-)
--- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -406,6 +406,7 @@ menu "Kernel options" config HIGHMEM bool "High memory support" depends on PPC32 + select KMAP_ATOMIC_GENERIC
source "kernel/Kconfig.hz"
--- a/arch/powerpc/include/asm/highmem.h +++ b/arch/powerpc/include/asm/highmem.h @@ -29,7 +29,6 @@ #include <asm/page.h> #include <asm/fixmap.h>
-extern pte_t *kmap_pte; extern pte_t *pkmap_page_table;
/* @@ -60,6 +59,11 @@ extern pte_t *pkmap_page_table;
#define flush_cache_kmaps() flush_cache_all()
+#define arch_kmap_temp_post_map(vaddr, pteval) \ + local_flush_tlb_page(NULL, vaddr) +#define arch_kmap_temp_post_unmap(vaddr) \ + local_flush_tlb_page(NULL, vaddr) + #endif /* __KERNEL__ */
#endif /* _ASM_HIGHMEM_H */ --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -16,7 +16,6 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += num obj-$(CONFIG_PPC_MM_SLICES) += slice.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o -obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o obj-$(CONFIG_PPC_PTDUMP) += ptdump/ obj-$(CONFIG_KASAN) += kasan/ --- a/arch/powerpc/mm/highmem.c +++ /dev/null @@ -1,67 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * highmem.c: virtual kernel memory mappings for high memory - * - * PowerPC version, stolen from the i386 version. - * - * Used in CONFIG_HIGHMEM systems for memory pages which - * are not addressable by direct kernel virtual addresses. - * - * Copyright (C) 1999 Gerhard Wichert, Siemens AG - * Gerhard.Wichert@pdb.siemens.de - * - * - * Redesigned the x86 32-bit VM architecture to deal with - * up to 16 Terrabyte physical memory. With current x86 CPUs - * we now support up to 64 Gigabytes physical RAM. - * - * Copyright (C) 1999 Ingo Molnar mingo@redhat.com - * - * Reworked for PowerPC by various contributors. Moved from - * highmem.h by Benjamin Herrenschmidt (c) 2009 IBM Corp. - */ - -#include <linux/highmem.h> -#include <linux/module.h> - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned long vaddr; - int idx, type; - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - WARN_ON(IS_ENABLED(CONFIG_DEBUG_HIGHMEM) && !pte_none(*(kmap_pte - idx))); - __set_pte_at(&init_mm, vaddr, kmap_pte-idx, mk_pte(page, prot), 1); - local_flush_tlb_page(NULL, vaddr); - - return (void*) vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - - if (vaddr < __fix_to_virt(FIX_KMAP_END)) - return; - - if (IS_ENABLED(CONFIG_DEBUG_HIGHMEM)) { - int type = kmap_atomic_idx(); - unsigned int idx; - - idx = type + KM_TYPE_NR * smp_processor_id(); - WARN_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx)); - - /* - * force other mappings to Oops if they'll try to access - * this pte without first remap it - */ - pte_clear(&init_mm, vaddr, kmap_pte-idx); - local_flush_tlb_page(NULL, vaddr); - } - - kmap_atomic_idx_pop(); -} -EXPORT_SYMBOL(kunmap_atomic_high); --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -60,11 +60,6 @@ unsigned long long memory_limit; bool init_mem_is_free;
-#ifdef CONFIG_HIGHMEM -pte_t *kmap_pte; -EXPORT_SYMBOL(kmap_pte); -#endif - pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot) { @@ -233,8 +228,6 @@ void __init paging_init(void)
map_kernel_page(PKMAP_BASE, 0, __pgprot(0)); /* XXX gross */ pkmap_page_table = virt_to_kpte(PKMAP_BASE); - - kmap_pte = virt_to_kpte(__fix_to_virt(FIX_KMAP_BEGIN)); #endif /* CONFIG_HIGHMEM */
printk(KERN_DEBUG "Top of RAM: 0x%llx, Total RAM: 0x%llx\n",
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: "David S. Miller" davem@davemloft.net Cc: sparclinux@vger.kernel.org --- Note: Completely untested --- arch/sparc/Kconfig | 1 arch/sparc/include/asm/highmem.h | 7 +- arch/sparc/mm/Makefile | 3 - arch/sparc/mm/highmem.c | 115 --------------------------------------- arch/sparc/mm/srmmu.c | 2 5 files changed, 6 insertions(+), 122 deletions(-)
--- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -137,6 +137,7 @@ config MMU config HIGHMEM bool default y if SPARC32 + select KMAP_ATOMIC_GENERIC
config ZONE_DMA bool --- a/arch/sparc/include/asm/highmem.h +++ b/arch/sparc/include/asm/highmem.h @@ -33,8 +33,6 @@ extern unsigned long highstart_pfn, high #define kmap_prot __pgprot(SRMMU_ET_PTE | SRMMU_PRIV | SRMMU_CACHE) extern pte_t *pkmap_page_table;
-void kmap_init(void) __init; - /* * Right now we initialize only a single pte table. It can be extended * easily, subsequent pte tables have to be allocated in one physical @@ -53,6 +51,11 @@ void kmap_init(void) __init;
#define flush_cache_kmaps() flush_cache_all()
+/* FIXME: Use __flush_tlb_one(vaddr) instead of flush_cache_all() -- Anton */ +#define arch_kmap_temp_post_map(vaddr, pteval) flush_cache_all() +#define arch_kmap_temp_post_unmap(vaddr) flush_cache_all() + + #endif /* __KERNEL__ */
#endif /* _ASM_HIGHMEM_H */ --- a/arch/sparc/mm/Makefile +++ b/arch/sparc/mm/Makefile @@ -15,6 +15,3 @@ obj-$(CONFIG_SPARC32) += leon_mm.o
# Only used by sparc64 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o - -# Only used by sparc32 -obj-$(CONFIG_HIGHMEM) += highmem.o --- a/arch/sparc/mm/highmem.c +++ /dev/null @@ -1,115 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * highmem.c: virtual kernel memory mappings for high memory - * - * Provides kernel-static versions of atomic kmap functions originally - * found as inlines in include/asm-sparc/highmem.h. These became - * needed as kmap_atomic() and kunmap_atomic() started getting - * called from within modules. - * -- Tomas Szepe szepe@pinerecords.com, September 2002 - * - * But kmap_atomic() and kunmap_atomic() cannot be inlined in - * modules because they are loaded with btfixup-ped functions. - */ - -/* - * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap - * gives a more generic (and caching) interface. But kmap_atomic can - * be used in IRQ contexts, so in some (very limited) cases we need it. - * - * XXX This is an old text. Actually, it's good to use atomic kmaps, - * provided you remember that they are atomic and not try to sleep - * with a kmap taken, much like a spinlock. Non-atomic kmaps are - * shared by CPUs, and so precious, and establishing them requires IPI. - * Atomic kmaps are lightweight and we may have NCPUS more of them. - */ -#include <linux/highmem.h> -#include <linux/export.h> -#include <linux/mm.h> - -#include <asm/cacheflush.h> -#include <asm/tlbflush.h> -#include <asm/vaddrs.h> - -static pte_t *kmap_pte; - -void __init kmap_init(void) -{ - unsigned long address = __fix_to_virt(FIX_KMAP_BEGIN); - - /* cache the first kmap pte */ - kmap_pte = virt_to_kpte(address); -} - -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) -{ - unsigned long vaddr; - long idx, type; - - type = kmap_atomic_idx_push(); - idx = type + KM_TYPE_NR*smp_processor_id(); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - -/* XXX Fix - Anton */ -#if 0 - __flush_cache_one(vaddr); -#else - flush_cache_all(); -#endif - -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(!pte_none(*(kmap_pte-idx))); -#endif - set_pte(kmap_pte-idx, mk_pte(page, prot)); -/* XXX Fix - Anton */ -#if 0 - __flush_tlb_one(vaddr); -#else - flush_tlb_all(); -#endif - - return (void*) vaddr; -} -EXPORT_SYMBOL(kmap_atomic_high_prot); - -void kunmap_atomic_high(void *kvaddr) -{ - unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; - int type; - - if (vaddr < FIXADDR_START) - return; - - type = kmap_atomic_idx(); - -#ifdef CONFIG_DEBUG_HIGHMEM - { - unsigned long idx; - - idx = type + KM_TYPE_NR * smp_processor_id(); - BUG_ON(vaddr != __fix_to_virt(FIX_KMAP_BEGIN+idx)); - - /* XXX Fix - Anton */ -#if 0 - __flush_cache_one(vaddr); -#else - flush_cache_all(); -#endif - - /* - * force other mappings to Oops if they'll try to access - * this pte without first remap it - */ - pte_clear(&init_mm, vaddr, kmap_pte-idx); - /* XXX Fix - Anton */ -#if 0 - __flush_tlb_one(vaddr); -#else - flush_tlb_all(); -#endif - } -#endif - - kmap_atomic_idx_pop(); -} -EXPORT_SYMBOL(kunmap_atomic_high); --- a/arch/sparc/mm/srmmu.c +++ b/arch/sparc/mm/srmmu.c @@ -971,8 +971,6 @@ void __init srmmu_paging_init(void)
sparc_context_init(num_contexts);
- kmap_init(); - { unsigned long max_zone_pfn[MAX_NR_ZONES] = { 0 };
Signed-off-by: Thomas Gleixner tglx@linutronix.de Cc: Chris Zankel chris@zankel.net Cc: Max Filippov jcmvbkbc@gmail.com Cc: linux-xtensa@linux-xtensa.org --- Note: Completely untested --- arch/xtensa/Kconfig | 1 arch/xtensa/include/asm/highmem.h | 9 +++++++ arch/xtensa/mm/highmem.c | 44 +++----------------------------------- 3 files changed, 14 insertions(+), 40 deletions(-)
--- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -679,6 +679,7 @@ endchoice config HIGHMEM bool "High Memory Support" depends on MMU + select KMAP_ATOMIC_GENERIC help Linux can use the full amount of RAM in the system by default. However, the default MMUv2 setup only maps the --- a/arch/xtensa/include/asm/highmem.h +++ b/arch/xtensa/include/asm/highmem.h @@ -68,6 +68,15 @@ static inline void flush_cache_kmaps(voi flush_cache_all(); }
+enum fixed_addresses kmap_temp_map_idx(int type, unsigned long pfn); +#define arch_kmap_temp_map_idx kmap_temp_map_idx + +enum fixed_addresses kmap_temp_unmap_idx(int type, unsigned long addr); +#define arch_kmap_temp_unmap_idx kmap_temp_unmap_idx + +#define arch_kmap_temp_post_unmap(vaddr) \ + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE) + void kmap_init(void);
#endif --- a/arch/xtensa/mm/highmem.c +++ b/arch/xtensa/mm/highmem.c @@ -12,8 +12,6 @@ #include <linux/highmem.h> #include <asm/tlbflush.h>
-static pte_t *kmap_pte; - #if DCACHE_WAY_SIZE > PAGE_SIZE unsigned int last_pkmap_nr_arr[DCACHE_N_COLORS]; wait_queue_head_t pkmap_map_wait_arr[DCACHE_N_COLORS]; @@ -37,55 +35,21 @@ static inline enum fixed_addresses kmap_ color; }
-void *kmap_atomic_high_prot(struct page *page, pgprot_t prot) +enum fixed_addresses kmap_temp_map_idx(int type, unsigned long pfn) { - enum fixed_addresses idx; - unsigned long vaddr; - - idx = kmap_idx(kmap_atomic_idx_push(), - DCACHE_ALIAS(page_to_phys(page))); - vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); -#ifdef CONFIG_DEBUG_HIGHMEM - BUG_ON(!pte_none(*(kmap_pte + idx))); -#endif - set_pte(kmap_pte + idx, mk_pte(page, prot)); - - return (void *)vaddr; + return kmap_idx(type, DCACHE_ALIAS(pfn << PAGE_SHIFT); } -EXPORT_SYMBOL(kmap_atomic_high_prot);
-void kunmap_atomic_high(void *kvaddr) +enum fixed_addresses kmap_temp_unmap_idx(int type, unsigned long addr) { - if (kvaddr >= (void *)FIXADDR_START && - kvaddr < (void *)FIXADDR_TOP) { - int idx = kmap_idx(kmap_atomic_idx(), - DCACHE_ALIAS((unsigned long)kvaddr)); - - /* - * Force other mappings to Oops if they'll try to access this - * pte without first remap it. Keeping stale mappings around - * is a bad idea also, in case the page changes cacheability - * attributes or becomes a protected page in a hypervisor. - */ - pte_clear(&init_mm, kvaddr, kmap_pte + idx); - local_flush_tlb_kernel_range((unsigned long)kvaddr, - (unsigned long)kvaddr + PAGE_SIZE); - - kmap_atomic_idx_pop(); - } + return kmap_idx(type, DCACHE_ALIAS(addr)); } -EXPORT_SYMBOL(kunmap_atomic_high);
void __init kmap_init(void) { - unsigned long kmap_vstart; - /* Check if this memory layout is broken because PKMAP overlaps * page table. */ BUILD_BUG_ON(PKMAP_BASE < TLBTEMP_BASE_1 + TLBTEMP_SIZE); - /* cache the first kmap pte */ - kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN); - kmap_pte = virt_to_kpte(kmap_vstart); kmap_waitqueues_init(); }
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- include/linux/highmem.h | 65 ++---------------------------------------------- mm/highmem.c | 28 +++++++++++++++++--- 2 files changed, 28 insertions(+), 65 deletions(-)
--- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -94,27 +94,6 @@ static inline void kunmap(struct page *p * be used in IRQ contexts, so in some (very limited) cases we need * it. */ - -#ifndef CONFIG_KMAP_ATOMIC_GENERIC -void *kmap_atomic_high_prot(struct page *page, pgprot_t prot); -void kunmap_atomic_high(void *kvaddr); - -static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) -{ - preempt_disable(); - pagefault_disable(); - if (!PageHighMem(page)) - return page_address(page); - return kmap_atomic_high_prot(page, prot); -} - -static inline void __kunmap_atomic(void *vaddr) -{ - kunmap_atomic_high(vaddr); - pagefault_enable(); -} -#else /* !CONFIG_KMAP_ATOMIC_GENERIC */ - static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) { preempt_disable(); @@ -127,17 +106,14 @@ static inline void *kmap_atomic_pfn(unsi return kmap_atomic_pfn_prot(pfn, kmap_prot); }
-static inline void __kunmap_atomic(void *addr) +static inline void *kmap_atomic(struct page *page) { - kumap_atomic_indexed(addr); + return kmap_atomic_prot(page, kmap_prot); }
- -#endif /* CONFIG_KMAP_ATOMIC_GENERIC */ - -static inline void *kmap_atomic(struct page *page) +static inline void __kunmap_atomic(void *addr) { - return kmap_atomic_prot(page, kmap_prot); + kumap_atomic_indexed(addr); }
/* declarations for linux/mm/highmem.c */ @@ -233,39 +209,6 @@ static inline void __kunmap_atomic(void
#endif /* CONFIG_HIGHMEM */
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32) - -DECLARE_PER_CPU(int, __kmap_atomic_idx); - -static inline int kmap_atomic_idx_push(void) -{ - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; - -#ifdef CONFIG_DEBUG_HIGHMEM - WARN_ON_ONCE(in_irq() && !irqs_disabled()); - BUG_ON(idx >= KM_TYPE_NR); -#endif - return idx; -} - -static inline int kmap_atomic_idx(void) -{ - return __this_cpu_read(__kmap_atomic_idx) - 1; -} - -static inline void kmap_atomic_idx_pop(void) -{ -#ifdef CONFIG_DEBUG_HIGHMEM - int idx = __this_cpu_dec_return(__kmap_atomic_idx); - - BUG_ON(idx < 0); -#else - __this_cpu_dec(__kmap_atomic_idx); -#endif -} - -#endif - /* * Prevent people trying to call kunmap_atomic() as if it were kunmap() * kunmap_atomic() should get the return value of kmap_atomic, not the page. --- a/mm/highmem.c +++ b/mm/highmem.c @@ -31,10 +31,6 @@ #include <asm/tlbflush.h> #include <linux/vmalloc.h>
-#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32) -DEFINE_PER_CPU(int, __kmap_atomic_idx); -#endif - /* * Virtual_count is not a pure "count". * 0 means that it is not mapped, and has not been mapped @@ -380,6 +376,30 @@ static inline void kmap_high_unmap_tempo #endif /* CONFIG_HIGHMEM */
#ifdef CONFIG_KMAP_ATOMIC_GENERIC + +static DEFINE_PER_CPU(int, __kmap_atomic_idx); + +static inline int kmap_atomic_idx_push(void) +{ + int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + + WARN_ON_ONCE(in_irq() && !irqs_disabled()); + BUG_ON(idx >= KM_TYPE_NR); + return idx; +} + +static inline int kmap_atomic_idx(void) +{ + return __this_cpu_read(__kmap_atomic_idx) - 1; +} + +static inline void kmap_atomic_idx_pop(void) +{ + int idx = __this_cpu_dec_return(__kmap_atomic_idx); + + BUG_ON(idx < 0); +} + #ifndef arch_kmap_temp_post_map # define arch_kmap_temp_post_map(vaddr, pteval) do { } while (0) #endif
Instead of storing the map per CPU provide and use per task storage. That prepares for temporary kmaps which are preemptible.
The context switch code is preparatory and not yet in use because kmap_atomic() runs with preemption disabled. Will be made usable in the next step.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- include/linux/highmem.h | 1 include/linux/sched.h | 9 +++++++ kernel/sched/core.c | 10 ++++++++ mm/highmem.c | 59 ++++++++++++++++++++++++++++++++++++++++++------ 4 files changed, 72 insertions(+), 7 deletions(-)
--- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -38,6 +38,7 @@ static inline void invalidate_kernel_vma void *kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot); void *kmap_atomic_page_prot(struct page *page, pgprot_t prot); void kunmap_atomic_indexed(void *vaddr); +void kmap_switch_temporary(struct task_struct *prev, struct task_struct *next); # ifndef ARCH_NEEDS_KMAP_HIGH_GET static inline void *arch_kmap_temporary_high_get(struct page *page) { --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -34,6 +34,7 @@ #include <linux/rseq.h> #include <linux/seqlock.h> #include <linux/kcsan.h> +#include <asm/kmap_types.h>
/* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; @@ -628,6 +629,13 @@ struct wake_q_node { struct wake_q_node *next; };
+struct kmap_ctrl { +#ifdef CONFIG_KMAP_ATOMIC_GENERIC + int idx; + pte_t pteval[KM_TYPE_NR]; +#endif +}; + struct task_struct { #ifdef CONFIG_THREAD_INFO_IN_TASK /* @@ -1280,6 +1288,7 @@ struct task_struct { unsigned int sequential_io; unsigned int sequential_io_avg; #endif + struct kmap_ctrl kmap_ctrl; #ifdef CONFIG_DEBUG_ATOMIC_SLEEP unsigned long task_state_change; #endif --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3529,6 +3529,15 @@ static inline void finish_lock_switch(st # define finish_arch_post_lock_switch() do { } while (0) #endif
+static inline void kmap_temp_switch(struct task_struct *prev, + struct task_struct *next) +{ +#ifdef CONFIG_HIGHMEM + if (unlikely(prev->kmap_ctrl.idx || next->kmap_ctrl.idx)) + kmap_switch_temporary(prev, next); +#endif +} + /** * prepare_task_switch - prepare to switch tasks * @rq: the runqueue preparing to switch @@ -3551,6 +3560,7 @@ prepare_task_switch(struct rq *rq, struc perf_event_task_sched_out(prev, next); rseq_preempt(prev); fire_sched_out_preempt_notifiers(prev, next); + kmap_temp_switch(prev, next); prepare_task(next); prepare_arch_switch(next); } --- a/mm/highmem.c +++ b/mm/highmem.c @@ -370,6 +370,7 @@ void kunmap_high(struct page *page) if (need_wakeup) wake_up(pkmap_map_wait); } + EXPORT_SYMBOL(kunmap_high); #else static inline void kmap_high_unmap_temporary(unsigned long vaddr) { } @@ -377,11 +378,9 @@ static inline void kmap_high_unmap_tempo
#ifdef CONFIG_KMAP_ATOMIC_GENERIC
-static DEFINE_PER_CPU(int, __kmap_atomic_idx); - static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++;
WARN_ON_ONCE(in_irq() && !irqs_disabled()); BUG_ON(idx >= KM_TYPE_NR); @@ -390,14 +389,13 @@ static inline int kmap_atomic_idx_push(v
static inline int kmap_atomic_idx(void) { - return __this_cpu_read(__kmap_atomic_idx) - 1; + return current->kmap_ctrl.idx - 1; }
static inline void kmap_atomic_idx_pop(void) { - int idx = __this_cpu_dec_return(__kmap_atomic_idx); - - BUG_ON(idx < 0); + current->kmap_ctrl.idx--; + BUG_ON(current->kmap_ctrl.idx < 0); }
#ifndef arch_kmap_temp_post_map @@ -447,6 +445,7 @@ static void *__kmap_atomic_pfn_prot(unsi pteval = pfn_pte(pfn, prot); set_pte(kmap_pte - idx, pteval); arch_kmap_temp_post_map(vaddr, pteval); + current->kmap_ctrl.pteval[kmap_atomic_idx()] = pteval; preempt_enable();
return (void *)vaddr; @@ -499,11 +498,57 @@ void kunmap_atomic_indexed(void *vaddr) arch_kmap_temp_pre_unmap(addr); pte_clear(&init_mm, addr, kmap_pte - idx); arch_kmap_temp_post_unmap(addr); + current->kmap_ctrl.pteval[kmap_atomic_idx()] = __pte(0); kmap_atomic_idx_pop(); preempt_enable(); pagefault_enable(); } EXPORT_SYMBOL(kunmap_atomic_indexed); + +void kmap_switch_temporary(struct task_struct *prev, struct task_struct *next) +{ + pte_t *kmap_pte = kmap_get_pte(); + int i; + + /* Clear @prev's kmaps */ + for (i = 0; i < prev->kmap_ctrl.idx; i++) { + pte_t pteval = prev->kmap_ctrl.pteval[i]; + unsigned long addr; + int idx; + + if (WARN_ON_ONCE(pte_none(pteval))) + continue; + + /* + * This is a horrible hack for XTENSA to calculate the + * coloured PTE index. Uses the PFN encoded into the pteval + * and the map index calculation because the actual mapped + * virtual address is not stored in task::kmap_ctrl. + * + * For any sane architecture that address calculation is + * optimized out. + */ + idx = arch_kmap_temp_map_idx(i, pte_pfn(pteval)); + + arch_kmap_temp_pre_unmap(addr); + pte_clear(&init_mm, addr, kmap_pte - idx); + arch_kmap_temp_post_unmap(addr); + } + + /* Restore @next's kmaps */ + for (i = 0; i < next->kmap_ctrl.idx; i++) { + pte_t pteval = next->kmap_ctrl.pteval[i]; + int idx; + + if (WARN_ON_ONCE(pte_none(pteval))) + continue; + + idx = arch_kmap_temp_map_idx(i, pte_pfn(pteval)); + set_pte(kmap_pte - idx, pteval); + arch_kmap_temp_post_map(addr, pteval); + } +} + #endif
#if defined(HASHED_PAGE_VIRTUAL)
Now that the kmap atomic index is stored in task struct provide a preemptible variant. On context switch the maps of an outgoing task are removed and the map of the incoming task are restored. That's obviously slow, but highmem is slow anyway.
The kmap_temporary and iomap_temporary interfaces can be invoked from both preemptible and atomic context.
A wholesale conversion of kmap_atomic to be fully preemptible is not possible because some of the usage sites might rely on the preemption disable for serialization or per CPUness. Needs to be done on a case by case basis.
Signed-off-by: Thomas Gleixner tglx@linutronix.de --- arch/x86/include/asm/iomap.h | 16 ++++++++- arch/x86/mm/iomap_32.c | 7 +--- include/linux/highmem.h | 70 +++++++++++++++++++++++++++++++++---------- mm/highmem.c | 18 +++++------ 4 files changed, 80 insertions(+), 31 deletions(-)
--- a/arch/x86/include/asm/iomap.h +++ b/arch/x86/include/asm/iomap.h @@ -13,11 +13,23 @@ #include <asm/cacheflush.h> #include <asm/tlbflush.h>
-void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot); +void __iomem *iomap_temporary_pfn_prot(unsigned long pfn, pgprot_t prot); + +static inline void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, + pgprot_t prot) +{ + preempt_disable(); + return iomap_temporary_pfn_prot(pfn, prot); +} + +static inline void iounmap_temporary(void __iomem *vaddr) +{ + kunmap_temporary_indexed((void __force *)vaddr); +}
static inline void iounmap_atomic(void __iomem *vaddr) { - kunmap_atomic_indexed((void __force *)vaddr); + iounmap_temporary(vaddr); preempt_enable(); }
--- a/arch/x86/mm/iomap_32.c +++ b/arch/x86/mm/iomap_32.c @@ -44,7 +44,7 @@ void iomap_free(resource_size_t base, un } EXPORT_SYMBOL_GPL(iomap_free);
-void __iomem *iomap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot) +void __iomem *iomap_temporary_pfn_prot(unsigned long pfn, pgprot_t prot) { /* * For non-PAT systems, translate non-WB request to UC- just in @@ -60,7 +60,6 @@ void __iomem *iomap_atomic_pfn_prot(unsi /* Filter out unsupported __PAGE_KERNEL* bits: */ pgprot_val(prot) &= __default_kernel_pte_mask;
- preempt_disable(); - return (void __force __iomem *)kmap_atomic_pfn_prot(pfn, prot); + return (void __force __iomem *)__kmap_temporary_pfn_prot(pfn, prot); } -EXPORT_SYMBOL_GPL(iomap_atomic_pfn_prot); +EXPORT_SYMBOL_GPL(iomap_temporary_pfn_prot); --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -35,9 +35,9 @@ static inline void invalidate_kernel_vma * Outside of CONFIG_HIGHMEM to support X86 32bit iomap_atomic() cruft. */ #ifdef CONFIG_KMAP_ATOMIC_GENERIC -void *kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot); -void *kmap_atomic_page_prot(struct page *page, pgprot_t prot); -void kunmap_atomic_indexed(void *vaddr); +void *__kmap_temporary_pfn_prot(unsigned long pfn, pgprot_t prot); +void *__kmap_temporary_page_prot(struct page *page, pgprot_t prot); +void kunmap_temporary_indexed(void *vaddr); void kmap_switch_temporary(struct task_struct *prev, struct task_struct *next); # ifndef ARCH_NEEDS_KMAP_HIGH_GET static inline void *arch_kmap_temporary_high_get(struct page *page) @@ -95,16 +95,35 @@ static inline void kunmap(struct page *p * be used in IRQ contexts, so in some (very limited) cases we need * it. */ -static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) +static inline void *kmap_temporary_page_prot(struct page *page, pgprot_t prot) { - preempt_disable(); - return kmap_atomic_page_prot(page, prot); + return __kmap_temporary_page_prot(page, prot); }
-static inline void *kmap_atomic_pfn(unsigned long pfn) +static inline void *kmap_temporary_page(struct page *page) +{ + return kmap_temporary_page_prot(page, kmap_prot); +} + +static inline void *kmap_temporary_pfn_prot(unsigned long pfn, pgprot_t prot) +{ + return __kmap_temporary_pfn_prot(pfn, prot); +} + +static inline void *kmap_temporary_pfn(unsigned long pfn) +{ + return kmap_temporary_pfn_prot(pfn, kmap_prot); +} + +static inline void __kunmap_temporary(void *vaddr) +{ + kunmap_temporary_indexed(vaddr); +} + +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) { preempt_disable(); - return kmap_atomic_pfn_prot(pfn, kmap_prot); + return kmap_temporary_page_prot(page, prot); }
static inline void *kmap_atomic(struct page *page) @@ -112,9 +131,10 @@ static inline void *kmap_atomic(struct p return kmap_atomic_prot(page, kmap_prot); }
-static inline void __kunmap_atomic(void *addr) +static inline void *kmap_atomic_pfn(unsigned long pfn) { - kumap_atomic_indexed(addr); + preempt_disable(); + return kmap_temporary_pfn_prot(pfn, kmap_prot); }
/* declarations for linux/mm/highmem.c */ @@ -177,6 +197,22 @@ static inline void kunmap(struct page *p #endif }
+static inline void *kmap_temporary_page(struct page *page) +{ + pagefault_disable(); + return page_address(page); +} + +static inline void *kmap_temporary_page_prot(struct page *page, pgprot_t prot) +{ + return kmap_temporary_page(page); +} + +static inline void *kmap_temporary_pfn(unsigned long pfn) +{ + return kmap_temporary_page(pfn_to_page(pfn)); +} + static inline void *kmap_atomic(struct page *page) { preempt_disable(); @@ -194,12 +230,8 @@ static inline void *kmap_atomic_pfn(unsi return kmap_atomic(pfn_to_page(pfn)); }
-static inline void __kunmap_atomic(void *addr) +static inline void __kunmap_temporary(void *addr) { - /* - * Mostly nothing to do in the CONFIG_HIGHMEM=n case as kunmap_atomic() - * handles preemption - */ #ifdef ARCH_HAS_FLUSH_ON_KUNMAP kunmap_flush_on_unmap(addr); #endif @@ -217,10 +249,16 @@ static inline void __kunmap_atomic(void #define kunmap_atomic(addr) \ do { \ BUILD_BUG_ON(__same_type((addr), struct page *)); \ - __kunmap_atomic(addr); \ + __kunmap_temporary(addr); \ preempt_enable(); \ } while (0)
+#define kunmap_temporary(addr) \ + do { \ + BUILD_BUG_ON(__same_type((addr), struct page *)); \ + __kunmap_temporary(addr); \ + } while (0) + /* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ #ifndef clear_user_highpage static inline void clear_user_highpage(struct page *page, unsigned long vaddr) --- a/mm/highmem.c +++ b/mm/highmem.c @@ -432,7 +432,7 @@ static pte_t *kmap_get_pte(void) return __kmap_pte; }
-static void *__kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot) +static void *do_kmap_temporary_pfn_prot(unsigned long pfn, pgprot_t prot) { pte_t pteval, *kmap_pte = kmap_get_pte(); unsigned long vaddr; @@ -451,14 +451,14 @@ static void *__kmap_atomic_pfn_prot(unsi return (void *)vaddr; }
-void *kmap_atomic_pfn_prot(unsigned long pfn, pgprot_t prot) +void *__kmap_temporary_pfn_prot(unsigned long pfn, pgprot_t prot) { pagefault_disable(); - return __kmap_atomic_pfn_prot(pfn, prot); + return do_kmap_temporary_pfn_prot(pfn, prot); } -EXPORT_SYMBOL(kmap_atomic_pfn_prot); +EXPORT_SYMBOL(__kmap_temporary_pfn_prot);
-void *kmap_atomic_page_prot(struct page *page, pgprot_t prot) +void *__kmap_temporary_page_prot(struct page *page, pgprot_t prot) { void *kmap;
@@ -471,11 +471,11 @@ void *kmap_atomic_page_prot(struct page if (kmap) return kmap;
- return __kmap_atomic_pfn_prot(page_to_pfn(page), prot); + return do_kmap_temporary_pfn_prot(page_to_pfn(page), prot); } -EXPORT_SYMBOL(kmap_atomic_page_prot); +EXPORT_SYMBOL(__kmap_temporary_page_prot);
-void kunmap_atomic_indexed(void *vaddr) +void kunmap_temporary_indexed(void *vaddr) { unsigned long addr = (unsigned long) vaddr & PAGE_MASK; pte_t *kmap_pte = kmap_get_pte(); @@ -503,7 +503,7 @@ void kunmap_atomic_indexed(void *vaddr) preempt_enable(); pagefault_enable(); } -EXPORT_SYMBOL(kunmap_atomic_indexed); +EXPORT_SYMBOL(kunmap_temporary_indexed);
void kmap_switch_temporary(struct task_struct *prev, struct task_struct *next) {
On Sat, Sep 19, 2020 at 11:50 AM Thomas Gleixner tglx@linutronix.de wrote:
First of all, sorry for the horribly big Cc list!
Following up to the discussion in:
https://lore.kernel.org/r/20200914204209.256266093@linutronix.de
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
Consolidating all kmap atomic implementations in generic code
Switching from per CPU storage of the kmap index to a per task storage
Adding a pteval array to the per task storage which contains the ptevals of the currently active temporary kmaps
Adding context switch code which checks whether the outgoing or the incoming task has active temporary kmaps. If so, the outgoing task's kmaps are removed and the incoming task's kmaps are restored.
Adding new interfaces k[un]map_temporary*() which are not disabling preemption and can be called from any context (except NMI).
Contrary to kmap() which provides preemptible and "persistant" mappings, these interfaces are meant to replace the temporary mappings provided by kmap_atomic*() today.
This allows to get rid of conditional mapping choices and allows to have preemptible short term mappings on 64bit which are today enforced to be non-preemptible due to the highmem constraints. It clearly puts overhead on the highmem users, but highmem is slow anyway.
This is not a wholesale conversion which makes kmap_atomic magically preemptible because there might be usage sites which rely on the implicit preempt disable. So this needs to be done on a case by case basis and the call sites converted to kmap_temporary.
Note, that this is only lightly tested on X86 and completely untested on all other architectures.
The lot is also available from
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem
I think it should be the case, but I want to double check: Will copy_*_user be allowed within a kmap_temporary section? This would allow us to ditch an absolute pile of slowpaths. -Daniel
Thanks,
tglx
a/arch/arm/mm/highmem.c | 121 --------------------- a/arch/microblaze/mm/highmem.c | 78 ------------- a/arch/nds32/mm/highmem.c | 48 -------- a/arch/powerpc/mm/highmem.c | 67 ----------- a/arch/sparc/mm/highmem.c | 115 -------------------- arch/arc/Kconfig | 1 arch/arc/include/asm/highmem.h | 8 + arch/arc/mm/highmem.c | 44 ------- arch/arm/Kconfig | 1 arch/arm/include/asm/highmem.h | 30 +++-- arch/arm/mm/Makefile | 1 arch/csky/Kconfig | 1 arch/csky/include/asm/highmem.h | 4 arch/csky/mm/highmem.c | 75 ------------- arch/microblaze/Kconfig | 1 arch/microblaze/include/asm/highmem.h | 6 - arch/microblaze/mm/Makefile | 1 arch/microblaze/mm/init.c | 6 - arch/mips/Kconfig | 1 arch/mips/include/asm/highmem.h | 4 arch/mips/mm/highmem.c | 77 ------------- arch/mips/mm/init.c | 3 arch/nds32/Kconfig.cpu | 1 arch/nds32/include/asm/highmem.h | 21 ++- arch/nds32/mm/Makefile | 1 arch/powerpc/Kconfig | 1 arch/powerpc/include/asm/highmem.h | 6 - arch/powerpc/mm/Makefile | 1 arch/powerpc/mm/mem.c | 7 - arch/sparc/Kconfig | 1 arch/sparc/include/asm/highmem.h | 7 - arch/sparc/mm/Makefile | 3 arch/sparc/mm/srmmu.c | 2 arch/x86/include/asm/fixmap.h | 1 arch/x86/include/asm/highmem.h | 12 +- arch/x86/include/asm/iomap.h | 29 +++-- arch/x86/mm/highmem_32.c | 59 ---------- arch/x86/mm/init_32.c | 15 -- arch/x86/mm/iomap_32.c | 57 ---------- arch/xtensa/Kconfig | 1 arch/xtensa/include/asm/highmem.h | 9 + arch/xtensa/mm/highmem.c | 44 ------- b/arch/x86/Kconfig | 3 include/linux/highmem.h | 141 +++++++++++++++--------- include/linux/io-mapping.h | 2 include/linux/sched.h | 9 + kernel/sched/core.c | 10 + mm/Kconfig | 3 mm/highmem.c | 192 ++++++++++++++++++++++++++++++++-- 49 files changed, 422 insertions(+), 909 deletions(-)
On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter daniel@ffwll.ch wrote:
On Sat, Sep 19, 2020 at 11:50 AM Thomas Gleixner tglx@linutronix.de wrote:
First of all, sorry for the horribly big Cc list!
Following up to the discussion in:
https://lore.kernel.org/r/20200914204209.256266093@linutronix.de
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
Consolidating all kmap atomic implementations in generic code
Switching from per CPU storage of the kmap index to a per task storage
Adding a pteval array to the per task storage which contains the ptevals of the currently active temporary kmaps
Adding context switch code which checks whether the outgoing or the incoming task has active temporary kmaps. If so, the outgoing task's kmaps are removed and the incoming task's kmaps are restored.
Adding new interfaces k[un]map_temporary*() which are not disabling preemption and can be called from any context (except NMI).
Contrary to kmap() which provides preemptible and "persistant" mappings, these interfaces are meant to replace the temporary mappings provided by kmap_atomic*() today.
This allows to get rid of conditional mapping choices and allows to have preemptible short term mappings on 64bit which are today enforced to be non-preemptible due to the highmem constraints. It clearly puts overhead on the highmem users, but highmem is slow anyway.
This is not a wholesale conversion which makes kmap_atomic magically preemptible because there might be usage sites which rely on the implicit preempt disable. So this needs to be done on a case by case basis and the call sites converted to kmap_temporary.
Note, that this is only lightly tested on X86 and completely untested on all other architectures.
The lot is also available from
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem
I think it should be the case, but I want to double check: Will copy_*_user be allowed within a kmap_temporary section? This would allow us to ditch an absolute pile of slowpaths.
(coffee just kicked in) copy_*_user is ofc allowed, but if you hit a page fault you get a short read/write. This looks like it would remove the need to handle these in a slowpath, since page faults can now be served in this new kmap_temporary sections. But this sounds too good to be true, so I'm wondering what I'm missing. -Daniel
Thanks,
tglx
a/arch/arm/mm/highmem.c | 121 --------------------- a/arch/microblaze/mm/highmem.c | 78 ------------- a/arch/nds32/mm/highmem.c | 48 -------- a/arch/powerpc/mm/highmem.c | 67 ----------- a/arch/sparc/mm/highmem.c | 115 -------------------- arch/arc/Kconfig | 1 arch/arc/include/asm/highmem.h | 8 + arch/arc/mm/highmem.c | 44 ------- arch/arm/Kconfig | 1 arch/arm/include/asm/highmem.h | 30 +++-- arch/arm/mm/Makefile | 1 arch/csky/Kconfig | 1 arch/csky/include/asm/highmem.h | 4 arch/csky/mm/highmem.c | 75 ------------- arch/microblaze/Kconfig | 1 arch/microblaze/include/asm/highmem.h | 6 - arch/microblaze/mm/Makefile | 1 arch/microblaze/mm/init.c | 6 - arch/mips/Kconfig | 1 arch/mips/include/asm/highmem.h | 4 arch/mips/mm/highmem.c | 77 ------------- arch/mips/mm/init.c | 3 arch/nds32/Kconfig.cpu | 1 arch/nds32/include/asm/highmem.h | 21 ++- arch/nds32/mm/Makefile | 1 arch/powerpc/Kconfig | 1 arch/powerpc/include/asm/highmem.h | 6 - arch/powerpc/mm/Makefile | 1 arch/powerpc/mm/mem.c | 7 - arch/sparc/Kconfig | 1 arch/sparc/include/asm/highmem.h | 7 - arch/sparc/mm/Makefile | 3 arch/sparc/mm/srmmu.c | 2 arch/x86/include/asm/fixmap.h | 1 arch/x86/include/asm/highmem.h | 12 +- arch/x86/include/asm/iomap.h | 29 +++-- arch/x86/mm/highmem_32.c | 59 ---------- arch/x86/mm/init_32.c | 15 -- arch/x86/mm/iomap_32.c | 57 ---------- arch/xtensa/Kconfig | 1 arch/xtensa/include/asm/highmem.h | 9 + arch/xtensa/mm/highmem.c | 44 ------- b/arch/x86/Kconfig | 3 include/linux/highmem.h | 141 +++++++++++++++--------- include/linux/io-mapping.h | 2 include/linux/sched.h | 9 + kernel/sched/core.c | 10 + mm/Kconfig | 3 mm/highmem.c | 192 ++++++++++++++++++++++++++++++++-- 49 files changed, 422 insertions(+), 909 deletions(-)
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter daniel@ffwll.ch wrote:
I think it should be the case, but I want to double check: Will copy_*_user be allowed within a kmap_temporary section? This would allow us to ditch an absolute pile of slowpaths.
(coffee just kicked in) copy_*_user is ofc allowed, but if you hit a page fault you get a short read/write. This looks like it would remove the need to handle these in a slowpath, since page faults can now be served in this new kmap_temporary sections. But this sounds too good to be true, so I'm wondering what I'm missing.
In principle we could allow pagefaults, but not with the currently proposed interface which can be called from any context. Obviously if called from atomic context it can't handle user page faults.
In theory we could make a variant which does not disable pagefaults, but that's what kmap() already provides.
Thanks,
tglx
On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote:
On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter daniel@ffwll.ch wrote:
I think it should be the case, but I want to double check: Will copy_*_user be allowed within a kmap_temporary section? This would allow us to ditch an absolute pile of slowpaths.
(coffee just kicked in) copy_*_user is ofc allowed, but if you hit a page fault you get a short read/write. This looks like it would remove the need to handle these in a slowpath, since page faults can now be served in this new kmap_temporary sections. But this sounds too good to be true, so I'm wondering what I'm missing.
In principle we could allow pagefaults, but not with the currently proposed interface which can be called from any context. Obviously if called from atomic context it can't handle user page faults.
Yeah that's clear, but does the implemention need to disable pagefaults unconditionally?
In theory we could make a variant which does not disable pagefaults, but that's what kmap() already provides.
Currently we have a bunch of code which roughly does
kmap_atomic(); copy_*_user(); kunmap_atomic();
if (short_copy_user) { kmap(); copy_*_user(remaining_stuff); kunmap(); }
And the only reason is that kmap is a notch slower, hence the fastpath. If we get a kmap which is fast and allows pagefaults (only in contexts that allow pagefaults already ofc) then we can ditch a pile of kmap users.
Cheers, Daniel
On Sun, Sep 20 2020 at 10:23, Daniel Vetter wrote:
On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote:
On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter daniel@ffwll.ch wrote:
I think it should be the case, but I want to double check: Will copy_*_user be allowed within a kmap_temporary section? This would allow us to ditch an absolute pile of slowpaths.
(coffee just kicked in) copy_*_user is ofc allowed, but if you hit a page fault you get a short read/write. This looks like it would remove the need to handle these in a slowpath, since page faults can now be served in this new kmap_temporary sections. But this sounds too good to be true, so I'm wondering what I'm missing.
In principle we could allow pagefaults, but not with the currently proposed interface which can be called from any context. Obviously if called from atomic context it can't handle user page faults.
Yeah that's clear, but does the implemention need to disable pagefaults unconditionally?
As I wrote in the other reply it's not required and the final interface will neither disable preemption nor pagefaults (except for the atomic wrapper around it).
Thanks,
tglx
On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner tglx@linutronix.de wrote:
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
Ack. This looks really nice, even apart from the new capability.
The only thing I really reacted to is that the name doesn't make sense to me: "kmap_temporary()" seems a bit odd.
Particularly for an interface that really is basically meant as a better replacement of "kmap_atomic()" (but is perhaps also a better replacement for "kmap()").
I think I understand how the name came about: I think the "temporary" is there as a distinction from the "longterm" regular kmap(). So I think it makes some sense from an internal implementation angle, but I don't think it makes a lot of sense from an interface name.
I don't know what might be a better name, but if we want to emphasize that it's thread-private and a one-off, maybe "local" would be a better naming, and make it distinct from the "global" nature of the old kmap() interface?
However, another solution might be to just use this new preemptible "local" kmap(), and remove the old global one entirely. Yes, the old global one caches the page table mapping and that sounds really efficient and nice. But it's actually horribly horribly bad, because it means that we need to use locking for them. Your new "temporary" implementation seems to be fundamentally better locking-wise, and only need preemption disabling as locking (and is equally fast for the non-highmem case).
So I wonder if the single-page TLB flush isn't a better model, and whether it wouldn't be a lot simpler to just get rid of the old complex kmap() entirely, and replace it with this?
I agree we can't replace the kmap_atomic() version, because maybe people depend on the preemption disabling it also implied. But what about replacing the non-atomic kmap()?
Maybe I've missed something. Is it because the new interface still does "pagefault_disable()" perhaps?
But does it even need the pagefault_disable() at all? Yes, the *atomic* one obviously needed it. But why does this new one need to disable page faults?
[ I'm just reading the patches, I didn't try to apply them and look at the end result, so I might have missed something ]
The only other worry I would have is just test coverage of this change. I suspect very few developers use HIGHMEM. And I know the various test robots don't tend to test 32-bit either.
But apart from that question about naming (and perhaps replacing kmap() entirely), I very much like it.
Linus
On Sat, Sep 19, 2020 at 10:18:54AM -0700, Linus Torvalds wrote:
On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner tglx@linutronix.de wrote:
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
Ack. This looks really nice, even apart from the new capability.
The only thing I really reacted to is that the name doesn't make sense to me: "kmap_temporary()" seems a bit odd.
Particularly for an interface that really is basically meant as a better replacement of "kmap_atomic()" (but is perhaps also a better replacement for "kmap()").
I think I understand how the name came about: I think the "temporary" is there as a distinction from the "longterm" regular kmap(). So I think it makes some sense from an internal implementation angle, but I don't think it makes a lot of sense from an interface name.
I don't know what might be a better name, but if we want to emphasize that it's thread-private and a one-off, maybe "local" would be a better naming, and make it distinct from the "global" nature of the old kmap() interface?
However, another solution might be to just use this new preemptible "local" kmap(), and remove the old global one entirely. Yes, the old global one caches the page table mapping and that sounds really efficient and nice. But it's actually horribly horribly bad, because it means that we need to use locking for them. Your new "temporary" implementation seems to be fundamentally better locking-wise, and only need preemption disabling as locking (and is equally fast for the non-highmem case).
So I wonder if the single-page TLB flush isn't a better model, and whether it wouldn't be a lot simpler to just get rid of the old complex kmap() entirely, and replace it with this?
I agree we can't replace the kmap_atomic() version, because maybe people depend on the preemption disabling it also implied. But what about replacing the non-atomic kmap()?
My concern with that is people might use kmap() and then pass the address to a different task. So we need to audit the current users of kmap() and convert any that do that into using vmap() instead.
I like kmap_local(). Or kmap_thread().
On Sat, Sep 19, 2020 at 10:39 AM Matthew Wilcox willy@infradead.org wrote:
My concern with that is people might use kmap() and then pass the address to a different task. So we need to audit the current users of kmap() and convert any that do that into using vmap() instead.
Ahh. Yes, I guess they might do that. It sounds strange, but not entirely crazy - I could imagine some "PIO thread" that does IO to a page that has been set up by somebody else using kmap(). Or similar.
Linus
On Sat, Sep 19, 2020 at 06:39:06PM +0100, Matthew Wilcox wrote:
On Sat, Sep 19, 2020 at 10:18:54AM -0700, Linus Torvalds wrote:
On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner tglx@linutronix.de wrote:
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
Ack. This looks really nice, even apart from the new capability.
The only thing I really reacted to is that the name doesn't make sense to me: "kmap_temporary()" seems a bit odd.
Particularly for an interface that really is basically meant as a better replacement of "kmap_atomic()" (but is perhaps also a better replacement for "kmap()").
I think I understand how the name came about: I think the "temporary" is there as a distinction from the "longterm" regular kmap(). So I think it makes some sense from an internal implementation angle, but I don't think it makes a lot of sense from an interface name.
I don't know what might be a better name, but if we want to emphasize that it's thread-private and a one-off, maybe "local" would be a better naming, and make it distinct from the "global" nature of the old kmap() interface?
However, another solution might be to just use this new preemptible "local" kmap(), and remove the old global one entirely. Yes, the old global one caches the page table mapping and that sounds really efficient and nice. But it's actually horribly horribly bad, because it means that we need to use locking for them. Your new "temporary" implementation seems to be fundamentally better locking-wise, and only need preemption disabling as locking (and is equally fast for the non-highmem case).
So I wonder if the single-page TLB flush isn't a better model, and whether it wouldn't be a lot simpler to just get rid of the old complex kmap() entirely, and replace it with this?
I agree we can't replace the kmap_atomic() version, because maybe people depend on the preemption disabling it also implied. But what about replacing the non-atomic kmap()?
My concern with that is people might use kmap() and then pass the address to a different task. So we need to audit the current users of kmap() and convert any that do that into using vmap() instead.
I've done some of this work.[3] PKS and pmem stray write protection[2] depend on kmap to enable the correct PKS settings. After working through the exception handling we realized that some users of kmap() seem to be doing just this; passing the address to a different task.
From what I have found ~90% of kmap() callers are 'kmap_thread()' and the other
~10% are kmap().[3] But of those 10% I'm not familiar with the code enough to know if they really require a 'global' map. What I do know is they save an address which appears to be used in other threads. But I could be wrong.
For PKS I added a 'global' implementation which could then be called by kmap() and added a new kmap_thread() call which used the original 'local' version of the PKS interface. The PKS work is still being reviewed internally for the TIP core code. But I've pushed it all to git hub for purposes of this discussion.[1]
I like kmap_local(). Or kmap_thread().
I chose kmap_thread() so that makes sense to me. I also thought about using kmap_global() as an alternative interface which would change just ~10% of the callers and make the series much smaller. But internal discussions lead me to chose kmap_thread() as the new interface so that we don't change the semantics of kmap().
Ira
[1] https://github.com/weiny2/linux-kernel/tree/lm-pks-pmem-for-5.10-v3
[2] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.weiny@intel.com/
[3] 12:42:06 > git grep ' kmap(' *.c | grep -v '* ' | wc -l 22
12:43:32 > git grep ' kmap_thread(' *.c | grep -v '* ' | wc -l 204
Here are the callers which hand an address to another thread.
12:45:25 > git grep ' kmap(' *.c | grep -v '* ' arch/x86/mm/dump_pagetables.c: [PKMAP_BASE_NR] = { 0UL, "Persistent kmap() Area" }, drivers/firewire/net.c: ptr = kmap(dev->broadcast_rcv_buffer.pages[u]); drivers/gpu/drm/i915/gem/i915_gem_pages.c: return kmap(sg_page(sgt->sgl)); drivers/gpu/drm/i915/selftests/i915_perf.c: scratch = kmap(ce->vm->scratch[0].base.page); drivers/gpu/drm/ttm/ttm_bo_util.c: map->virtual = kmap(map->page); drivers/infiniband/hw/qib/qib_user_sdma.c: mpage = kmap(page); drivers/misc/vmw_vmci/vmci_host.c: context->notify = kmap(context->notify_page) + (uva & (PAGE_SIZE - 1)); drivers/misc/xilinx_sdfec.c: addr = kmap(pages[i]); drivers/mmc/host/usdhi6rol0.c: host->pg.mapped = kmap(host->pg.page); drivers/mmc/host/usdhi6rol0.c: host->pg.mapped = kmap(host->pg.page); drivers/mmc/host/usdhi6rol0.c: host->pg.mapped = kmap(host->pg.page); drivers/nvme/target/tcp.c: iov->iov_base = kmap(sg_page(sg)) + sg->offset + sg_offset; drivers/scsi/libiscsi_tcp.c: segment->sg_mapped = kmap(sg_page(sg)); drivers/target/iscsi/iscsi_target.c: iov[i].iov_base = kmap(sg_page(sg)) + sg->offset + page_off; drivers/target/target_core_transport.c: return kmap(sg_page(sg)) + sg->offset; fs/btrfs/check-integrity.c: block_ctx->datav[i] = kmap(block_ctx->pagev[i]); fs/ceph/dir.c: cache_ctl->dentries = kmap(cache_ctl->page); fs/ceph/inode.c: ctl->dentries = kmap(ctl->page); lib/scatterlist.c: miter->addr = kmap(miter->page) + miter->__offset; net/ceph/pagelist.c: pl->mapped_tail = kmap(page); net/ceph/pagelist.c: pl->mapped_tail = kmap(page); virt/kvm/kvm_main.c: hva = kmap(page);
On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote:
On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner tglx@linutronix.de wrote:
this provides a preemptible variant of kmap_atomic & related interfaces. This is achieved by:
Ack. This looks really nice, even apart from the new capability.
The only thing I really reacted to is that the name doesn't make sense to me: "kmap_temporary()" seems a bit odd.
Yeah. Couldn't come up with something useful.
Particularly for an interface that really is basically meant as a better replacement of "kmap_atomic()" (but is perhaps also a better replacement for "kmap()").
I think I understand how the name came about: I think the "temporary" is there as a distinction from the "longterm" regular kmap(). So I think it makes some sense from an internal implementation angle, but I don't think it makes a lot of sense from an interface name.
I don't know what might be a better name, but if we want to emphasize that it's thread-private and a one-off, maybe "local" would be a better naming, and make it distinct from the "global" nature of the old kmap() interface?
Right, _local or _thread would be more intuitive.
However, another solution might be to just use this new preemptible "local" kmap(), and remove the old global one entirely. Yes, the old global one caches the page table mapping and that sounds really efficient and nice. But it's actually horribly horribly bad, because it means that we need to use locking for them. Your new "temporary" implementation seems to be fundamentally better locking-wise, and only need preemption disabling as locking (and is equally fast for the non-highmem case).
So I wonder if the single-page TLB flush isn't a better model, and whether it wouldn't be a lot simpler to just get rid of the old complex kmap() entirely, and replace it with this?
I agree we can't replace the kmap_atomic() version, because maybe people depend on the preemption disabling it also implied. But what about replacing the non-atomic kmap()?
Maybe I've missed something. Is it because the new interface still does "pagefault_disable()" perhaps?
But does it even need the pagefault_disable() at all? Yes, the *atomic* one obviously needed it. But why does this new one need to disable page faults?
It disables pagefaults because it can be called from atomic and non-atomic context. That was the point to get rid of
if (preeemptible()) kmap(); else kmap_atomic();
If it does not disable pagefaults, then it's just a lightweight variant of kmap() for short lived mappings.
But apart from that question about naming (and perhaps replacing kmap() entirely), I very much like it.
I thought about it, but then I figured that kmap pointers can be handed to other contexts from the thread which sets up the mapping because it's 'permanent'.
I'm not sure whether that actually happens, so we'd need to audit all kmap() users to be sure. If there is no such use case, then we surely can get of rid of kmap() completely. It's only 300+ instances to stare at and quite some of them are wrapped into other functions.
Highmem sucks no matter what and the only sane solution is to remove it completely.
Thanks,
tglx
On Sun, Sep 20 2020 at 08:41, Thomas Gleixner wrote:
On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote:
Maybe I've missed something. Is it because the new interface still does "pagefault_disable()" perhaps?
But does it even need the pagefault_disable() at all? Yes, the *atomic* one obviously needed it. But why does this new one need to disable page faults?
It disables pagefaults because it can be called from atomic and non-atomic context. That was the point to get rid of
if (preeemptible()) kmap(); else kmap_atomic();
If it does not disable pagefaults, then it's just a lightweight variant of kmap() for short lived mappings.
Actually most usage sites of kmap atomic do not need page faults to be disabled at all. As Daniel pointed out the implicit pagefault disable enforces copy_from_user_inatomic() even in places which are clearly preemptible task context.
As we need to look at each instance anyway we can add the PF disable explicitely to the very few places which actually need it.
Thanks,
tglx
On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner tglx@linutronix.de wrote:
Actually most usage sites of kmap atomic do not need page faults to be disabled at all.
Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part.
I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting.
The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name?
It's very wrong to do
addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2);
because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec).
So it's fundamentally a stack.
And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest.
So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still.
Btw, looking at the stack code, Ithink your new implementation of it is a bit scary:
static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++;
and now that 'current->kmap_ctrl.idx' is not atomic wrt
(a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did)
(b) the prev/next switch
And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value.
So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*.
And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing.
I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work.
And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack.
And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway.
So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile.
So I would suggest:
- continue to use an actual per-cpu kmap_atomic_idx
- make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action).
which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely.
Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is.
Linus
On Sun, Sep 20 2020 at 09:57, Linus Torvalds wrote:
On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner tglx@linutronix.de wrote: Btw, looking at the stack code, Ithink your new implementation of it is a bit scary:
static inline int kmap_atomic_idx_push(void) {
int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
int idx = current->kmap_ctrl.idx++;
and now that 'current->kmap_ctrl.idx' is not atomic wrt
(a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did)
Right. Nesting should be a non issue, but I don't think we have kmap_atomic() in NMI context.
(b) the prev/next switch
And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value.
So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*.
And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing.
Duh yes. Never thought about that.
Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is.
I think the more obvious solution is to split the whole exercise:
schedule() prepare_switch() unmap()
switch_to()
finish_switch() map()
That's safe because neither the unmap() nor the map() code changes kmap_ctrl.idx. So if there is an interrupt coming in between unmap() and switch_to() then a kmap_local() there will use the next entry. So we could even do the unmap() with interrupts enabled (preemption disabled). Same for the map() part.
To explain that we need only a few lines of commentry methinks.
Thanks,
tglx
On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner tglx@linutronix.de wrote:
I think the more obvious solution is to split the whole exercise:
schedule() prepare_switch() unmap()
switch_to() finish_switch() map()
Yeah, that looks much easier to explain. Ack.
Linus
On Sun, Sep 20, 2020 at 10:42 AM Linus Torvalds torvalds@linux-foundation.org wrote:
Yeah, that looks much easier to explain. Ack.
Btw, one thing that might be a good idea at least initially is to add a check for p->kmap_ctrl.idx being zero at fork, exit and maybe syscall return time (but that last one may be too cumbersome to really worry about).
The kmap_atomic() interface basically has a lot of coverage for leaked as part of all the "might_sleep()" checks sprinkled around, The new kmap_temporary/local/whatever wouldn't have that kind of incidental debug checking, and any leaked kmap indexes would be rather hard to debug (much) later when they cause index overflows or whatever.
Linus
On Sun, Sep 20 2020 at 10:42, Linus Torvalds wrote:
On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner tglx@linutronix.de wrote:
I think the more obvious solution is to split the whole exercise:
schedule() prepare_switch() unmap()
switch_to() finish_switch() map()
Yeah, that looks much easier to explain. Ack.
So far so good, but Peter Z. just pointed out to me that I completely missed the fact that this cannot work.
If a task is migrated to a different CPU then the mapping address will change which will explode in colourful ways.
On RT kernels this works because we ping the task to the CPU via migrate_disable(). On a !RT kernel migrate_disable() maps to preempt_disable() which brings us back to square one.
/me goes back to the drawing board.
Thanks,
tglx
On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner tglx@linutronix.de wrote:
If a task is migrated to a different CPU then the mapping address will change which will explode in colourful ways.
Heh.
Right you are.
Maybe we really *could* call this new kmap functionality something like "kmap_percpu()" (or maybe "local" is good enough), and make it act like your RT code does for spinlocks - not disable preemption, but only disabling CPU migration.
That would probably be good enough for a lot of users that don't want to expose excessive latencies, but where it's really not a huge deal to say "stick to this CPU for a short while".
The crypto code certainly sounds like one such case.
Linus
On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner tglx@linutronix.de wrote:
If a task is migrated to a different CPU then the mapping address will change which will explode in colourful ways.
Right you are.
Maybe we really *could* call this new kmap functionality something like "kmap_percpu()" (or maybe "local" is good enough), and make it act like your RT code does for spinlocks - not disable preemption, but only disabling CPU migration.
I"m all for it, but the scheduler people have opinions :)
That would probably be good enough for a lot of users that don't want to expose excessive latencies, but where it's really not a huge deal to say "stick to this CPU for a short while".
The crypto code certainly sounds like one such case.
I looked at a lot of the kmap_atomic() places and quite some of them only require migration to be disabled to keep the temporary map stable.
Quite some code could be simplified significantly especially those places which need to do copy_from/to_user inside these sections. Graphics is the main example here as Daniel pointed out.
Alternatively this could of course be solved with per CPU page tables which will come around some day anyway I fear.
Thanks,
tglx
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner tglx@linutronix.de wrote:
If a task is migrated to a different CPU then the mapping address will change which will explode in colourful ways.
Right you are.
Maybe we really *could* call this new kmap functionality something like "kmap_percpu()" (or maybe "local" is good enough), and make it act like your RT code does for spinlocks - not disable preemption, but only disabling CPU migration.
I"m all for it, but the scheduler people have opinions :)
Right, so I'm concerned. migrate_disable() wrecks pretty much all Real-Time scheduler theory we have, and PREEMPRT_RT bringing it in is somewhat ironic.
Yes, it allows breaking up non-preemptible regions of non-deterministic duration, and thereby both reduce and bound the scheduling latency, the cost for doing that is that the theory on CPU utilization/bandwidth go out the window.
To easily see this consider an SMP system with a number of tasks equal to the number of CPUs. On a regular (preempt_disable) kernel we can always run each task, by virtue of always having an idle CPU to take the task.
However, with migrate_disable() we can have each task preempted in a migrate_disable() region, worse we can stack them all on the _same_ CPU (super ridiculous odds, sure). And then we end up only able to run one task, with the rest of the CPUs picking their nose.
The end result is that, like with unbounded latency, we will still miss our deadline, simply because we got starved for CPU.
Now, while we could (with a _lot_ of work) rework the kernel to not rely on the implicit per-cpu ness of things like spinlock_t, the moment we bring in basic primitives that rely on migrate_disable() we're stuck with it.
The problem is; afaict there's been no research into this problem. There might be scheduling (read: balancing) schemes that can mitigate/solve this problem, or it might prove to be a 'hard' problem, I just don't know. But once we start down this road, it's going to be hell to get rid of it.
That's why I've been arguing (for many years) to strictly limit this to PREEMPT_RT and only as a gap-stop, not a fundamental primitive to build on.
On Wed, Sep 23 2020 at 10:40, peterz wrote:
Right, so I'm concerned. migrate_disable() wrecks pretty much all Real-Time scheduler theory we have, and PREEMPRT_RT bringing it in is somewhat ironic.
It's even more ironic that the approach of PREEMPT_RT has been 'pragmatic ignorance of theory' from the very beginning and despite violating all theories it still works. :)
Yes, it allows breaking up non-preemptible regions of non-deterministic duration, and thereby both reduce and bound the scheduling latency, the cost for doing that is that the theory on CPU utilization/bandwidth go out the window.
I agree, that the theory goes out of the window, but does it matter in practice? I've yet to see a report of migrate disable stacking being the culprit of a missed deadline and I surely have stared at lots of reports in the past 10+ years.
To easily see this consider an SMP system with a number of tasks equal to the number of CPUs. On a regular (preempt_disable) kernel we can always run each task, by virtue of always having an idle CPU to take the task.
However, with migrate_disable() we can have each task preempted in a migrate_disable() region, worse we can stack them all on the _same_ CPU (super ridiculous odds, sure). And then we end up only able to run one task, with the rest of the CPUs picking their nose.
The end result is that, like with unbounded latency, we will still miss our deadline, simply because we got starved for CPU.
I'm well aware of that.
Now, while we could (with a _lot_ of work) rework the kernel to not rely on the implicit per-cpu ness of things like spinlock_t, the moment we bring in basic primitives that rely on migrate_disable() we're stuck with it.
Right, but we are stuck with per CPUness and distangling that is just infeasible IMO.
The problem is; afaict there's been no research into this problem.
There is no research on a lot of things the kernel does :)
There might be scheduling (read: balancing) schemes that can mitigate/solve this problem, or it might prove to be a 'hard' problem, I just don't know.
In practive balancing surely can take the number of preempted tasks which are in a migrate disable section into account which would be just another measure to work around the fact that the kernel is not adhering to the theories. It never did that even w/o migrate disable.
But once we start down this road, it's going to be hell to get rid of it.
Like most of the other things the kernel came up with to deal with the oddities of modern hardware :)
That's why I've been arguing (for many years) to strictly limit this to PREEMPT_RT and only as a gap-stop, not a fundamental primitive to build on.
I know, but short of rewriting the world, I'm not seing the faintest plan to remove the stop gap. :)
As we discussed not long ago we have too many inconsistent preemption models already. RT is adding yet another one. And that's worse than introducing migrate disable as a general available facility.
IMO, reaching a point of consistency where our different preemption models (NONE, VOLUNTARY, PREEMPT. RT) build on each other is far more important.
For !RT migrate disable is far less of an danger than for RT kernels because the amount of code which will use it is rather limited compared to code which still will disable preemption implicit through spin and rw locks.
On RT converting these locks to 'sleepable spinlocks' is just possible because RT forces almost everything into task context and migrate disable is just the obvious decomposition of preempt disable which implicitely disables migration.
But that means that RT is by orders of magnitude more prone to run into the scheduling trainwreck you are worried about. It just refuses to do so at least with real world work loads.
I'm surely in favour of having solid theories behind implementation, but at some point you just have to bite the bullet and chose pragmatism in order to make progress.
Proliferating inconsistency is not real progress, as it is violating the most fundamental engineering principles. That's by far more dangerous than violating scheduling theories which are built on perfect models and therefore enforce violation by practical implementations anyway.
Thanks,
tglx
On Wed, 23 Sep 2020 10:40:32 +0200 peterz@infradead.org wrote:
However, with migrate_disable() we can have each task preempted in a migrate_disable() region, worse we can stack them all on the _same_ CPU (super ridiculous odds, sure). And then we end up only able to run one task, with the rest of the CPUs picking their nose.
What if we just made migrate_disable() a local_lock() available for !RT?
I mean make it a priority inheritance PER CPU lock.
That is, no two tasks could do a migrate_disable() on the same CPU? If one task does a migrate_disable() and then gets preempted and the preempting task does a migrate_disable() on the same CPU, it will block and wait for the first task to do a migrate_enable().
No two tasks on the same CPU could enter the migrate_disable() section simultaneously, just like no two tasks could enter a preempt_disable() section.
In essence, we just allow local_lock() to be used for both RT and !RT.
Perhaps make migrate_disable() an anonymous local_lock()?
This should lower the SHC in theory, if you can't have stacked migrate disables on the same CPU.
-- Steve
On Wed, Sep 23 2020 at 11:52, Steven Rostedt wrote:
On Wed, 23 Sep 2020 10:40:32 +0200 peterz@infradead.org wrote:
However, with migrate_disable() we can have each task preempted in a migrate_disable() region, worse we can stack them all on the _same_ CPU (super ridiculous odds, sure). And then we end up only able to run one task, with the rest of the CPUs picking their nose.
What if we just made migrate_disable() a local_lock() available for !RT?
I mean make it a priority inheritance PER CPU lock.
That is, no two tasks could do a migrate_disable() on the same CPU? If one task does a migrate_disable() and then gets preempted and the preempting task does a migrate_disable() on the same CPU, it will block and wait for the first task to do a migrate_enable().
No two tasks on the same CPU could enter the migrate_disable() section simultaneously, just like no two tasks could enter a preempt_disable() section.
In essence, we just allow local_lock() to be used for both RT and !RT.
Perhaps make migrate_disable() an anonymous local_lock()?
This should lower the SHC in theory, if you can't have stacked migrate disables on the same CPU.
I'm pretty sure this ends up in locking hell pretty fast and aside of that it's not working for scenarios like:
kmap_local(); migrate_disable(); ...
copy_from_user() -> #PF -> schedule()
which brought us into that discussion in the first place. You would stop any other migrate disable user from running until the page fault is resolved...
Thanks,
tglx
On Wed, 23 Sep 2020 22:55:54 +0200 Thomas Gleixner tglx@linutronix.de wrote:
Perhaps make migrate_disable() an anonymous local_lock()?
This should lower the SHC in theory, if you can't have stacked migrate disables on the same CPU.
I'm pretty sure this ends up in locking hell pretty fast and aside of that it's not working for scenarios like:
kmap_local(); migrate_disable(); ... copy_from_user() -> #PF -> schedule()
which brought us into that discussion in the first place. You would stop any other migrate disable user from running until the page fault is resolved...
Then scratch the idea of having anonymous local_lock() and just bring local_lock in directly? Then have a kmap local lock, which would only block those that need to do a kmap.
Now as for migration disabled nesting, at least now we would have groupings of this, and perhaps the theorists can handle that. I mean, how is this much different that having a bunch of tasks blocked on a mutex with the owner is pinned on a CPU?
migrate_disable() is a BKL of pinning affinity. If we only have local_lock() available (even on !RT), then it makes the blocking in groups. At least this way you could grep for all the different local_locks in the system and plug that into the algorithm for WCS, just like one would with a bunch of mutexes.
-- Steve
On Wed, Sep 23 2020 at 17:12, Steven Rostedt wrote:
On Wed, 23 Sep 2020 22:55:54 +0200 Then scratch the idea of having anonymous local_lock() and just bring local_lock in directly? Then have a kmap local lock, which would only block those that need to do a kmap.
That's still going to end up in lock ordering nightmares and you lose the ability to use kmap_local from arbitrary contexts which was again one of the goals of this exercise.
Aside of that you're imposing reentrancy protections on something which does not need it in the first place.
Now as for migration disabled nesting, at least now we would have groupings of this, and perhaps the theorists can handle that. I mean, how is this much different that having a bunch of tasks blocked on a mutex with the owner is pinned on a CPU?
migrate_disable() is a BKL of pinning affinity.
No. That's just wrong. preempt disable is a concurrency control, i.e. protecting against reentrancy on a given CPU. But it's a cpu global protection which means that it's not protecting a specific code path.
Contrary to preempt disable, migrate disable is not protecting against reentrancy on a given CPU. It's a temporary restriction to the scheduler on placement.
The fact that disabling preemption implicitely disables migration does not make them semantically equivalent.
If we only have local_lock() available (even on !RT), then it makes the blocking in groups. At least this way you could grep for all the different local_locks in the system and plug that into the algorithm for WCS, just like one would with a bunch of mutexes.
You cannot do that on RT at all where migrate disable is substituting preempt disable in spin and rw locks. The result would be the same as with a !RT kernel just with horribly bad performance.
That means the stacking problem has to be solved anyway.
So why on earth do you want to create yet another special duct tape case for kamp_local() which proliferates inconsistency instead of aiming for consistency accross all preemption models?
Thanks,
tglx
On Thu, 24 Sep 2020 08:57:52 +0200 Thomas Gleixner tglx@linutronix.de wrote:
Now as for migration disabled nesting, at least now we would have groupings of this, and perhaps the theorists can handle that. I mean, how is this much different that having a bunch of tasks blocked on a mutex with the owner is pinned on a CPU?
migrate_disable() is a BKL of pinning affinity.
No. That's just wrong. preempt disable is a concurrency control,
I think you totally misunderstood what I was saying. The above wasn't about comparing preempt_disable to migrate_disable. It was comparing migrate_disable to a chain of tasks blocked on mutexes where the top owner has preempt_disable set. You still have a bunch of tasks that can't move to other CPUs.
If we only have local_lock() available (even on !RT), then it makes the blocking in groups. At least this way you could grep for all the different local_locks in the system and plug that into the algorithm for WCS, just like one would with a bunch of mutexes.
You cannot do that on RT at all where migrate disable is substituting preempt disable in spin and rw locks. The result would be the same as with a !RT kernel just with horribly bad performance.
Note, the spin and rwlocks already have a lock associated with them. Why would it be any different on RT? I wasn't suggesting adding another lock inside a spinlock. Why would I recommend THAT? I wasn't recommending blindly replacing migrate_disable() with local_lock(). I just meant expose local_lock() but not migrate_disable().
That means the stacking problem has to be solved anyway.
So why on earth do you want to create yet another special duct tape case for kamp_local() which proliferates inconsistency instead of aiming for consistency accross all preemption models?
The idea was to help with the scheduling issue.
Anyway, instead of blocking. What about having a counter of number of migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's already another task with migrate_disabled() set, and the current task has an affinity greater than 1, it tries to migrate to another CPU?
This way migrate_disable() is less likely to have a bunch of tasks blocked on one CPU serialized by each task exiting the migrate_disable() section.
Yes, there's more overhead, but it only happens if multiple tasks are in a migrate disable section on the same CPU.
-- Steve
On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
Anyway, instead of blocking. What about having a counter of number of migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's already another task with migrate_disabled() set, and the current task has an affinity greater than 1, it tries to migrate to another CPU?
That doesn't solve the problem. On wakeup we should already prefer an idle CPU over one running a (RT) task, but you can always wake more tasks than there's CPUs around and you'll _have_ to stack at some point.
The trick is how to unstack them correctly. We need to detect when a migrate_disable() task _should_ start running again, and migrate away whoever is in the way at that point.
It turns out, that getting selected for pull-balance is exactly that condition, and clearly a migrate_disable() task cannot be pulled, but we can use that signal to try and pull away the running task that's in the way.
On Thu, 24 Sep 2020 14:42:41 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
Anyway, instead of blocking. What about having a counter of number of migrate disabled tasks per cpu, and when taking a migrate_disable(), and there's already another task with migrate_disabled() set, and the current task has an affinity greater than 1, it tries to migrate to another CPU?
That doesn't solve the problem. On wakeup we should already prefer an idle CPU over one running a (RT) task, but you can always wake more tasks than there's CPUs around and you'll _have_ to stack at some point.
Yes, understood.
The trick is how to unstack them correctly. We need to detect when a migrate_disable() task _should_ start running again, and migrate away whoever is in the way at that point.
It turns out, that getting selected for pull-balance is exactly that condition, and clearly a migrate_disable() task cannot be pulled, but we can use that signal to try and pull away the running task that's in the way.
Unless of course that running task is in a migrate disable section itself ;-)
But I guess we will always have that SHC, and there will always be a scenario that you can't balance properly. But hopefully in practice we wont see that.
How to handle kmap_local(), will migrate_disable() be used only for 32bit or, for consistency, will it also apply to 64bit?
-- Steve
On Thu, Sep 24, 2020 at 09:51:38AM -0400, Steven Rostedt wrote:
It turns out, that getting selected for pull-balance is exactly that condition, and clearly a migrate_disable() task cannot be pulled, but we can use that signal to try and pull away the running task that's in the way.
Unless of course that running task is in a migrate disable section itself ;-)
See my ramblings here:
https://lkml.kernel.org/r/20200924082717.GA1362448@hirez.programming.kicks-a...
My plan was to have the migrate_enable() of the running task trigger the migration in that case.
On Thu, Sep 24 2020 at 08:32, Steven Rostedt wrote:
On Thu, 24 Sep 2020 08:57:52 +0200 Thomas Gleixner tglx@linutronix.de wrote:
Now as for migration disabled nesting, at least now we would have groupings of this, and perhaps the theorists can handle that. I mean, how is this much different that having a bunch of tasks blocked on a mutex with the owner is pinned on a CPU?
migrate_disable() is a BKL of pinning affinity.
No. That's just wrong. preempt disable is a concurrency control,
I think you totally misunderstood what I was saying. The above wasn't about comparing preempt_disable to migrate_disable. It was comparing migrate_disable to a chain of tasks blocked on mutexes where the top owner has preempt_disable set. You still have a bunch of tasks that can't move to other CPUs.
What? The top owner does not prevent any task from moving. The tasks cannot move because they are blocked on the mutex, which means they are not runnable and non runnable tasks are not migrated at all.
I really don't understand what you are trying to say.
If we only have local_lock() available (even on !RT), then it makes the blocking in groups. At least this way you could grep for all the different local_locks in the system and plug that into the algorithm for WCS, just like one would with a bunch of mutexes.
You cannot do that on RT at all where migrate disable is substituting preempt disable in spin and rw locks. The result would be the same as with a !RT kernel just with horribly bad performance.
Note, the spin and rwlocks already have a lock associated with them. Why would it be any different on RT? I wasn't suggesting adding another lock inside a spinlock. Why would I recommend THAT? I wasn't recommending blindly replacing migrate_disable() with local_lock(). I just meant expose local_lock() but not migrate_disable().
We already exposed local_lock() to non RT and it's for places which do preempt_disable() or local_irq_disable() without having a lock associated. But both primitives are scope less and therefore behave like CPU local BKLs. What local_lock() provides in these cases is:
- Making the protection scope clear by associating a named local lock which is coverred by lockdep.
- It still maps to preempt_disable() or local_irq_disable() in !RT kernels
- The scope and the named lock allows RT kernels to substitute with real (recursion aware) locking primitives which keep preemption and interupts enabled, but provide the fine grained protection for the scoped critical section.
So how would you substitute migrate_disable() with a local_lock()? You can't. Again migrate_disable() is NOT a concurrency control and therefore it cannot be substituted by any concurrency control primitive.
That means the stacking problem has to be solved anyway.
So why on earth do you want to create yet another special duct tape case for kamp_local() which proliferates inconsistency instead of aiming for consistency accross all preemption models?
The idea was to help with the scheduling issue.
I don't see how mixing concepts and trying to duct tape a problem which is clearly in the realm of the scheduler, i.e. load balancing and placing algorithms, is helpful.
Thanks,
tglx
On Thu, 24 Sep 2020 19:55:10 +0200 Thomas Gleixner tglx@linutronix.de wrote:
On Thu, Sep 24 2020 at 08:32, Steven Rostedt wrote:
On Thu, 24 Sep 2020 08:57:52 +0200 Thomas Gleixner tglx@linutronix.de wrote:
Now as for migration disabled nesting, at least now we would have groupings of this, and perhaps the theorists can handle that. I mean, how is this much different that having a bunch of tasks blocked on a mutex with the owner is pinned on a CPU?
migrate_disable() is a BKL of pinning affinity.
No. That's just wrong. preempt disable is a concurrency control,
I think you totally misunderstood what I was saying. The above wasn't about comparing preempt_disable to migrate_disable. It was comparing migrate_disable to a chain of tasks blocked on mutexes where the top owner has preempt_disable set. You still have a bunch of tasks that can't move to other CPUs.
What? The top owner does not prevent any task from moving. The tasks cannot move because they are blocked on the mutex, which means they are not runnable and non runnable tasks are not migrated at all.
And neither are migrated disabled tasks preempted by a high priority task.
I really don't understand what you are trying to say.
Don't worry about it. I was just making a high level comparison of how migrate disabled tasks blocked on a higher priority task is similar to that of tasks blocked on a mutex held by a pinned task that is preempted by a high priority task. But we can forget this analogy as it's not appropriate for the current conversation.
If we only have local_lock() available (even on !RT), then it makes the blocking in groups. At least this way you could grep for all the different local_locks in the system and plug that into the algorithm for WCS, just like one would with a bunch of mutexes.
You cannot do that on RT at all where migrate disable is substituting preempt disable in spin and rw locks. The result would be the same as with a !RT kernel just with horribly bad performance.
Note, the spin and rwlocks already have a lock associated with them. Why would it be any different on RT? I wasn't suggesting adding another lock inside a spinlock. Why would I recommend THAT? I wasn't recommending blindly replacing migrate_disable() with local_lock(). I just meant expose local_lock() but not migrate_disable().
We already exposed local_lock() to non RT and it's for places which do preempt_disable() or local_irq_disable() without having a lock associated. But both primitives are scope less and therefore behave like CPU local BKLs. What local_lock() provides in these cases is:
Making the protection scope clear by associating a named local lock which is coverred by lockdep.
It still maps to preempt_disable() or local_irq_disable() in !RT kernels
The scope and the named lock allows RT kernels to substitute with real (recursion aware) locking primitives which keep preemption and interupts enabled, but provide the fine grained protection for the scoped critical section.
I'm very much aware of the above.
So how would you substitute migrate_disable() with a local_lock()? You can't. Again migrate_disable() is NOT a concurrency control and therefore it cannot be substituted by any concurrency control primitive.
When I was first writing my email, I was writing about a way to replace migrate_disable with a construct similar to local locks without actually mentioning local locks, but then rewrote it to state local locks, trying to simplify what I was writing. I shouldn't have done that, because it portrayed that I wanted to use local_lock() unmodified. I was actually thinking of a new construct that was similar but not exactly the same as local lock.
But this will just make things more complex and we can forget about it.
I'll wait to see what Peter produces.
-- Steve
On Wed, Sep 23, 2020 at 11:52:51AM -0400, Steven Rostedt wrote:
On Wed, 23 Sep 2020 10:40:32 +0200 peterz@infradead.org wrote:
However, with migrate_disable() we can have each task preempted in a migrate_disable() region, worse we can stack them all on the _same_ CPU (super ridiculous odds, sure). And then we end up only able to run one task, with the rest of the CPUs picking their nose.
What if we just made migrate_disable() a local_lock() available for !RT?
Can't, neiter migrate_disable() nor migrate_enable() are allowed to block -- which is what makes their implementation so 'interesting'.
This should lower the SHC in theory, if you can't have stacked migrate disables on the same CPU.
See this email in that other thread (you're on Cc too IIRC):
https://lkml.kernel.org/r/20200923170809.GY1362448@hirez.programming.kicks-a...
I think that is we 'frob' the balance PULL, we'll end up with something similar.
Whichever way around we turn this thing, the migrate_disable() runtime (we'll have to add a tracer for that), will be an interference term on the lower priority task, exactly like preempt_disable() would be. We'll just not exclude a higher priority task from running.
AFAICT; the best case is a single migrate_disable() nesting, where a higher priority task preempts in a migrate_disable() section -- this is per design.
When this preempted task becomes elegible to run under the ideal model (IOW it becomes one of the M highest priority tasks), it might still have to wait for the preemptee's migrate_disable() section to complete. Thereby suffering interference in the exact duration of migrate_disable() section.
Per this argument, the change from preempt_disable() to migrate_disable() gets us:
- higher priority tasks gain reduced wake-up latency - lower priority tasks are unchanged and are subject to the exact same interference term as if the higher priority task were using preempt_disable().
Since we've already established this term is unbounded, any task but the highest priority task is basically buggered.
TL;DR, if we get balancing fixed and achieve (near) the optimal case above, migrate_disable() is an over-all win. But it's provably non-deterministic as long as the migrate_disable() sections are non-deterministic.
The reason this all mostly works in practise is (I think) because:
- People care most about the higher prio RT tasks and craft them to mostly avoid the migrate_disable() infected code.
- The preemption scenario is most pronounced at higher utilization scenarios, and I suspect this is fairly rare to begin with.
- And while many of these migrate_disable() regions are unbound in theory, in practise they're often fairly reasonable.
So my current todo list is:
- Change RT PULL - Change DL PULL - Add migrate_disable() tracer; exactly like preempt/irqoff, except measuring task-runtime instead of cpu-time. - Add a mode that measures actual interference. - Add a traceevent to detect preemption in migrate_disable().
And then I suppose I should twist Daniel's arm to update his model to include these scenarios and numbers.
On 9/24/20 10:27 AM, peterz@infradead.org wrote:
So my current todo list is:
- Change RT PULL
- Change DL PULL
- Add migrate_disable() tracer; exactly like preempt/irqoff, except measuring task-runtime instead of cpu-time.
- Add a mode that measures actual interference.
- Add a traceevent to detect preemption in migrate_disable().
And then I suppose I should twist Daniel's arm to update his model to include these scenarios and numbers.
Challenge accepted :-)
-- Daniel
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
Alternatively this could of course be solved with per CPU page tables which will come around some day anyway I fear.
Previously (with PTI) we looked at making the entire kernel map per-CPU, and that takes a 2K copy on switch_mm() (or more general, the user part of whatever the top level directory is for architectures that have a shared kernel/user page-table setup in the first place).
The idea was having a fixed per-cpu kernel page-table, share a bunch of (kernel) page-tables between all CPUs and then copy in the user part on switch.
I've forgotten what the plan was for ASID/PCID in that scheme.
For x86_64 we've been fearing the performance of that 2k copy, but I don't think we've ever actually bit the bullet and implemented it to see how bad it really is.
On Wed, Sep 23 2020 at 12:19, peterz wrote:
On Mon, Sep 21, 2020 at 09:27:57PM +0200, Thomas Gleixner wrote:
Alternatively this could of course be solved with per CPU page tables which will come around some day anyway I fear.
Previously (with PTI) we looked at making the entire kernel map per-CPU, and that takes a 2K copy on switch_mm() (or more general, the user part of whatever the top level directory is for architectures that have a shared kernel/user page-table setup in the first place).
The idea was having a fixed per-cpu kernel page-table, share a bunch of (kernel) page-tables between all CPUs and then copy in the user part on switch.
I've forgotten what the plan was for ASID/PCID in that scheme.
For x86_64 we've been fearing the performance of that 2k copy, but I don't think we've ever actually bit the bullet and implemented it to see how bad it really is.
I actually did at some point and depending on the workload the overhead was clearly measurable. And yes, it fell apart with PCID and I could not come up with a scheme for it which did not suck horribly. So I burried the patches in the poison cabinet.
Aside of that, we'd need to implement that for a eight other architectures as well...
Thanks,
tglx
On Mon, Sep 21 2020 at 21:27, Thomas Gleixner wrote:
On Mon, Sep 21 2020 at 09:24, Linus Torvalds wrote:
On Mon, Sep 21, 2020 at 12:39 AM Thomas Gleixner tglx@linutronix.de wrote: Maybe we really *could* call this new kmap functionality something like "kmap_percpu()" (or maybe "local" is good enough), and make it act like your RT code does for spinlocks - not disable preemption, but only disabling CPU migration.
I"m all for it, but the scheduler people have opinions :)
I just took the latest version of migrate disable patches
https://lore.kernel.org/r/20200921163557.234036895@infradead.org
removed the RT dependency on top of them and adopted the kmap stuff (addressing the various comments while at it and unbreaking ARM).
I'm not going to post that until there is consensus about the general availability of migrate disable, but for those who want to play with it I've pushed the resulting combo out to:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git highmem
For testing I've tweaked a few places to use the new _local() variants and it survived testing so far and I've verified that there is actual preemption which means zap/restore of the thread local kmaps.
Thanks,
tglx
dri-devel@lists.freedesktop.org