On 29.07.19 at 11:51, kernel test robot wrote:
Greetings,
FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:
commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM performance. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
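To make that concrete, here is a rough user-space sketch of the two update strategies. It only illustrates the pattern: mmap()/munmap() stand in for vmapping/vunmapping the BO, and the sizes and helper names are made up for illustration; this is not the actual fbdev emulation code.

/*
 * Illustration only: mmap()/munmap() stand in for vmapping/vunmapping the
 * fbdev BO, and the names and sizes are invented for this sketch.
 */
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>

#define FB_SIZE   (1024 * 768 * 4)	/* assumed framebuffer size */
#define LINE_SIZE (1024 * 4)		/* assumed bytes per scanline */

/* What the generic emulation reportedly does per flush (per the
 * description above): map the BO, copy the damaged lines from the
 * shadow FB, then unmap the BO again right away. */
static void flush_map_unmap_each_time(const uint8_t *shadow,
				      size_t first_line, size_t num_lines)
{
	uint8_t *fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (fb == MAP_FAILED)
		return;
	memcpy(fb + first_line * LINE_SIZE,
	       shadow + first_line * LINE_SIZE,
	       num_lines * LINE_SIZE);
	munmap(fb, FB_SIZE);	/* per-flush map/unmap churn happens here */
}

/* One possible alternative, purely for illustration: keep a single
 * mapping around while fbdev is in use and only do the blit per flush. */
static void flush_keep_mapping(uint8_t *fb_mapped, const uint8_t *shadow,
			       size_t first_line, size_t num_lines)
{
	memcpy(fb_mapped + first_line * LINE_SIZE,
	       shadow + first_line * LINE_SIZE,
	       num_lines * LINE_SIZE);
}

int main(void)
{
	uint8_t *shadow = calloc(1, FB_SIZE);	/* shadow FB in system RAM */
	uint8_t *fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;

	if (!shadow || fb == MAP_FAILED)
		return 1;

	/* a burst of small console updates, done both ways */
	for (i = 0; i < 1000; i++) {
		flush_map_unmap_each_time(shadow, i % 700, 16);
		flush_keep_mapping(fb, shadow, i % 700, 16);
	}

	munmap(fb, FB_SIZE);
	free(shadow);
	return 0;
}

The second variant only shows where the per-flush overhead would go away; whether keeping such a mapping is workable for the VRAM-based ast and mgag200 is the part that would need the dedicated emulation mentioned above.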
Best regards
Thomas
in testcase: vm-scalability
on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
with following parameters:

	runtime: 300s
	size: 8T
	test: anon-cow-seq-hugetlb
	cpufreq_governor: performance

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml  # job file is attached in this email
        bin/lkp run job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
  gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
commit:
  f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
  90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          2:4           -50%            :4     dmesg.WARNING:at#for_ip_interrupt_entry/0x
           :4            25%           1:4     dmesg.WARNING:at_ip___perf_sw_event/0x
           :4            25%           1:4     dmesg.WARNING:at_ip__fsnotify_parent/0x
         %stddev     %change         %stddev
             \          |                \
     43955 ±  2%     -18.8%      35691        vm-scalability.median
      0.06 ±  7%    +193.0%       0.16 ±  2%  vm-scalability.median_stddev
14906559 ± 2% -17.9% 12237079 vm-scalability.throughput 87651 ± 2% -17.4% 72374 vm-scalability.time.involuntary_context_switches 2086168 -23.6% 1594224 vm-scalability.time.minor_page_faults 15082 ± 2% -10.4% 13517 vm-scalability.time.percent_of_cpu_this_job_got 29987 -8.9% 27327 vm-scalability.time.system_time 15755 -12.4% 13795 vm-scalability.time.user_time 122011 -19.3% 98418 vm-scalability.time.voluntary_context_switches 3.034e+09 -23.6% 2.318e+09 vm-scalability.workload 242478 ± 12% +68.5% 408518 ± 23% cpuidle.POLL.time 2788 ± 21% +117.4% 6062 ± 26% cpuidle.POLL.usage 56653 ± 10% +64.4% 93144 ± 20% meminfo.Mapped 120392 ± 7% +14.0% 137212 ± 4% meminfo.Shmem 47221 ± 11% +77.1% 83634 ± 22% numa-meminfo.node0.Mapped 120465 ± 7% +13.9% 137205 ± 4% numa-meminfo.node0.Shmem 2885513 -16.5% 2409384 numa-numastat.node0.local_node 2885471 -16.5% 2409354 numa-numastat.node0.numa_hit 11813 ± 11% +76.3% 20824 ± 22% numa-vmstat.node0.nr_mapped 30096 ± 7% +13.8% 34238 ± 4% numa-vmstat.node0.nr_shmem 43.72 ± 2% +5.5 49.20 mpstat.cpu.all.idle% 0.03 ± 4% +0.0 0.05 ± 6% mpstat.cpu.all.soft% 19.51 -2.4 17.08 mpstat.cpu.all.usr% 1012 -7.9% 932.75 turbostat.Avg_MHz 32.38 ± 10% +25.8% 40.73 turbostat.CPU%c1 145.51 -3.1% 141.01 turbostat.PkgWatt 15.09 -19.2% 12.19 turbostat.RAMWatt 43.50 ± 2% +13.2% 49.25 vmstat.cpu.id 18.75 ± 2% -13.3% 16.25 ± 2% vmstat.cpu.us 152.00 ± 2% -9.5% 137.50 vmstat.procs.r 4800 -13.1% 4173 vmstat.system.cs 156170 -11.9% 137594 slabinfo.anon_vma.active_objs 3395 -11.9% 2991 slabinfo.anon_vma.active_slabs 156190 -11.9% 137606 slabinfo.anon_vma.num_objs 3395 -11.9% 2991 slabinfo.anon_vma.num_slabs 1716 ± 5% +11.5% 1913 ± 8% slabinfo.dmaengine-unmap-16.active_objs 1716 ± 5% +11.5% 1913 ± 8% slabinfo.dmaengine-unmap-16.num_objs 1767 ± 2% -19.0% 1431 ± 2% slabinfo.hugetlbfs_inode_cache.active_objs 1767 ± 2% -19.0% 1431 ± 2% slabinfo.hugetlbfs_inode_cache.num_objs 3597 ± 5% -16.4% 3006 ± 3% slabinfo.skbuff_ext_cache.active_objs 3597 ± 5% -16.4% 3006 ± 3% slabinfo.skbuff_ext_cache.num_objs 1330122 -23.6% 1016557 proc-vmstat.htlb_buddy_alloc_success 77214 ± 3% +6.4% 82128 ± 2% proc-vmstat.nr_active_anon 67277 +2.9% 69246 proc-vmstat.nr_anon_pages 218.50 ± 3% -10.6% 195.25 proc-vmstat.nr_dirtied 288628 +1.4% 292755 proc-vmstat.nr_file_pages 360.50 -2.7% 350.75 proc-vmstat.nr_inactive_file 14225 ± 9% +63.8% 23304 ± 20% proc-vmstat.nr_mapped 30109 ± 7% +13.8% 34259 ± 4% proc-vmstat.nr_shmem 99870 -1.3% 98597 proc-vmstat.nr_slab_unreclaimable 204.00 ± 4% -12.1% 179.25 proc-vmstat.nr_written 77214 ± 3% +6.4% 82128 ± 2% proc-vmstat.nr_zone_active_anon 360.50 -2.7% 350.75 proc-vmstat.nr_zone_inactive_file 8810 ± 19% -66.1% 2987 ± 42% proc-vmstat.numa_hint_faults 8810 ± 19% -66.1% 2987 ± 42% proc-vmstat.numa_hint_faults_local 2904082 -16.4% 2427026 proc-vmstat.numa_hit 2904081 -16.4% 2427025 proc-vmstat.numa_local 6.828e+08 -23.5% 5.221e+08 proc-vmstat.pgalloc_normal 2900008 -17.2% 2400195 proc-vmstat.pgfault 6.827e+08 -23.5% 5.22e+08 proc-vmstat.pgfree 1.635e+10 -17.0% 1.357e+10 perf-stat.i.branch-instructions 1.53 ± 4% -0.1 1.45 ± 3% perf-stat.i.branch-miss-rate% 2.581e+08 ± 3% -20.5% 2.051e+08 ± 2% perf-stat.i.branch-misses 12.66 +1.1 13.78 perf-stat.i.cache-miss-rate% 72720849 -12.0% 63958986 perf-stat.i.cache-misses 5.766e+08 -18.6% 4.691e+08 perf-stat.i.cache-references 4674 ± 2% -13.0% 4064 perf-stat.i.context-switches 4.29 +12.5% 4.83 perf-stat.i.cpi 2.573e+11 -7.4% 2.383e+11 perf-stat.i.cpu-cycles 231.35 -21.5% 181.56 perf-stat.i.cpu-migrations 3522 +4.4% 3677 
perf-stat.i.cycles-between-cache-misses 0.09 ± 13% +0.0 0.12 ± 5% perf-stat.i.iTLB-load-miss-rate% 5.894e+10 -15.8% 4.961e+10 perf-stat.i.iTLB-loads 5.901e+10 -15.8% 4.967e+10 perf-stat.i.instructions 1291 ± 14% -21.8% 1010 perf-stat.i.instructions-per-iTLB-miss 0.24 -11.0% 0.21 perf-stat.i.ipc 9476 -17.5% 7821 perf-stat.i.minor-faults 9478 -17.5% 7821 perf-stat.i.page-faults 9.76 -3.6% 9.41 perf-stat.overall.MPKI 1.59 ± 4% -0.1 1.52 perf-stat.overall.branch-miss-rate% 12.61 +1.1 13.71 perf-stat.overall.cache-miss-rate% 4.38 +10.5% 4.83 perf-stat.overall.cpi 3557 +5.3% 3747 perf-stat.overall.cycles-between-cache-misses 0.08 ± 12% +0.0 0.10 perf-stat.overall.iTLB-load-miss-rate% 1268 ± 15% -23.0% 976.22 perf-stat.overall.instructions-per-iTLB-miss 0.23 -9.5% 0.21 perf-stat.overall.ipc 5815 +9.7% 6378 perf-stat.overall.path-length 1.634e+10 -17.5% 1.348e+10 perf-stat.ps.branch-instructions 2.595e+08 ± 3% -21.2% 2.043e+08 ± 2% perf-stat.ps.branch-misses 72565205 -12.2% 63706339 perf-stat.ps.cache-misses 5.754e+08 -19.2% 4.646e+08 perf-stat.ps.cache-references 4640 ± 2% -12.5% 4060 perf-stat.ps.context-switches 2.581e+11 -7.5% 2.387e+11 perf-stat.ps.cpu-cycles 229.91 -22.0% 179.42 perf-stat.ps.cpu-migrations 5.889e+10 -16.3% 4.927e+10 perf-stat.ps.iTLB-loads 5.899e+10 -16.3% 4.938e+10 perf-stat.ps.instructions 9388 -18.2% 7677 perf-stat.ps.minor-faults 9389 -18.2% 7677 perf-stat.ps.page-faults 1.764e+13 -16.2% 1.479e+13 perf-stat.total.instructions 46803 ± 3% -18.8% 37982 ± 6% sched_debug.cfs_rq:/.exec_clock.min 5320 ± 3% +23.7% 6581 ± 3% sched_debug.cfs_rq:/.exec_clock.stddev 6737 ± 14% +58.1% 10649 ± 10% sched_debug.cfs_rq:/.load.avg 587978 ± 17% +58.2% 930382 ± 9% sched_debug.cfs_rq:/.load.max 46952 ± 16% +64.8% 77388 ± 11% sched_debug.cfs_rq:/.load.stddev 7.12 ± 4% +49.1% 10.62 ± 6% sched_debug.cfs_rq:/.load_avg.avg 474.40 ± 23% +67.5% 794.60 ± 10% sched_debug.cfs_rq:/.load_avg.max 37.70 ± 11% +74.8% 65.90 ± 9% sched_debug.cfs_rq:/.load_avg.stddev 13424269 ± 4% -15.6% 11328098 ± 2% sched_debug.cfs_rq:/.min_vruntime.avg 15411275 ± 3% -12.4% 13505072 ± 2% sched_debug.cfs_rq:/.min_vruntime.max 7939295 ± 6% -17.5% 6551322 ± 7% sched_debug.cfs_rq:/.min_vruntime.min 21.44 ± 7% -56.1% 9.42 ± 4% sched_debug.cfs_rq:/.nr_spread_over.avg 117.45 ± 11% -60.6% 46.30 ± 14% sched_debug.cfs_rq:/.nr_spread_over.max 19.33 ± 8% -66.4% 6.49 ± 9% sched_debug.cfs_rq:/.nr_spread_over.stddev 4.32 ± 15% +84.4% 7.97 ± 3% sched_debug.cfs_rq:/.runnable_load_avg.avg 353.85 ± 29% +118.8% 774.35 ± 11% sched_debug.cfs_rq:/.runnable_load_avg.max 27.30 ± 24% +118.5% 59.64 ± 9% sched_debug.cfs_rq:/.runnable_load_avg.stddev 6729 ± 14% +58.2% 10644 ± 10% sched_debug.cfs_rq:/.runnable_weight.avg 587978 ± 17% +58.2% 930382 ± 9% sched_debug.cfs_rq:/.runnable_weight.max 46950 ± 16% +64.8% 77387 ± 11% sched_debug.cfs_rq:/.runnable_weight.stddev 5305069 ± 4% -17.4% 4380376 ± 7% sched_debug.cfs_rq:/.spread0.avg 7328745 ± 3% -9.9% 6600897 ± 3% sched_debug.cfs_rq:/.spread0.max 2220837 ± 4% +55.8% 3460596 ± 5% sched_debug.cpu.avg_idle.avg 4590666 ± 9% +76.8% 8117037 ± 15% sched_debug.cpu.avg_idle.max 485052 ± 7% +80.3% 874679 ± 10% sched_debug.cpu.avg_idle.stddev 561.50 ± 26% +37.7% 773.30 ± 15% sched_debug.cpu.clock.stddev 561.50 ± 26% +37.7% 773.30 ± 15% sched_debug.cpu.clock_task.stddev 3.20 ± 10% +109.6% 6.70 ± 3% sched_debug.cpu.cpu_load[0].avg 309.10 ± 20% +150.3% 773.75 ± 12% sched_debug.cpu.cpu_load[0].max 21.02 ± 14% +160.8% 54.80 ± 9% sched_debug.cpu.cpu_load[0].stddev 3.19 ± 8% +109.8% 6.70 ± 3% 
sched_debug.cpu.cpu_load[1].avg 299.75 ± 19% +158.0% 773.30 ± 12% sched_debug.cpu.cpu_load[1].max 20.32 ± 12% +168.7% 54.62 ± 9% sched_debug.cpu.cpu_load[1].stddev 3.20 ± 8% +109.1% 6.69 ± 4% sched_debug.cpu.cpu_load[2].avg 288.90 ± 20% +167.0% 771.40 ± 12% sched_debug.cpu.cpu_load[2].max 19.70 ± 12% +175.4% 54.27 ± 9% sched_debug.cpu.cpu_load[2].stddev 3.16 ± 8% +110.9% 6.66 ± 6% sched_debug.cpu.cpu_load[3].avg 275.50 ± 24% +178.4% 766.95 ± 12% sched_debug.cpu.cpu_load[3].max 18.92 ± 15% +184.2% 53.77 ± 10% sched_debug.cpu.cpu_load[3].stddev 3.08 ± 8% +115.7% 6.65 ± 7% sched_debug.cpu.cpu_load[4].avg 263.55 ± 28% +188.7% 760.85 ± 12% sched_debug.cpu.cpu_load[4].max 18.03 ± 18% +196.6% 53.46 ± 11% sched_debug.cpu.cpu_load[4].stddev 14543 -9.6% 13150 sched_debug.cpu.curr->pid.max 5293 ± 16% +74.7% 9248 ± 11% sched_debug.cpu.load.avg 587978 ± 17% +58.2% 930382 ± 9% sched_debug.cpu.load.max 40887 ± 19% +78.3% 72891 ± 9% sched_debug.cpu.load.stddev 1141679 ± 4% +56.9% 1790907 ± 5% sched_debug.cpu.max_idle_balance_cost.avg 2432100 ± 9% +72.6% 4196779 ± 13% sched_debug.cpu.max_idle_balance_cost.max 745656 +29.3% 964170 ± 5% sched_debug.cpu.max_idle_balance_cost.min 239032 ± 9% +81.9% 434806 ± 10% sched_debug.cpu.max_idle_balance_cost.stddev 0.00 ± 27% +92.1% 0.00 ± 31% sched_debug.cpu.next_balance.stddev 1030 ± 4% -10.4% 924.00 ± 2% sched_debug.cpu.nr_switches.min 0.04 ± 26% +139.0% 0.09 ± 41% sched_debug.cpu.nr_uninterruptible.avg 830.35 ± 6% -12.0% 730.50 ± 2% sched_debug.cpu.sched_count.min 912.00 ± 2% -9.5% 825.38 sched_debug.cpu.ttwu_count.avg 433.05 ± 3% -19.2% 350.05 ± 3% sched_debug.cpu.ttwu_count.min 160.70 ± 3% -12.5% 140.60 ± 4% sched_debug.cpu.ttwu_local.min 9072 ± 11% -36.4% 5767 ± 8% softirqs.CPU1.RCU 12769 ± 5% +15.3% 14718 ± 3% softirqs.CPU101.SCHED 13198 +11.5% 14717 ± 3% softirqs.CPU102.SCHED 12981 ± 4% +13.9% 14788 ± 3% softirqs.CPU105.SCHED 13486 ± 3% +11.8% 15071 ± 4% softirqs.CPU111.SCHED 12794 ± 4% +14.1% 14601 ± 9% softirqs.CPU112.SCHED 12999 ± 4% +10.1% 14314 ± 4% softirqs.CPU115.SCHED 12844 ± 4% +10.6% 14202 ± 2% softirqs.CPU120.SCHED 13336 ± 3% +9.4% 14585 ± 3% softirqs.CPU122.SCHED 12639 ± 4% +20.2% 15195 softirqs.CPU123.SCHED 13040 ± 5% +15.2% 15024 ± 5% softirqs.CPU126.SCHED 13123 +15.1% 15106 ± 5% softirqs.CPU127.SCHED 9188 ± 6% -35.7% 5911 ± 2% softirqs.CPU13.RCU 13054 ± 3% +13.1% 14761 ± 5% softirqs.CPU130.SCHED 13158 ± 2% +13.9% 14985 ± 5% softirqs.CPU131.SCHED 12797 ± 6% +13.5% 14524 ± 3% softirqs.CPU133.SCHED 12452 ± 5% +14.8% 14297 softirqs.CPU134.SCHED 13078 ± 3% +10.4% 14439 ± 3% softirqs.CPU138.SCHED 12617 ± 2% +14.5% 14442 ± 5% softirqs.CPU139.SCHED 12974 ± 3% +13.7% 14752 ± 4% softirqs.CPU142.SCHED 12579 ± 4% +19.1% 14983 ± 3% softirqs.CPU143.SCHED 9122 ± 24% -44.6% 5053 ± 5% softirqs.CPU144.RCU 13366 ± 2% +11.1% 14848 ± 3% softirqs.CPU149.SCHED 13246 ± 2% +22.0% 16162 ± 7% softirqs.CPU150.SCHED 13452 ± 3% +20.5% 16210 ± 7% softirqs.CPU151.SCHED 13507 +10.1% 14869 softirqs.CPU156.SCHED 13808 ± 3% +9.2% 15079 ± 4% softirqs.CPU157.SCHED 13442 ± 2% +13.4% 15248 ± 4% softirqs.CPU160.SCHED 13311 +12.1% 14920 ± 2% softirqs.CPU162.SCHED 13544 ± 3% +8.5% 14695 ± 4% softirqs.CPU163.SCHED 13648 ± 3% +11.2% 15179 ± 2% softirqs.CPU166.SCHED 13404 ± 4% +12.5% 15079 ± 3% softirqs.CPU168.SCHED 13421 ± 6% +16.0% 15568 ± 8% softirqs.CPU169.SCHED 13115 ± 3% +23.1% 16139 ± 10% softirqs.CPU171.SCHED 13424 ± 6% +10.4% 14822 ± 3% softirqs.CPU175.SCHED 13274 ± 3% +13.7% 15087 ± 9% softirqs.CPU185.SCHED 13409 ± 3% +12.3% 15063 ± 3% softirqs.CPU190.SCHED 13181 ± 7% +13.4% 14946 
± 3% softirqs.CPU196.SCHED 13578 ± 3% +10.9% 15061 softirqs.CPU197.SCHED 13323 ± 5% +24.8% 16627 ± 6% softirqs.CPU198.SCHED 14072 ± 2% +12.3% 15798 ± 7% softirqs.CPU199.SCHED 12604 ± 13% +17.9% 14865 softirqs.CPU201.SCHED 13380 ± 4% +14.8% 15356 ± 3% softirqs.CPU203.SCHED 13481 ± 8% +14.2% 15390 ± 3% softirqs.CPU204.SCHED 12921 ± 2% +13.8% 14710 ± 3% softirqs.CPU206.SCHED 13468 +13.0% 15218 ± 2% softirqs.CPU208.SCHED 13253 ± 2% +13.1% 14992 softirqs.CPU209.SCHED 13319 ± 2% +14.3% 15225 ± 7% softirqs.CPU210.SCHED 13673 ± 5% +16.3% 15895 ± 3% softirqs.CPU211.SCHED 13290 +17.0% 15556 ± 5% softirqs.CPU212.SCHED 13455 ± 4% +14.4% 15392 ± 3% softirqs.CPU213.SCHED 13454 ± 4% +14.3% 15377 ± 3% softirqs.CPU215.SCHED 13872 ± 7% +9.7% 15221 ± 5% softirqs.CPU220.SCHED 13555 ± 4% +17.3% 15896 ± 5% softirqs.CPU222.SCHED 13411 ± 4% +20.8% 16197 ± 6% softirqs.CPU223.SCHED 8472 ± 21% -44.8% 4680 ± 3% softirqs.CPU224.RCU 13141 ± 3% +16.2% 15265 ± 7% softirqs.CPU225.SCHED 14084 ± 3% +8.2% 15242 ± 2% softirqs.CPU226.SCHED 13528 ± 4% +11.3% 15063 ± 4% softirqs.CPU228.SCHED 13218 ± 3% +16.3% 15377 ± 4% softirqs.CPU229.SCHED 14031 ± 4% +10.2% 15467 ± 2% softirqs.CPU231.SCHED 13770 ± 3% +14.0% 15700 ± 3% softirqs.CPU232.SCHED 13456 ± 3% +12.3% 15105 ± 3% softirqs.CPU233.SCHED 13137 ± 4% +13.5% 14909 ± 3% softirqs.CPU234.SCHED 13318 ± 2% +14.7% 15280 ± 2% softirqs.CPU235.SCHED 13690 ± 2% +13.7% 15563 ± 7% softirqs.CPU238.SCHED 13771 ± 5% +20.8% 16634 ± 7% softirqs.CPU241.SCHED 13317 ± 7% +19.5% 15919 ± 9% softirqs.CPU243.SCHED 8234 ± 16% -43.9% 4616 ± 5% softirqs.CPU244.RCU 13845 ± 6% +13.0% 15643 ± 3% softirqs.CPU244.SCHED 13179 ± 3% +16.3% 15323 softirqs.CPU246.SCHED 13754 +12.2% 15438 ± 3% softirqs.CPU248.SCHED 13769 ± 4% +10.9% 15276 ± 2% softirqs.CPU252.SCHED 13702 +10.5% 15147 ± 2% softirqs.CPU254.SCHED 13315 ± 2% +12.5% 14980 ± 3% softirqs.CPU255.SCHED 13785 ± 3% +12.9% 15568 ± 5% softirqs.CPU256.SCHED 13307 ± 3% +15.0% 15298 ± 3% softirqs.CPU257.SCHED 13864 ± 3% +10.5% 15313 ± 2% softirqs.CPU259.SCHED 13879 ± 2% +11.4% 15465 softirqs.CPU261.SCHED 13815 +13.6% 15687 ± 5% softirqs.CPU264.SCHED 119574 ± 2% +11.8% 133693 ± 11% softirqs.CPU266.TIMER 13688 +10.9% 15180 ± 6% softirqs.CPU267.SCHED 11716 ± 4% +19.3% 13974 ± 8% softirqs.CPU27.SCHED 13866 ± 3% +13.7% 15765 ± 4% softirqs.CPU271.SCHED 13887 ± 5% +12.5% 15621 softirqs.CPU272.SCHED 13383 ± 3% +19.8% 16031 ± 2% softirqs.CPU274.SCHED 13347 +14.1% 15232 ± 3% softirqs.CPU275.SCHED 12884 ± 2% +21.0% 15593 ± 4% softirqs.CPU276.SCHED 13131 ± 5% +13.4% 14891 ± 5% softirqs.CPU277.SCHED 12891 ± 2% +19.2% 15371 ± 4% softirqs.CPU278.SCHED 13313 ± 4% +13.0% 15049 ± 2% softirqs.CPU279.SCHED 13514 ± 3% +10.2% 14897 ± 2% softirqs.CPU280.SCHED 13501 ± 3% +13.7% 15346 softirqs.CPU281.SCHED 13261 +17.5% 15577 softirqs.CPU282.SCHED 8076 ± 15% -43.7% 4546 ± 5% softirqs.CPU283.RCU 13686 ± 3% +12.6% 15413 ± 2% softirqs.CPU284.SCHED 13439 ± 2% +9.2% 14670 ± 4% softirqs.CPU285.SCHED 8878 ± 9% -35.4% 5735 ± 4% softirqs.CPU35.RCU 11690 ± 2% +13.6% 13274 ± 5% softirqs.CPU40.SCHED 11714 ± 2% +19.3% 13975 ± 13% softirqs.CPU41.SCHED 11763 +12.5% 13239 ± 4% softirqs.CPU45.SCHED 11662 ± 2% +9.4% 12757 ± 3% softirqs.CPU46.SCHED 11805 ± 2% +9.3% 12902 ± 2% softirqs.CPU50.SCHED 12158 ± 3% +12.3% 13655 ± 8% softirqs.CPU55.SCHED 11716 ± 4% +8.8% 12751 ± 3% softirqs.CPU58.SCHED 11922 ± 2% +9.9% 13100 ± 4% softirqs.CPU64.SCHED 9674 ± 17% -41.8% 5625 ± 6% softirqs.CPU66.RCU 11818 +12.0% 13237 softirqs.CPU66.SCHED 124682 ± 7% -6.1% 117088 ± 5% softirqs.CPU66.TIMER 8637 ± 9% -34.0% 5700 ± 7% 
softirqs.CPU70.RCU 11624 ± 2% +11.0% 12901 ± 2% softirqs.CPU70.SCHED 12372 ± 2% +13.2% 14003 ± 3% softirqs.CPU71.SCHED 9949 ± 25% -33.9% 6574 ± 31% softirqs.CPU72.RCU 10392 ± 26% -35.1% 6745 ± 35% softirqs.CPU73.RCU 12766 ± 3% +11.1% 14188 ± 3% softirqs.CPU76.SCHED 12611 ± 2% +18.8% 14984 ± 5% softirqs.CPU78.SCHED 12786 ± 3% +17.9% 15079 ± 7% softirqs.CPU79.SCHED 11947 ± 4% +9.7% 13103 ± 4% softirqs.CPU8.SCHED 13379 ± 7% +11.8% 14962 ± 4% softirqs.CPU83.SCHED 13438 ± 5% +9.7% 14738 ± 2% softirqs.CPU84.SCHED 12768 +19.4% 15241 ± 6% softirqs.CPU88.SCHED 8604 ± 13% -39.3% 5222 ± 3% softirqs.CPU89.RCU 13077 ± 2% +17.1% 15308 ± 7% softirqs.CPU89.SCHED 11887 ± 3% +20.1% 14272 ± 5% softirqs.CPU9.SCHED 12723 ± 3% +11.3% 14165 ± 4% softirqs.CPU90.SCHED 8439 ± 12% -38.9% 5153 ± 4% softirqs.CPU91.RCU 13429 ± 3% +10.3% 14806 ± 2% softirqs.CPU95.SCHED 12852 ± 4% +10.3% 14174 ± 5% softirqs.CPU96.SCHED 13010 ± 2% +14.4% 14888 ± 5% softirqs.CPU97.SCHED 2315644 ± 4% -36.2% 1477200 ± 4% softirqs.RCU 1572 ± 10% +63.9% 2578 ± 39% interrupts.CPU0.NMI:Non-maskable_interrupts 1572 ± 10% +63.9% 2578 ± 39% interrupts.CPU0.PMI:Performance_monitoring_interrupts 252.00 ± 11% -35.2% 163.25 ± 13% interrupts.CPU104.RES:Rescheduling_interrupts 2738 ± 24% +52.4% 4173 ± 19% interrupts.CPU105.NMI:Non-maskable_interrupts 2738 ± 24% +52.4% 4173 ± 19% interrupts.CPU105.PMI:Performance_monitoring_interrupts 245.75 ± 19% -31.0% 169.50 ± 7% interrupts.CPU105.RES:Rescheduling_interrupts 228.75 ± 13% -24.7% 172.25 ± 19% interrupts.CPU106.RES:Rescheduling_interrupts 2243 ± 15% +66.3% 3730 ± 35% interrupts.CPU113.NMI:Non-maskable_interrupts 2243 ± 15% +66.3% 3730 ± 35% interrupts.CPU113.PMI:Performance_monitoring_interrupts 2703 ± 31% +67.0% 4514 ± 33% interrupts.CPU118.NMI:Non-maskable_interrupts 2703 ± 31% +67.0% 4514 ± 33% interrupts.CPU118.PMI:Performance_monitoring_interrupts 2613 ± 25% +42.2% 3715 ± 24% interrupts.CPU121.NMI:Non-maskable_interrupts 2613 ± 25% +42.2% 3715 ± 24% interrupts.CPU121.PMI:Performance_monitoring_interrupts 311.50 ± 23% -47.7% 163.00 ± 9% interrupts.CPU122.RES:Rescheduling_interrupts 266.75 ± 19% -31.6% 182.50 ± 15% interrupts.CPU124.RES:Rescheduling_interrupts 293.75 ± 33% -32.3% 198.75 ± 19% interrupts.CPU125.RES:Rescheduling_interrupts 2601 ± 36% +43.2% 3724 ± 29% interrupts.CPU127.NMI:Non-maskable_interrupts 2601 ± 36% +43.2% 3724 ± 29% interrupts.CPU127.PMI:Performance_monitoring_interrupts 2258 ± 21% +68.2% 3797 ± 29% interrupts.CPU13.NMI:Non-maskable_interrupts 2258 ± 21% +68.2% 3797 ± 29% interrupts.CPU13.PMI:Performance_monitoring_interrupts 3338 ± 29% +54.6% 5160 ± 9% interrupts.CPU139.NMI:Non-maskable_interrupts 3338 ± 29% +54.6% 5160 ± 9% interrupts.CPU139.PMI:Performance_monitoring_interrupts 219.50 ± 27% -23.0% 169.00 ± 21% interrupts.CPU139.RES:Rescheduling_interrupts 290.25 ± 25% -32.5% 196.00 ± 11% interrupts.CPU14.RES:Rescheduling_interrupts 243.50 ± 4% -16.0% 204.50 ± 12% interrupts.CPU140.RES:Rescheduling_interrupts 1797 ± 15% +135.0% 4223 ± 46% interrupts.CPU147.NMI:Non-maskable_interrupts 1797 ± 15% +135.0% 4223 ± 46% interrupts.CPU147.PMI:Performance_monitoring_interrupts 2537 ± 22% +89.6% 4812 ± 28% interrupts.CPU15.NMI:Non-maskable_interrupts 2537 ± 22% +89.6% 4812 ± 28% interrupts.CPU15.PMI:Performance_monitoring_interrupts 292.25 ± 34% -33.9% 193.25 ± 6% interrupts.CPU15.RES:Rescheduling_interrupts 424.25 ± 37% -58.5% 176.25 ± 14% interrupts.CPU158.RES:Rescheduling_interrupts 312.50 ± 42% -54.2% 143.00 ± 18% interrupts.CPU159.RES:Rescheduling_interrupts 725.00 ±118% -75.7% 
176.25 ± 14% interrupts.CPU163.RES:Rescheduling_interrupts 2367 ± 6% +59.9% 3786 ± 24% interrupts.CPU177.NMI:Non-maskable_interrupts 2367 ± 6% +59.9% 3786 ± 24% interrupts.CPU177.PMI:Performance_monitoring_interrupts 239.50 ± 30% -46.6% 128.00 ± 14% interrupts.CPU179.RES:Rescheduling_interrupts 320.75 ± 15% -24.0% 243.75 ± 20% interrupts.CPU20.RES:Rescheduling_interrupts 302.50 ± 17% -47.2% 159.75 ± 8% interrupts.CPU200.RES:Rescheduling_interrupts 2166 ± 5% +92.0% 4157 ± 40% interrupts.CPU207.NMI:Non-maskable_interrupts 2166 ± 5% +92.0% 4157 ± 40% interrupts.CPU207.PMI:Performance_monitoring_interrupts 217.00 ± 11% -34.6% 142.00 ± 12% interrupts.CPU214.RES:Rescheduling_interrupts 2610 ± 36% +47.4% 3848 ± 35% interrupts.CPU215.NMI:Non-maskable_interrupts 2610 ± 36% +47.4% 3848 ± 35% interrupts.CPU215.PMI:Performance_monitoring_interrupts 2046 ± 13% +118.6% 4475 ± 43% interrupts.CPU22.NMI:Non-maskable_interrupts 2046 ± 13% +118.6% 4475 ± 43% interrupts.CPU22.PMI:Performance_monitoring_interrupts 289.50 ± 28% -41.1% 170.50 ± 8% interrupts.CPU22.RES:Rescheduling_interrupts 2232 ± 6% +33.0% 2970 ± 24% interrupts.CPU221.NMI:Non-maskable_interrupts 2232 ± 6% +33.0% 2970 ± 24% interrupts.CPU221.PMI:Performance_monitoring_interrupts 4552 ± 12% -27.6% 3295 ± 15% interrupts.CPU222.NMI:Non-maskable_interrupts 4552 ± 12% -27.6% 3295 ± 15% interrupts.CPU222.PMI:Performance_monitoring_interrupts 2013 ± 15% +80.9% 3641 ± 27% interrupts.CPU226.NMI:Non-maskable_interrupts 2013 ± 15% +80.9% 3641 ± 27% interrupts.CPU226.PMI:Performance_monitoring_interrupts 2575 ± 49% +67.1% 4302 ± 34% interrupts.CPU227.NMI:Non-maskable_interrupts 2575 ± 49% +67.1% 4302 ± 34% interrupts.CPU227.PMI:Performance_monitoring_interrupts 248.00 ± 36% -36.3% 158.00 ± 19% interrupts.CPU228.RES:Rescheduling_interrupts 2441 ± 24% +43.0% 3490 ± 30% interrupts.CPU23.NMI:Non-maskable_interrupts 2441 ± 24% +43.0% 3490 ± 30% interrupts.CPU23.PMI:Performance_monitoring_interrupts 404.25 ± 69% -65.5% 139.50 ± 17% interrupts.CPU236.RES:Rescheduling_interrupts 566.50 ± 40% -73.6% 149.50 ± 31% interrupts.CPU237.RES:Rescheduling_interrupts 243.50 ± 26% -37.1% 153.25 ± 21% interrupts.CPU248.RES:Rescheduling_interrupts 258.25 ± 12% -53.5% 120.00 ± 18% interrupts.CPU249.RES:Rescheduling_interrupts 2888 ± 27% +49.4% 4313 ± 30% interrupts.CPU253.NMI:Non-maskable_interrupts 2888 ± 27% +49.4% 4313 ± 30% interrupts.CPU253.PMI:Performance_monitoring_interrupts 2468 ± 44% +67.3% 4131 ± 37% interrupts.CPU256.NMI:Non-maskable_interrupts 2468 ± 44% +67.3% 4131 ± 37% interrupts.CPU256.PMI:Performance_monitoring_interrupts 425.00 ± 59% -60.3% 168.75 ± 34% interrupts.CPU258.RES:Rescheduling_interrupts 1859 ± 16% +106.3% 3834 ± 44% interrupts.CPU268.NMI:Non-maskable_interrupts 1859 ± 16% +106.3% 3834 ± 44% interrupts.CPU268.PMI:Performance_monitoring_interrupts 2684 ± 28% +61.2% 4326 ± 36% interrupts.CPU269.NMI:Non-maskable_interrupts 2684 ± 28% +61.2% 4326 ± 36% interrupts.CPU269.PMI:Performance_monitoring_interrupts 2171 ± 6% +108.8% 4533 ± 20% interrupts.CPU270.NMI:Non-maskable_interrupts 2171 ± 6% +108.8% 4533 ± 20% interrupts.CPU270.PMI:Performance_monitoring_interrupts 2262 ± 14% +61.8% 3659 ± 37% interrupts.CPU273.NMI:Non-maskable_interrupts 2262 ± 14% +61.8% 3659 ± 37% interrupts.CPU273.PMI:Performance_monitoring_interrupts 2203 ± 11% +50.7% 3320 ± 38% interrupts.CPU279.NMI:Non-maskable_interrupts 2203 ± 11% +50.7% 3320 ± 38% interrupts.CPU279.PMI:Performance_monitoring_interrupts 2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.NMI:Non-maskable_interrupts 
2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.PMI:Performance_monitoring_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.NMI:Non-maskable_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.PMI:Performance_monitoring_interrupts 331.75 ± 32% -39.8% 199.75 ± 17% interrupts.CPU29.RES:Rescheduling_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.NMI:Non-maskable_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.PMI:Performance_monitoring_interrupts 298.50 ± 30% -39.7% 180.00 ± 6% interrupts.CPU34.RES:Rescheduling_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.NMI:Non-maskable_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.PMI:Performance_monitoring_interrupts 270.50 ± 24% -31.1% 186.25 ± 3% interrupts.CPU36.RES:Rescheduling_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.NMI:Non-maskable_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.PMI:Performance_monitoring_interrupts 286.75 ± 36% -32.4% 193.75 ± 7% interrupts.CPU45.RES:Rescheduling_interrupts 259.00 ± 12% -23.6% 197.75 ± 13% interrupts.CPU46.RES:Rescheduling_interrupts 244.00 ± 21% -35.6% 157.25 ± 11% interrupts.CPU47.RES:Rescheduling_interrupts 230.00 ± 7% -21.3% 181.00 ± 11% interrupts.CPU48.RES:Rescheduling_interrupts 281.00 ± 13% -27.4% 204.00 ± 15% interrupts.CPU53.RES:Rescheduling_interrupts 256.75 ± 5% -18.4% 209.50 ± 12% interrupts.CPU54.RES:Rescheduling_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.NMI:Non-maskable_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.PMI:Performance_monitoring_interrupts 316.00 ± 25% -41.4% 185.25 ± 13% interrupts.CPU59.RES:Rescheduling_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.NMI:Non-maskable_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.PMI:Performance_monitoring_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.NMI:Non-maskable_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.PMI:Performance_monitoring_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.NMI:Non-maskable_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.PMI:Performance_monitoring_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.NMI:Non-maskable_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.PMI:Performance_monitoring_interrupts 319.00 ± 40% -44.7% 176.25 ± 9% interrupts.CPU67.RES:Rescheduling_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.NMI:Non-maskable_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.PMI:Performance_monitoring_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.NMI:Non-maskable_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.PMI:Performance_monitoring_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.NMI:Non-maskable_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.PMI:Performance_monitoring_interrupts 426.75 ± 61% -67.7% 138.00 ± 8% interrupts.CPU75.RES:Rescheduling_interrupts 192.50 ± 13% +45.6% 280.25 ± 45% interrupts.CPU76.RES:Rescheduling_interrupts 274.25 ± 34% -42.2% 158.50 ± 34% interrupts.CPU77.RES:Rescheduling_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.NMI:Non-maskable_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.PMI:Performance_monitoring_interrupts 348.50 ± 53% -47.3% 183.75 ± 29% interrupts.CPU80.RES:Rescheduling_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.NMI:Non-maskable_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.PMI:Performance_monitoring_interrupts 2235 ± 10% +117.8% 4867 ± 10% interrupts.CPU90.NMI:Non-maskable_interrupts 2235 ± 10% +117.8% 4867 ± 10% 
interrupts.CPU90.PMI:Performance_monitoring_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.NMI:Non-maskable_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.PMI:Performance_monitoring_interrupts 408.75 ± 58% -56.8% 176.75 ± 25% interrupts.CPU92.RES:Rescheduling_interrupts 399.00 ± 64% -63.6% 145.25 ± 16% interrupts.CPU93.RES:Rescheduling_interrupts 314.75 ± 36% -44.2% 175.75 ± 13% interrupts.CPU94.RES:Rescheduling_interrupts 191.00 ± 15% -29.1% 135.50 ± 9% interrupts.CPU97.RES:Rescheduling_interrupts 94.00 ± 8% +50.0% 141.00 ± 12% interrupts.IWI:IRQ_work_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.NMI:Non-maskable_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.PMI:Performance_monitoring_interrupts 12.75 ± 11% -4.1 8.67 ± 31% perf-profile.calltrace.cycles-pp.do_rw_once 1.02 ± 16% -0.6 0.47 ± 59% perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle 1.10 ± 15% -0.4 0.66 ± 14% perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry 1.05 ± 16% -0.4 0.61 ± 14% perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter 1.58 ± 4% +0.3 1.91 ± 7% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.11 ± 4% +0.5 2.60 ± 7% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.90 ± 5% +0.6 2.45 ± 7% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage 0.65 ± 62% +0.6 1.20 ± 15% perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.60 ± 62% +0.6 1.16 ± 18% perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap 0.95 ± 17% +0.6 1.52 ± 8% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner 0.61 ± 62% +0.6 1.18 ± 18% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group 1.30 ± 9% +0.6 1.92 ± 8% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock 0.19 ±173% +0.7 0.89 ± 20% 
perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu 0.19 ±173% +0.7 0.90 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu 0.00 +0.8 0.77 ± 30% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page 0.00 +0.8 0.78 ± 30% perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page 0.00 +0.8 0.79 ± 29% perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 0.82 ± 67% +0.9 1.72 ± 22% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.84 ± 66% +0.9 1.74 ± 20% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 2.52 ± 6% +0.9 3.44 ± 9% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page 0.83 ± 67% +0.9 1.75 ± 21% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 0.84 ± 66% +0.9 1.77 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 1.64 ± 12% +1.0 2.67 ± 7% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault 1.65 ± 45% +1.3 2.99 ± 18% perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 1.74 ± 13% +1.4 3.16 ± 6% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault 2.56 ± 48% +2.2 4.81 ± 19% perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault 12.64 ± 14% +3.6 16.20 ± 8% perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault 2.97 ± 7% +3.8 6.74 ± 9% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow 19.99 ± 9% +4.1 24.05 ± 6% perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault 1.37 ± 15% -0.5 0.83 ± 13% perf-profile.children.cycles-pp.sched_clock_cpu 1.31 ± 16% -0.5 0.78 ± 13% perf-profile.children.cycles-pp.sched_clock 1.29 ± 16% -0.5 0.77 ± 13% perf-profile.children.cycles-pp.native_sched_clock 1.80 ± 2% -0.3 1.47 ± 10% perf-profile.children.cycles-pp.task_tick_fair 0.73 ± 2% -0.2 0.54 ± 11% perf-profile.children.cycles-pp.update_curr 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.children.cycles-pp.account_process_tick 0.73 ± 10% -0.2 0.58 ± 9% perf-profile.children.cycles-pp.rcu_sched_clock_irq 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.children.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs 0.40 ± 12% -0.1 0.30 ± 14% perf-profile.children.cycles-pp.__next_timer_interrupt 0.47 ± 7% -0.1 0.39 ± 13% perf-profile.children.cycles-pp.update_rq_clock 0.29 ± 12% -0.1 0.21 ± 15% perf-profile.children.cycles-pp.cpuidle_governor_latency_req 0.21 ± 7% -0.1 0.14 ± 12% perf-profile.children.cycles-pp.account_system_index_time 0.38 ± 2% -0.1 0.31 ± 12% perf-profile.children.cycles-pp.timerqueue_add 0.26 ± 11% -0.1 0.20 ± 13% 
perf-profile.children.cycles-pp.find_next_bit 0.23 ± 15% -0.1 0.17 ± 15% perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit 0.14 ± 8% -0.1 0.07 ± 14% perf-profile.children.cycles-pp.account_user_time 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.children.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.irq_work_tick 0.11 ± 13% -0.0 0.07 ± 25% perf-profile.children.cycles-pp.tick_sched_do_timer 0.12 ± 10% -0.0 0.08 ± 15% perf-profile.children.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.children.cycles-pp.raise_softirq 0.12 ± 3% -0.0 0.09 ± 8% perf-profile.children.cycles-pp.write 0.11 ± 13% +0.0 0.14 ± 8% perf-profile.children.cycles-pp.native_write_msr 0.09 ± 9% +0.0 0.11 ± 7% perf-profile.children.cycles-pp.finish_task_switch 0.10 ± 10% +0.0 0.13 ± 5% perf-profile.children.cycles-pp.schedule_idle 0.07 ± 6% +0.0 0.10 ± 12% perf-profile.children.cycles-pp.__read_nocancel 0.04 ± 58% +0.0 0.07 ± 15% perf-profile.children.cycles-pp.__free_pages_ok 0.06 ± 7% +0.0 0.09 ± 13% perf-profile.children.cycles-pp.perf_read 0.07 +0.0 0.11 ± 14% perf-profile.children.cycles-pp.perf_evsel__read_counter 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.cmd_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.__run_perf_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.process_interval 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.read_counters 0.07 ± 22% +0.0 0.11 ± 19% perf-profile.children.cycles-pp.__handle_mm_fault 0.07 ± 19% +0.1 0.13 ± 8% perf-profile.children.cycles-pp.rb_erase 0.03 ±100% +0.1 0.09 ± 9% perf-profile.children.cycles-pp.smp_call_function_single 0.01 ±173% +0.1 0.08 ± 11% perf-profile.children.cycles-pp.perf_event_read 0.00 +0.1 0.07 ± 13% perf-profile.children.cycles-pp.__perf_event_read_value 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.__intel_pmu_enable_all 0.08 ± 17% +0.1 0.15 ± 8% perf-profile.children.cycles-pp.native_apic_msr_eoi_write 0.04 ±103% +0.1 0.13 ± 58% perf-profile.children.cycles-pp.shmem_getpage_gfp 0.38 ± 14% +0.1 0.51 ± 6% perf-profile.children.cycles-pp.run_timer_softirq 0.11 ± 4% +0.3 0.37 ± 32% perf-profile.children.cycles-pp.worker_thread 0.20 ± 5% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.ret_from_fork 0.20 ± 4% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.kthread 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.memcpy_erms 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.drm_fb_helper_dirty_work 0.00 +0.3 0.31 ± 37% perf-profile.children.cycles-pp.process_one_work 0.47 ± 48% +0.4 0.91 ± 19% perf-profile.children.cycles-pp.prep_new_huge_page 0.70 ± 29% +0.5 1.16 ± 18% perf-profile.children.cycles-pp.free_huge_page 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_flush_mmu 0.72 ± 29% +0.5 1.18 ± 18% perf-profile.children.cycles-pp.release_pages 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_finish_mmu 0.76 ± 27% +0.5 1.23 ± 18% perf-profile.children.cycles-pp.exit_mmap 0.77 ± 27% +0.5 1.24 ± 18% perf-profile.children.cycles-pp.mmput 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.__x64_sys_exit_group 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_group_exit 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_exit 1.28 ± 29% +0.5 1.76 ± 9% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler 0.77 ± 28% +0.5 1.26 ± 13% perf-profile.children.cycles-pp.alloc_fresh_huge_page 1.53 ± 15% +0.7 2.26 ± 14% perf-profile.children.cycles-pp.do_syscall_64 1.53 ± 15% +0.7 2.27 ± 14% 
perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 1.13 ± 3% +0.9 2.07 ± 14% perf-profile.children.cycles-pp.interrupt_entry 0.79 ± 9% +1.0 1.76 ± 5% perf-profile.children.cycles-pp.perf_event_task_tick 1.71 ± 39% +1.4 3.08 ± 16% perf-profile.children.cycles-pp.alloc_surplus_huge_page 2.66 ± 42% +2.3 4.94 ± 17% perf-profile.children.cycles-pp.alloc_huge_page 2.89 ± 45% +2.7 5.54 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 3.34 ± 35% +2.7 6.02 ± 17% perf-profile.children.cycles-pp._raw_spin_lock 12.77 ± 14% +3.9 16.63 ± 7% perf-profile.children.cycles-pp.mutex_spin_on_owner 20.12 ± 9% +4.0 24.16 ± 6% perf-profile.children.cycles-pp.hugetlb_cow 15.40 ± 10% -3.6 11.84 ± 28% perf-profile.self.cycles-pp.do_rw_once 4.02 ± 9% -1.3 2.73 ± 30% perf-profile.self.cycles-pp.do_access 2.00 ± 14% -0.6 1.41 ± 13% perf-profile.self.cycles-pp.cpuidle_enter_state 1.26 ± 16% -0.5 0.74 ± 13% perf-profile.self.cycles-pp.native_sched_clock 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.self.cycles-pp.account_process_tick 0.27 ± 19% -0.2 0.12 ± 17% perf-profile.self.cycles-pp.timerqueue_del 0.53 ± 3% -0.1 0.38 ± 11% perf-profile.self.cycles-pp.update_curr 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.self.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs 0.61 ± 4% -0.1 0.51 ± 8% perf-profile.self.cycles-pp.task_tick_fair 0.20 ± 8% -0.1 0.12 ± 14% perf-profile.self.cycles-pp.account_system_index_time 0.23 ± 15% -0.1 0.16 ± 17% perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit 0.25 ± 11% -0.1 0.18 ± 14% perf-profile.self.cycles-pp.find_next_bit 0.10 ± 11% -0.1 0.03 ±100% perf-profile.self.cycles-pp.tick_sched_do_timer 0.29 -0.1 0.23 ± 11% perf-profile.self.cycles-pp.timerqueue_add 0.12 ± 10% -0.1 0.06 ± 17% perf-profile.self.cycles-pp.account_user_time 0.22 ± 15% -0.1 0.16 ± 6% perf-profile.self.cycles-pp.scheduler_tick 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.self.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.self.cycles-pp.irq_work_tick 0.07 ± 13% -0.0 0.03 ±100% perf-profile.self.cycles-pp.update_process_times 0.12 ± 7% -0.0 0.08 ± 15% perf-profile.self.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.self.cycles-pp.raise_softirq 0.12 ± 11% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.tick_nohz_get_sleep_length 0.11 ± 11% +0.0 0.14 ± 6% perf-profile.self.cycles-pp.native_write_msr 0.10 ± 5% +0.1 0.15 ± 8% perf-profile.self.cycles-pp.__remove_hrtimer 0.07 ± 23% +0.1 0.13 ± 8% perf-profile.self.cycles-pp.rb_erase 0.08 ± 17% +0.1 0.15 ± 7% perf-profile.self.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.smp_call_function_single 0.32 ± 17% +0.1 0.42 ± 7% perf-profile.self.cycles-pp.run_timer_softirq 0.22 ± 5% +0.1 0.34 ± 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now 0.45 ± 15% +0.2 0.60 ± 12% perf-profile.self.cycles-pp.rcu_irq_enter 0.31 ± 8% +0.2 0.46 ± 16% perf-profile.self.cycles-pp.irq_enter 0.29 ± 10% +0.2 0.44 ± 16% perf-profile.self.cycles-pp.apic_timer_interrupt 0.71 ± 30% +0.2 0.92 ± 8% perf-profile.self.cycles-pp.perf_mux_hrtimer_handler 0.00 +0.3 0.28 ± 37% perf-profile.self.cycles-pp.memcpy_erms 1.12 ± 3% +0.9 2.02 ± 15% perf-profile.self.cycles-pp.interrupt_entry 0.79 ± 9% +0.9 1.73 ± 5% perf-profile.self.cycles-pp.perf_event_task_tick 2.49 ± 45% +2.1 4.55 ± 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 10.95 ± 15% +2.7 13.61 ± 8% perf-profile.self.cycles-pp.mutex_spin_on_owner
[ASCII trend charts: vm-scalability.throughput, vm-scalability.time.minor_page_faults and vm-scalability.workload per sample; '*' marks bisect-good samples, 'O' marks bisect-bad samples. The bisect-bad samples sit consistently below the bisect-good baseline, in line with the table above.]
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks,
Rong Chen
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
On 29.07.19 at 11:51, kernel test robot wrote:
Greetings,
FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:
commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM performance. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed some light on what's going wrong here.
-Daniel
Best regards
Thomas
2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.PMI:Performance_monitoring_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.NMI:Non-maskable_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.PMI:Performance_monitoring_interrupts 331.75 ± 32% -39.8% 199.75 ± 17% interrupts.CPU29.RES:Rescheduling_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.NMI:Non-maskable_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.PMI:Performance_monitoring_interrupts 298.50 ± 30% -39.7% 180.00 ± 6% interrupts.CPU34.RES:Rescheduling_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.NMI:Non-maskable_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.PMI:Performance_monitoring_interrupts 270.50 ± 24% -31.1% 186.25 ± 3% interrupts.CPU36.RES:Rescheduling_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.NMI:Non-maskable_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.PMI:Performance_monitoring_interrupts 286.75 ± 36% -32.4% 193.75 ± 7% interrupts.CPU45.RES:Rescheduling_interrupts 259.00 ± 12% -23.6% 197.75 ± 13% interrupts.CPU46.RES:Rescheduling_interrupts 244.00 ± 21% -35.6% 157.25 ± 11% interrupts.CPU47.RES:Rescheduling_interrupts 230.00 ± 7% -21.3% 181.00 ± 11% interrupts.CPU48.RES:Rescheduling_interrupts 281.00 ± 13% -27.4% 204.00 ± 15% interrupts.CPU53.RES:Rescheduling_interrupts 256.75 ± 5% -18.4% 209.50 ± 12% interrupts.CPU54.RES:Rescheduling_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.NMI:Non-maskable_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.PMI:Performance_monitoring_interrupts 316.00 ± 25% -41.4% 185.25 ± 13% interrupts.CPU59.RES:Rescheduling_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.NMI:Non-maskable_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.PMI:Performance_monitoring_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.NMI:Non-maskable_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.PMI:Performance_monitoring_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.NMI:Non-maskable_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.PMI:Performance_monitoring_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.NMI:Non-maskable_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.PMI:Performance_monitoring_interrupts 319.00 ± 40% -44.7% 176.25 ± 9% interrupts.CPU67.RES:Rescheduling_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.NMI:Non-maskable_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.PMI:Performance_monitoring_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.NMI:Non-maskable_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.PMI:Performance_monitoring_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.NMI:Non-maskable_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.PMI:Performance_monitoring_interrupts 426.75 ± 61% -67.7% 138.00 ± 8% interrupts.CPU75.RES:Rescheduling_interrupts 192.50 ± 13% +45.6% 280.25 ± 45% interrupts.CPU76.RES:Rescheduling_interrupts 274.25 ± 34% -42.2% 158.50 ± 34% interrupts.CPU77.RES:Rescheduling_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.NMI:Non-maskable_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.PMI:Performance_monitoring_interrupts 348.50 ± 53% -47.3% 183.75 ± 29% interrupts.CPU80.RES:Rescheduling_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.NMI:Non-maskable_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.PMI:Performance_monitoring_interrupts 2235 ± 10% +117.8% 4867 ± 10% interrupts.CPU90.NMI:Non-maskable_interrupts 2235 ± 10% +117.8% 4867 ± 10% 
interrupts.CPU90.PMI:Performance_monitoring_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.NMI:Non-maskable_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.PMI:Performance_monitoring_interrupts 408.75 ± 58% -56.8% 176.75 ± 25% interrupts.CPU92.RES:Rescheduling_interrupts 399.00 ± 64% -63.6% 145.25 ± 16% interrupts.CPU93.RES:Rescheduling_interrupts 314.75 ± 36% -44.2% 175.75 ± 13% interrupts.CPU94.RES:Rescheduling_interrupts 191.00 ± 15% -29.1% 135.50 ± 9% interrupts.CPU97.RES:Rescheduling_interrupts 94.00 ± 8% +50.0% 141.00 ± 12% interrupts.IWI:IRQ_work_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.NMI:Non-maskable_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.PMI:Performance_monitoring_interrupts 12.75 ± 11% -4.1 8.67 ± 31% perf-profile.calltrace.cycles-pp.do_rw_once 1.02 ± 16% -0.6 0.47 ± 59% perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle 1.10 ± 15% -0.4 0.66 ± 14% perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry 1.05 ± 16% -0.4 0.61 ± 14% perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter 1.58 ± 4% +0.3 1.91 ± 7% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.11 ± 4% +0.5 2.60 ± 7% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.90 ± 5% +0.6 2.45 ± 7% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage 0.65 ± 62% +0.6 1.20 ± 15% perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.60 ± 62% +0.6 1.16 ± 18% perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap 0.95 ± 17% +0.6 1.52 ± 8% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner 0.61 ± 62% +0.6 1.18 ± 18% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group 1.30 ± 9% +0.6 1.92 ± 8% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock 0.19 ±173% +0.7 0.89 ± 20% 
perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu 0.19 ±173% +0.7 0.90 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu 0.00 +0.8 0.77 ± 30% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page 0.00 +0.8 0.78 ± 30% perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page 0.00 +0.8 0.79 ± 29% perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 0.82 ± 67% +0.9 1.72 ± 22% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.84 ± 66% +0.9 1.74 ± 20% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 2.52 ± 6% +0.9 3.44 ± 9% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page 0.83 ± 67% +0.9 1.75 ± 21% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 0.84 ± 66% +0.9 1.77 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 1.64 ± 12% +1.0 2.67 ± 7% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault 1.65 ± 45% +1.3 2.99 ± 18% perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 1.74 ± 13% +1.4 3.16 ± 6% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault 2.56 ± 48% +2.2 4.81 ± 19% perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault 12.64 ± 14% +3.6 16.20 ± 8% perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault 2.97 ± 7% +3.8 6.74 ± 9% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow 19.99 ± 9% +4.1 24.05 ± 6% perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault 1.37 ± 15% -0.5 0.83 ± 13% perf-profile.children.cycles-pp.sched_clock_cpu 1.31 ± 16% -0.5 0.78 ± 13% perf-profile.children.cycles-pp.sched_clock 1.29 ± 16% -0.5 0.77 ± 13% perf-profile.children.cycles-pp.native_sched_clock 1.80 ± 2% -0.3 1.47 ± 10% perf-profile.children.cycles-pp.task_tick_fair 0.73 ± 2% -0.2 0.54 ± 11% perf-profile.children.cycles-pp.update_curr 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.children.cycles-pp.account_process_tick 0.73 ± 10% -0.2 0.58 ± 9% perf-profile.children.cycles-pp.rcu_sched_clock_irq 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.children.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs 0.40 ± 12% -0.1 0.30 ± 14% perf-profile.children.cycles-pp.__next_timer_interrupt 0.47 ± 7% -0.1 0.39 ± 13% perf-profile.children.cycles-pp.update_rq_clock 0.29 ± 12% -0.1 0.21 ± 15% perf-profile.children.cycles-pp.cpuidle_governor_latency_req 0.21 ± 7% -0.1 0.14 ± 12% perf-profile.children.cycles-pp.account_system_index_time 0.38 ± 2% -0.1 0.31 ± 12% perf-profile.children.cycles-pp.timerqueue_add 0.26 ± 11% -0.1 0.20 ± 13% 
perf-profile.children.cycles-pp.find_next_bit 0.23 ± 15% -0.1 0.17 ± 15% perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit 0.14 ± 8% -0.1 0.07 ± 14% perf-profile.children.cycles-pp.account_user_time 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.children.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.irq_work_tick 0.11 ± 13% -0.0 0.07 ± 25% perf-profile.children.cycles-pp.tick_sched_do_timer 0.12 ± 10% -0.0 0.08 ± 15% perf-profile.children.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.children.cycles-pp.raise_softirq 0.12 ± 3% -0.0 0.09 ± 8% perf-profile.children.cycles-pp.write 0.11 ± 13% +0.0 0.14 ± 8% perf-profile.children.cycles-pp.native_write_msr 0.09 ± 9% +0.0 0.11 ± 7% perf-profile.children.cycles-pp.finish_task_switch 0.10 ± 10% +0.0 0.13 ± 5% perf-profile.children.cycles-pp.schedule_idle 0.07 ± 6% +0.0 0.10 ± 12% perf-profile.children.cycles-pp.__read_nocancel 0.04 ± 58% +0.0 0.07 ± 15% perf-profile.children.cycles-pp.__free_pages_ok 0.06 ± 7% +0.0 0.09 ± 13% perf-profile.children.cycles-pp.perf_read 0.07 +0.0 0.11 ± 14% perf-profile.children.cycles-pp.perf_evsel__read_counter 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.cmd_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.__run_perf_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.process_interval 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.read_counters 0.07 ± 22% +0.0 0.11 ± 19% perf-profile.children.cycles-pp.__handle_mm_fault 0.07 ± 19% +0.1 0.13 ± 8% perf-profile.children.cycles-pp.rb_erase 0.03 ±100% +0.1 0.09 ± 9% perf-profile.children.cycles-pp.smp_call_function_single 0.01 ±173% +0.1 0.08 ± 11% perf-profile.children.cycles-pp.perf_event_read 0.00 +0.1 0.07 ± 13% perf-profile.children.cycles-pp.__perf_event_read_value 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.__intel_pmu_enable_all 0.08 ± 17% +0.1 0.15 ± 8% perf-profile.children.cycles-pp.native_apic_msr_eoi_write 0.04 ±103% +0.1 0.13 ± 58% perf-profile.children.cycles-pp.shmem_getpage_gfp 0.38 ± 14% +0.1 0.51 ± 6% perf-profile.children.cycles-pp.run_timer_softirq 0.11 ± 4% +0.3 0.37 ± 32% perf-profile.children.cycles-pp.worker_thread 0.20 ± 5% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.ret_from_fork 0.20 ± 4% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.kthread 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.memcpy_erms 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.drm_fb_helper_dirty_work 0.00 +0.3 0.31 ± 37% perf-profile.children.cycles-pp.process_one_work 0.47 ± 48% +0.4 0.91 ± 19% perf-profile.children.cycles-pp.prep_new_huge_page 0.70 ± 29% +0.5 1.16 ± 18% perf-profile.children.cycles-pp.free_huge_page 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_flush_mmu 0.72 ± 29% +0.5 1.18 ± 18% perf-profile.children.cycles-pp.release_pages 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_finish_mmu 0.76 ± 27% +0.5 1.23 ± 18% perf-profile.children.cycles-pp.exit_mmap 0.77 ± 27% +0.5 1.24 ± 18% perf-profile.children.cycles-pp.mmput 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.__x64_sys_exit_group 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_group_exit 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_exit 1.28 ± 29% +0.5 1.76 ± 9% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler 0.77 ± 28% +0.5 1.26 ± 13% perf-profile.children.cycles-pp.alloc_fresh_huge_page 1.53 ± 15% +0.7 2.26 ± 14% perf-profile.children.cycles-pp.do_syscall_64 1.53 ± 15% +0.7 2.27 ± 14% 
perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 1.13 ± 3% +0.9 2.07 ± 14% perf-profile.children.cycles-pp.interrupt_entry 0.79 ± 9% +1.0 1.76 ± 5% perf-profile.children.cycles-pp.perf_event_task_tick 1.71 ± 39% +1.4 3.08 ± 16% perf-profile.children.cycles-pp.alloc_surplus_huge_page 2.66 ± 42% +2.3 4.94 ± 17% perf-profile.children.cycles-pp.alloc_huge_page 2.89 ± 45% +2.7 5.54 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 3.34 ± 35% +2.7 6.02 ± 17% perf-profile.children.cycles-pp._raw_spin_lock 12.77 ± 14% +3.9 16.63 ± 7% perf-profile.children.cycles-pp.mutex_spin_on_owner 20.12 ± 9% +4.0 24.16 ± 6% perf-profile.children.cycles-pp.hugetlb_cow 15.40 ± 10% -3.6 11.84 ± 28% perf-profile.self.cycles-pp.do_rw_once 4.02 ± 9% -1.3 2.73 ± 30% perf-profile.self.cycles-pp.do_access 2.00 ± 14% -0.6 1.41 ± 13% perf-profile.self.cycles-pp.cpuidle_enter_state 1.26 ± 16% -0.5 0.74 ± 13% perf-profile.self.cycles-pp.native_sched_clock 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.self.cycles-pp.account_process_tick 0.27 ± 19% -0.2 0.12 ± 17% perf-profile.self.cycles-pp.timerqueue_del 0.53 ± 3% -0.1 0.38 ± 11% perf-profile.self.cycles-pp.update_curr 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.self.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs 0.61 ± 4% -0.1 0.51 ± 8% perf-profile.self.cycles-pp.task_tick_fair 0.20 ± 8% -0.1 0.12 ± 14% perf-profile.self.cycles-pp.account_system_index_time 0.23 ± 15% -0.1 0.16 ± 17% perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit 0.25 ± 11% -0.1 0.18 ± 14% perf-profile.self.cycles-pp.find_next_bit 0.10 ± 11% -0.1 0.03 ±100% perf-profile.self.cycles-pp.tick_sched_do_timer 0.29 -0.1 0.23 ± 11% perf-profile.self.cycles-pp.timerqueue_add 0.12 ± 10% -0.1 0.06 ± 17% perf-profile.self.cycles-pp.account_user_time 0.22 ± 15% -0.1 0.16 ± 6% perf-profile.self.cycles-pp.scheduler_tick 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.self.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.self.cycles-pp.irq_work_tick 0.07 ± 13% -0.0 0.03 ±100% perf-profile.self.cycles-pp.update_process_times 0.12 ± 7% -0.0 0.08 ± 15% perf-profile.self.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.self.cycles-pp.raise_softirq 0.12 ± 11% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.tick_nohz_get_sleep_length 0.11 ± 11% +0.0 0.14 ± 6% perf-profile.self.cycles-pp.native_write_msr 0.10 ± 5% +0.1 0.15 ± 8% perf-profile.self.cycles-pp.__remove_hrtimer 0.07 ± 23% +0.1 0.13 ± 8% perf-profile.self.cycles-pp.rb_erase 0.08 ± 17% +0.1 0.15 ± 7% perf-profile.self.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.smp_call_function_single 0.32 ± 17% +0.1 0.42 ± 7% perf-profile.self.cycles-pp.run_timer_softirq 0.22 ± 5% +0.1 0.34 ± 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now 0.45 ± 15% +0.2 0.60 ± 12% perf-profile.self.cycles-pp.rcu_irq_enter 0.31 ± 8% +0.2 0.46 ± 16% perf-profile.self.cycles-pp.irq_enter 0.29 ± 10% +0.2 0.44 ± 16% perf-profile.self.cycles-pp.apic_timer_interrupt 0.71 ± 30% +0.2 0.92 ± 8% perf-profile.self.cycles-pp.perf_mux_hrtimer_handler 0.00 +0.3 0.28 ± 37% perf-profile.self.cycles-pp.memcpy_erms 1.12 ± 3% +0.9 2.02 ± 15% perf-profile.self.cycles-pp.interrupt_entry 0.79 ± 9% +0.9 1.73 ± 5% perf-profile.self.cycles-pp.perf_event_task_tick 2.49 ± 45% +2.1 4.55 ± 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 10.95 ± 15% +2.7 13.61 ± 8% perf-profile.self.cycles-pp.mutex_spin_on_owner
vm-scalability.throughput
[flattened ASCII trend plot: bisect-good samples stay around 1.4e+07-1.6e+07, bisect-bad samples (O) drop to roughly 1.2e+07]
vm-scalability.time.minor_page_faults
[flattened ASCII trend plot: bisect-good samples sit near 2e+06 minor page faults, bisect-bad samples (O) fall toward 1.6e+06]
vm-scalability.workload
[flattened ASCII trend plot: bisect-good samples sit near 3e+09, bisect-bad samples (O) fall to roughly 2.3e+09]
[*] bisect-good sample [O] bisect-bad sample
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks, Rong Chen
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Hi
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
On 29.07.19 at 11:51, kernel test robot wrote:
Greeting,
FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 chips have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. As long as it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() around each update from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. My (as yet unverified) understanding is that this re-mapping causes the performance regression in the VM code.
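Roughly, the per-update flow looks like this (a simplified sketch from memory, not the exact helper code; fbdev_flush_dirty is just a stand-in name for the dirty worker, and error handling plus the clip-rect bookkeeping are omitted):

#include <linux/err.h>
#include <drm/drm_client.h>
#include <drm/drm_fb_helper.h>

/*
 * Simplified sketch: the BO is mapped anew for every screen update, the
 * dirty rectangle is copied over from the shadow FB, and the mapping is
 * dropped again right away so that the BO stays evictable.
 */
static void fbdev_flush_dirty(struct drm_fb_helper *helper,
			      struct drm_clip_rect *clip)
{
	void *vaddr;

	/* Map the BO anew for this update ... */
	vaddr = drm_client_buffer_vmap(helper->buffer);
	if (IS_ERR(vaddr))
		return;

	/* ... blit the dirty rectangle from the shadow FB (memcpy per scanline) ... */
	drm_fb_helper_dirty_blit_real(helper, clip);

	/* ... and drop the mapping again so the BO can be evicted from VRAM. */
	drm_client_buffer_vunmap(helper->buffer);

	/* Finally tell the driver about the damaged area. */
	helper->fb->funcs->dirty(helper->fb, NULL, 0, 0, clip, 1);
}

If the mapping step is indeed the problem, its cost is paid on every deferred-I/O flush.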
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] and the drawing code only mapped it when necessary (i.e., when the BO is not being displayed). [3]
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
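If we went that route, I'd imagine something like a reference-counted mapping cache (purely hypothetical sketch; vram_vmap_cache and the other names are made up, not an existing VRAM-helper interface), so that only the first user pays for the mapping and the address stays valid while the BO is pinned for scanout:

#include <linux/err.h>
#include <linux/mutex.h>

/* Hypothetical reference-counted mapping cache; not an existing helper API. */
struct vram_vmap_cache {
	struct mutex lock;
	unsigned int use_count;	/* number of users holding the mapping */
	void *vaddr;		/* cached kernel address, valid while use_count > 0 */
};

/* First caller actually maps the BO; later callers reuse the cached address. */
static void *vram_vmap_cache_get(struct vram_vmap_cache *c,
				 void *(*vmap)(void *bo), void *bo)
{
	void *vaddr;

	mutex_lock(&c->lock);
	if (c->use_count == 0) {
		vaddr = vmap(bo);	/* expensive: pins the BO and maps VRAM */
		if (IS_ERR(vaddr))
			goto out_unlock;
		c->vaddr = vaddr;
	}
	c->use_count++;
	vaddr = c->vaddr;
out_unlock:
	mutex_unlock(&c->lock);
	return vaddr;
}

/* Last caller drops the mapping, so the BO becomes evictable again. */
static void vram_vmap_cache_put(struct vram_vmap_cache *c,
				void (*vunmap)(void *bo), void *bo)
{
	mutex_lock(&c->lock);
	if (--c->use_count == 0) {
		vunmap(bo);
		c->vaddr = NULL;
	}
	mutex_unlock(&c->lock);
}

The fbdev emulation would take the mapping once when the BO becomes the scanout buffer, and the dirty worker could then reuse the cached address; that's essentially what the old mgag200 code did with its kmap.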
Noralf mentioned that there are plans for other DRM clients besides the console. They would run into similar problems as well.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed some light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Best regards Thomas
[1] https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_fb_helper...
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
-Daniel
Best regards Thomas
perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 1.13 ± 3% +0.9 2.07 ± 14% perf-profile.children.cycles-pp.interrupt_entry 0.79 ± 9% +1.0 1.76 ± 5% perf-profile.children.cycles-pp.perf_event_task_tick 1.71 ± 39% +1.4 3.08 ± 16% perf-profile.children.cycles-pp.alloc_surplus_huge_page 2.66 ± 42% +2.3 4.94 ± 17% perf-profile.children.cycles-pp.alloc_huge_page 2.89 ± 45% +2.7 5.54 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 3.34 ± 35% +2.7 6.02 ± 17% perf-profile.children.cycles-pp._raw_spin_lock 12.77 ± 14% +3.9 16.63 ± 7% perf-profile.children.cycles-pp.mutex_spin_on_owner 20.12 ± 9% +4.0 24.16 ± 6% perf-profile.children.cycles-pp.hugetlb_cow 15.40 ± 10% -3.6 11.84 ± 28% perf-profile.self.cycles-pp.do_rw_once 4.02 ± 9% -1.3 2.73 ± 30% perf-profile.self.cycles-pp.do_access 2.00 ± 14% -0.6 1.41 ± 13% perf-profile.self.cycles-pp.cpuidle_enter_state 1.26 ± 16% -0.5 0.74 ± 13% perf-profile.self.cycles-pp.native_sched_clock 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.self.cycles-pp.account_process_tick 0.27 ± 19% -0.2 0.12 ± 17% perf-profile.self.cycles-pp.timerqueue_del 0.53 ± 3% -0.1 0.38 ± 11% perf-profile.self.cycles-pp.update_curr 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.self.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs 0.61 ± 4% -0.1 0.51 ± 8% perf-profile.self.cycles-pp.task_tick_fair 0.20 ± 8% -0.1 0.12 ± 14% perf-profile.self.cycles-pp.account_system_index_time 0.23 ± 15% -0.1 0.16 ± 17% perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit 0.25 ± 11% -0.1 0.18 ± 14% perf-profile.self.cycles-pp.find_next_bit 0.10 ± 11% -0.1 0.03 ±100% perf-profile.self.cycles-pp.tick_sched_do_timer 0.29 -0.1 0.23 ± 11% perf-profile.self.cycles-pp.timerqueue_add 0.12 ± 10% -0.1 0.06 ± 17% perf-profile.self.cycles-pp.account_user_time 0.22 ± 15% -0.1 0.16 ± 6% perf-profile.self.cycles-pp.scheduler_tick 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.self.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.self.cycles-pp.irq_work_tick 0.07 ± 13% -0.0 0.03 ±100% perf-profile.self.cycles-pp.update_process_times 0.12 ± 7% -0.0 0.08 ± 15% perf-profile.self.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.self.cycles-pp.raise_softirq 0.12 ± 11% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.tick_nohz_get_sleep_length 0.11 ± 11% +0.0 0.14 ± 6% perf-profile.self.cycles-pp.native_write_msr 0.10 ± 5% +0.1 0.15 ± 8% perf-profile.self.cycles-pp.__remove_hrtimer 0.07 ± 23% +0.1 0.13 ± 8% perf-profile.self.cycles-pp.rb_erase 0.08 ± 17% +0.1 0.15 ± 7% perf-profile.self.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.smp_call_function_single 0.32 ± 17% +0.1 0.42 ± 7% perf-profile.self.cycles-pp.run_timer_softirq 0.22 ± 5% +0.1 0.34 ± 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now 0.45 ± 15% +0.2 0.60 ± 12% perf-profile.self.cycles-pp.rcu_irq_enter 0.31 ± 8% +0.2 0.46 ± 16% perf-profile.self.cycles-pp.irq_enter 0.29 ± 10% +0.2 0.44 ± 16% perf-profile.self.cycles-pp.apic_timer_interrupt 0.71 ± 30% +0.2 0.92 ± 8% perf-profile.self.cycles-pp.perf_mux_hrtimer_handler 0.00 +0.3 0.28 ± 37% perf-profile.self.cycles-pp.memcpy_erms 1.12 ± 3% +0.9 2.02 ± 15% perf-profile.self.cycles-pp.interrupt_entry 0.79 ± 9% +0.9 1.73 ± 5% perf-profile.self.cycles-pp.perf_event_task_tick 2.49 ± 45% +2.1 4.55 ± 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 10.95 ± 15% +2.7 13.61 ± 8% perf-profile.self.cycles-pp.mutex_spin_on_owner
vm-scalability.throughput
[collapsed ASCII trend plot: per-run throughput, bisect-good (+) samples around 1.4e+07-1.6e+07, bisect-bad (O) samples mostly around 1.2e+07-1.4e+07]
vm-scalability.time.minor_page_faults
[collapsed ASCII trend plot: minor page faults per run, bisect-good (+) samples around 2e+06, bisect-bad (O) samples around 1.5e+06-2e+06]
vm-scalability.workload
[collapsed ASCII trend plot: workload per run, bisect-good (+) samples around 3e+09, bisect-bad (O) samples around 2e+09-2.5e+09]
[*] bisect-good sample [O] bisect-bad sample
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks, Rong Chen
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
Am 30.07.19 um 20:12 schrieb Daniel Vetter:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Am 29.07.19 um 11:51 schrieb kernel test robot:
Greeting,
FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap there shouldn't be any impact at all.
The ast and mgag200 chips have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. As long as it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped in drm_fb_helper_dirty_work() before being updated from the shadow FB and unmapped again afterwards. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this is what causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
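For illustration, here is a condensed sketch of the per-update path just described. It is paraphrased rather than quoted from drm_fb_helper_dirty_work(); the function name, parameters, struct accesses and the inline blit loop are simplified, and drm_client_buffer_vmap()/drm_client_buffer_vunmap() are written with the signatures they had around the time of this thread.

#include <drm/drm_client.h>
#include <drm/drm_fb_helper.h>
#include <drm/drm_framebuffer.h>
#include <linux/err.h>
#include <linux/string.h>

/* Sketch: flush the damaged region of the fbdev shadow FB into the real BO. */
static void fbdev_shadow_flush_sketch(struct drm_fb_helper *helper,
                                      void *shadow, struct drm_clip_rect *clip)
{
        struct drm_framebuffer *fb = helper->fb;
        unsigned int cpp = fb->format->cpp[0];
        size_t offset = clip->y1 * fb->pitches[0] + clip->x1 * cpp;
        size_t len = (clip->x2 - clip->x1) * cpp;
        void *src = shadow + offset;    /* shadow FB lives in system memory */
        void *dst;
        unsigned int y;

        /* Map the BO into the kernel's address space ... */
        dst = drm_client_buffer_vmap(helper->buffer);
        if (IS_ERR(dst))
                return;
        dst += offset;

        /* ... copy only the damaged scanlines over ... */
        for (y = clip->y1; y < clip->y2; y++) {
                memcpy(dst, src, len);
                src += fb->pitches[0];
                dst += fb->pitches[0];
        }

        /* ... and drop the mapping again. This runs for every deferred-I/O
         * update, which is the suspected source of the regression. */
        drm_client_buffer_vunmap(helper->buffer);
}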
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
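A minimal sketch of what caching the vmap could look like, assuming the mapping is kept in some per-client state and only dropped when the fbdev is torn down or the buffer has to move; the struct and function names below are made up for illustration and are not existing DRM API:

#include <drm/drm_client.h>
#include <linux/err.h>

/* Illustration only: reuse one kernel mapping across updates instead of
 * re-establishing it in every dirty-worker run. */
struct shadow_flush_state {
        struct drm_client_buffer *buffer;
        void *vaddr;            /* cached kernel mapping, NULL while unmapped */
};

static void *shadow_flush_get_vmap(struct shadow_flush_state *state)
{
        void *vaddr;

        if (state->vaddr)
                return state->vaddr;    /* fast path: mapping already cached */

        vaddr = drm_client_buffer_vmap(state->buffer);
        if (IS_ERR(vaddr))
                return vaddr;

        state->vaddr = vaddr;           /* keep it mapped for later updates */
        return vaddr;
}

static void shadow_flush_put_vmap(struct shadow_flush_state *state)
{
        /* Not called per update: only on fbdev teardown or when the BO must
         * be evicted from VRAM, which is the ast/mgag200 constraint above. */
        if (!state->vaddr)
                return;

        drm_client_buffer_vunmap(state->buffer);
        state->vaddr = NULL;
}

The catch is that a long-lived mapping keeps the BO pinned, so on ast/mgag200 such a cache still has to cooperate with eviction of the fbdev BO.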
Noralf mentioned that there are plans for other DRM clients besides the console. They would run into similar problems as well.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed some light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do, really. Yes, it's a regression, but VM testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ... -Daniel
Best regards Thomas
[1] https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_fb_helper... [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv... [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
-Daniel
Best regards Thomas
in testcase: vm-scalability on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory with following parameters:
runtime: 300s size: 8T test: anon-cow-seq-hugetlb cpufreq_governor: performance
test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us. test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
Details are as below: -------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp install job.yaml # job file is attached in this email bin/lkp run job.yaml
========================================================================================= compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase: gcc-7/performance/x86_64-rhel-7.6/debian-x86_64-2019-05-14.cgz/300s/8T/lkp-knm01/anon-cow-seq-hugetlb/vm-scalability
commit: f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console") 90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
fail:runs %reproduction fail:runs | | | 2:4 -50% :4 dmesg.WARNING:at#for_ip_interrupt_entry/0x :4 25% 1:4 dmesg.WARNING:at_ip___perf_sw_event/0x :4 25% 1:4 dmesg.WARNING:at_ip__fsnotify_parent/0x %stddev %change %stddev \ | \ 43955 ± 2% -18.8% 35691 vm-scalability.median 0.06 ± 7% +193.0% 0.16 ± 2% vm-scalability.median_stddev
14906559 ± 2% -17.9% 12237079 vm-scalability.throughput 87651 ± 2% -17.4% 72374 vm-scalability.time.involuntary_context_switches 2086168 -23.6% 1594224 vm-scalability.time.minor_page_faults 15082 ± 2% -10.4% 13517 vm-scalability.time.percent_of_cpu_this_job_got 29987 -8.9% 27327 vm-scalability.time.system_time 15755 -12.4% 13795 vm-scalability.time.user_time 122011 -19.3% 98418 vm-scalability.time.voluntary_context_switches 3.034e+09 -23.6% 2.318e+09 vm-scalability.workload 242478 ± 12% +68.5% 408518 ± 23% cpuidle.POLL.time 2788 ± 21% +117.4% 6062 ± 26% cpuidle.POLL.usage 56653 ± 10% +64.4% 93144 ± 20% meminfo.Mapped 120392 ± 7% +14.0% 137212 ± 4% meminfo.Shmem 47221 ± 11% +77.1% 83634 ± 22% numa-meminfo.node0.Mapped 120465 ± 7% +13.9% 137205 ± 4% numa-meminfo.node0.Shmem 2885513 -16.5% 2409384 numa-numastat.node0.local_node 2885471 -16.5% 2409354 numa-numastat.node0.numa_hit 11813 ± 11% +76.3% 20824 ± 22% numa-vmstat.node0.nr_mapped 30096 ± 7% +13.8% 34238 ± 4% numa-vmstat.node0.nr_shmem 43.72 ± 2% +5.5 49.20 mpstat.cpu.all.idle% 0.03 ± 4% +0.0 0.05 ± 6% mpstat.cpu.all.soft% 19.51 -2.4 17.08 mpstat.cpu.all.usr% 1012 -7.9% 932.75 turbostat.Avg_MHz 32.38 ± 10% +25.8% 40.73 turbostat.CPU%c1 145.51 -3.1% 141.01 turbostat.PkgWatt 15.09 -19.2% 12.19 turbostat.RAMWatt 43.50 ± 2% +13.2% 49.25 vmstat.cpu.id 18.75 ± 2% -13.3% 16.25 ± 2% vmstat.cpu.us 152.00 ± 2% -9.5% 137.50 vmstat.procs.r 4800 -13.1% 4173 vmstat.system.cs 156170 -11.9% 137594 slabinfo.anon_vma.active_objs 3395 -11.9% 2991 slabinfo.anon_vma.active_slabs 156190 -11.9% 137606 slabinfo.anon_vma.num_objs 3395 -11.9% 2991 slabinfo.anon_vma.num_slabs 1716 ± 5% +11.5% 1913 ± 8% slabinfo.dmaengine-unmap-16.active_objs 1716 ± 5% +11.5% 1913 ± 8% slabinfo.dmaengine-unmap-16.num_objs 1767 ± 2% -19.0% 1431 ± 2% slabinfo.hugetlbfs_inode_cache.active_objs 1767 ± 2% -19.0% 1431 ± 2% slabinfo.hugetlbfs_inode_cache.num_objs 3597 ± 5% -16.4% 3006 ± 3% slabinfo.skbuff_ext_cache.active_objs 3597 ± 5% -16.4% 3006 ± 3% slabinfo.skbuff_ext_cache.num_objs 1330122 -23.6% 1016557 proc-vmstat.htlb_buddy_alloc_success 77214 ± 3% +6.4% 82128 ± 2% proc-vmstat.nr_active_anon 67277 +2.9% 69246 proc-vmstat.nr_anon_pages 218.50 ± 3% -10.6% 195.25 proc-vmstat.nr_dirtied 288628 +1.4% 292755 proc-vmstat.nr_file_pages 360.50 -2.7% 350.75 proc-vmstat.nr_inactive_file 14225 ± 9% +63.8% 23304 ± 20% proc-vmstat.nr_mapped 30109 ± 7% +13.8% 34259 ± 4% proc-vmstat.nr_shmem 99870 -1.3% 98597 proc-vmstat.nr_slab_unreclaimable 204.00 ± 4% -12.1% 179.25 proc-vmstat.nr_written 77214 ± 3% +6.4% 82128 ± 2% proc-vmstat.nr_zone_active_anon 360.50 -2.7% 350.75 proc-vmstat.nr_zone_inactive_file 8810 ± 19% -66.1% 2987 ± 42% proc-vmstat.numa_hint_faults 8810 ± 19% -66.1% 2987 ± 42% proc-vmstat.numa_hint_faults_local 2904082 -16.4% 2427026 proc-vmstat.numa_hit 2904081 -16.4% 2427025 proc-vmstat.numa_local 6.828e+08 -23.5% 5.221e+08 proc-vmstat.pgalloc_normal 2900008 -17.2% 2400195 proc-vmstat.pgfault 6.827e+08 -23.5% 5.22e+08 proc-vmstat.pgfree 1.635e+10 -17.0% 1.357e+10 perf-stat.i.branch-instructions 1.53 ± 4% -0.1 1.45 ± 3% perf-stat.i.branch-miss-rate% 2.581e+08 ± 3% -20.5% 2.051e+08 ± 2% perf-stat.i.branch-misses 12.66 +1.1 13.78 perf-stat.i.cache-miss-rate% 72720849 -12.0% 63958986 perf-stat.i.cache-misses 5.766e+08 -18.6% 4.691e+08 perf-stat.i.cache-references 4674 ± 2% -13.0% 4064 perf-stat.i.context-switches 4.29 +12.5% 4.83 perf-stat.i.cpi 2.573e+11 -7.4% 2.383e+11 perf-stat.i.cpu-cycles 231.35 -21.5% 181.56 perf-stat.i.cpu-migrations 3522 +4.4% 3677 
perf-stat.i.cycles-between-cache-misses 0.09 ± 13% +0.0 0.12 ± 5% perf-stat.i.iTLB-load-miss-rate% 5.894e+10 -15.8% 4.961e+10 perf-stat.i.iTLB-loads 5.901e+10 -15.8% 4.967e+10 perf-stat.i.instructions 1291 ± 14% -21.8% 1010 perf-stat.i.instructions-per-iTLB-miss 0.24 -11.0% 0.21 perf-stat.i.ipc 9476 -17.5% 7821 perf-stat.i.minor-faults 9478 -17.5% 7821 perf-stat.i.page-faults 9.76 -3.6% 9.41 perf-stat.overall.MPKI 1.59 ± 4% -0.1 1.52 perf-stat.overall.branch-miss-rate% 12.61 +1.1 13.71 perf-stat.overall.cache-miss-rate% 4.38 +10.5% 4.83 perf-stat.overall.cpi 3557 +5.3% 3747 perf-stat.overall.cycles-between-cache-misses 0.08 ± 12% +0.0 0.10 perf-stat.overall.iTLB-load-miss-rate% 1268 ± 15% -23.0% 976.22 perf-stat.overall.instructions-per-iTLB-miss 0.23 -9.5% 0.21 perf-stat.overall.ipc 5815 +9.7% 6378 perf-stat.overall.path-length 1.634e+10 -17.5% 1.348e+10 perf-stat.ps.branch-instructions 2.595e+08 ± 3% -21.2% 2.043e+08 ± 2% perf-stat.ps.branch-misses 72565205 -12.2% 63706339 perf-stat.ps.cache-misses 5.754e+08 -19.2% 4.646e+08 perf-stat.ps.cache-references 4640 ± 2% -12.5% 4060 perf-stat.ps.context-switches 2.581e+11 -7.5% 2.387e+11 perf-stat.ps.cpu-cycles 229.91 -22.0% 179.42 perf-stat.ps.cpu-migrations 5.889e+10 -16.3% 4.927e+10 perf-stat.ps.iTLB-loads 5.899e+10 -16.3% 4.938e+10 perf-stat.ps.instructions 9388 -18.2% 7677 perf-stat.ps.minor-faults 9389 -18.2% 7677 perf-stat.ps.page-faults 1.764e+13 -16.2% 1.479e+13 perf-stat.total.instructions 46803 ± 3% -18.8% 37982 ± 6% sched_debug.cfs_rq:/.exec_clock.min 5320 ± 3% +23.7% 6581 ± 3% sched_debug.cfs_rq:/.exec_clock.stddev 6737 ± 14% +58.1% 10649 ± 10% sched_debug.cfs_rq:/.load.avg 587978 ± 17% +58.2% 930382 ± 9% sched_debug.cfs_rq:/.load.max 46952 ± 16% +64.8% 77388 ± 11% sched_debug.cfs_rq:/.load.stddev 7.12 ± 4% +49.1% 10.62 ± 6% sched_debug.cfs_rq:/.load_avg.avg 474.40 ± 23% +67.5% 794.60 ± 10% sched_debug.cfs_rq:/.load_avg.max 37.70 ± 11% +74.8% 65.90 ± 9% sched_debug.cfs_rq:/.load_avg.stddev 13424269 ± 4% -15.6% 11328098 ± 2% sched_debug.cfs_rq:/.min_vruntime.avg 15411275 ± 3% -12.4% 13505072 ± 2% sched_debug.cfs_rq:/.min_vruntime.max 7939295 ± 6% -17.5% 6551322 ± 7% sched_debug.cfs_rq:/.min_vruntime.min 21.44 ± 7% -56.1% 9.42 ± 4% sched_debug.cfs_rq:/.nr_spread_over.avg 117.45 ± 11% -60.6% 46.30 ± 14% sched_debug.cfs_rq:/.nr_spread_over.max 19.33 ± 8% -66.4% 6.49 ± 9% sched_debug.cfs_rq:/.nr_spread_over.stddev 4.32 ± 15% +84.4% 7.97 ± 3% sched_debug.cfs_rq:/.runnable_load_avg.avg 353.85 ± 29% +118.8% 774.35 ± 11% sched_debug.cfs_rq:/.runnable_load_avg.max 27.30 ± 24% +118.5% 59.64 ± 9% sched_debug.cfs_rq:/.runnable_load_avg.stddev 6729 ± 14% +58.2% 10644 ± 10% sched_debug.cfs_rq:/.runnable_weight.avg 587978 ± 17% +58.2% 930382 ± 9% sched_debug.cfs_rq:/.runnable_weight.max 46950 ± 16% +64.8% 77387 ± 11% sched_debug.cfs_rq:/.runnable_weight.stddev 5305069 ± 4% -17.4% 4380376 ± 7% sched_debug.cfs_rq:/.spread0.avg 7328745 ± 3% -9.9% 6600897 ± 3% sched_debug.cfs_rq:/.spread0.max 2220837 ± 4% +55.8% 3460596 ± 5% sched_debug.cpu.avg_idle.avg 4590666 ± 9% +76.8% 8117037 ± 15% sched_debug.cpu.avg_idle.max 485052 ± 7% +80.3% 874679 ± 10% sched_debug.cpu.avg_idle.stddev 561.50 ± 26% +37.7% 773.30 ± 15% sched_debug.cpu.clock.stddev 561.50 ± 26% +37.7% 773.30 ± 15% sched_debug.cpu.clock_task.stddev 3.20 ± 10% +109.6% 6.70 ± 3% sched_debug.cpu.cpu_load[0].avg 309.10 ± 20% +150.3% 773.75 ± 12% sched_debug.cpu.cpu_load[0].max 21.02 ± 14% +160.8% 54.80 ± 9% sched_debug.cpu.cpu_load[0].stddev 3.19 ± 8% +109.8% 6.70 ± 3% 
sched_debug.cpu.cpu_load[1].avg 299.75 ± 19% +158.0% 773.30 ± 12% sched_debug.cpu.cpu_load[1].max 20.32 ± 12% +168.7% 54.62 ± 9% sched_debug.cpu.cpu_load[1].stddev 3.20 ± 8% +109.1% 6.69 ± 4% sched_debug.cpu.cpu_load[2].avg 288.90 ± 20% +167.0% 771.40 ± 12% sched_debug.cpu.cpu_load[2].max 19.70 ± 12% +175.4% 54.27 ± 9% sched_debug.cpu.cpu_load[2].stddev 3.16 ± 8% +110.9% 6.66 ± 6% sched_debug.cpu.cpu_load[3].avg 275.50 ± 24% +178.4% 766.95 ± 12% sched_debug.cpu.cpu_load[3].max 18.92 ± 15% +184.2% 53.77 ± 10% sched_debug.cpu.cpu_load[3].stddev 3.08 ± 8% +115.7% 6.65 ± 7% sched_debug.cpu.cpu_load[4].avg 263.55 ± 28% +188.7% 760.85 ± 12% sched_debug.cpu.cpu_load[4].max 18.03 ± 18% +196.6% 53.46 ± 11% sched_debug.cpu.cpu_load[4].stddev 14543 -9.6% 13150 sched_debug.cpu.curr->pid.max 5293 ± 16% +74.7% 9248 ± 11% sched_debug.cpu.load.avg 587978 ± 17% +58.2% 930382 ± 9% sched_debug.cpu.load.max 40887 ± 19% +78.3% 72891 ± 9% sched_debug.cpu.load.stddev 1141679 ± 4% +56.9% 1790907 ± 5% sched_debug.cpu.max_idle_balance_cost.avg 2432100 ± 9% +72.6% 4196779 ± 13% sched_debug.cpu.max_idle_balance_cost.max 745656 +29.3% 964170 ± 5% sched_debug.cpu.max_idle_balance_cost.min 239032 ± 9% +81.9% 434806 ± 10% sched_debug.cpu.max_idle_balance_cost.stddev 0.00 ± 27% +92.1% 0.00 ± 31% sched_debug.cpu.next_balance.stddev 1030 ± 4% -10.4% 924.00 ± 2% sched_debug.cpu.nr_switches.min 0.04 ± 26% +139.0% 0.09 ± 41% sched_debug.cpu.nr_uninterruptible.avg 830.35 ± 6% -12.0% 730.50 ± 2% sched_debug.cpu.sched_count.min 912.00 ± 2% -9.5% 825.38 sched_debug.cpu.ttwu_count.avg 433.05 ± 3% -19.2% 350.05 ± 3% sched_debug.cpu.ttwu_count.min 160.70 ± 3% -12.5% 140.60 ± 4% sched_debug.cpu.ttwu_local.min 9072 ± 11% -36.4% 5767 ± 8% softirqs.CPU1.RCU 12769 ± 5% +15.3% 14718 ± 3% softirqs.CPU101.SCHED 13198 +11.5% 14717 ± 3% softirqs.CPU102.SCHED 12981 ± 4% +13.9% 14788 ± 3% softirqs.CPU105.SCHED 13486 ± 3% +11.8% 15071 ± 4% softirqs.CPU111.SCHED 12794 ± 4% +14.1% 14601 ± 9% softirqs.CPU112.SCHED 12999 ± 4% +10.1% 14314 ± 4% softirqs.CPU115.SCHED 12844 ± 4% +10.6% 14202 ± 2% softirqs.CPU120.SCHED 13336 ± 3% +9.4% 14585 ± 3% softirqs.CPU122.SCHED 12639 ± 4% +20.2% 15195 softirqs.CPU123.SCHED 13040 ± 5% +15.2% 15024 ± 5% softirqs.CPU126.SCHED 13123 +15.1% 15106 ± 5% softirqs.CPU127.SCHED 9188 ± 6% -35.7% 5911 ± 2% softirqs.CPU13.RCU 13054 ± 3% +13.1% 14761 ± 5% softirqs.CPU130.SCHED 13158 ± 2% +13.9% 14985 ± 5% softirqs.CPU131.SCHED 12797 ± 6% +13.5% 14524 ± 3% softirqs.CPU133.SCHED 12452 ± 5% +14.8% 14297 softirqs.CPU134.SCHED 13078 ± 3% +10.4% 14439 ± 3% softirqs.CPU138.SCHED 12617 ± 2% +14.5% 14442 ± 5% softirqs.CPU139.SCHED 12974 ± 3% +13.7% 14752 ± 4% softirqs.CPU142.SCHED 12579 ± 4% +19.1% 14983 ± 3% softirqs.CPU143.SCHED 9122 ± 24% -44.6% 5053 ± 5% softirqs.CPU144.RCU 13366 ± 2% +11.1% 14848 ± 3% softirqs.CPU149.SCHED 13246 ± 2% +22.0% 16162 ± 7% softirqs.CPU150.SCHED 13452 ± 3% +20.5% 16210 ± 7% softirqs.CPU151.SCHED 13507 +10.1% 14869 softirqs.CPU156.SCHED 13808 ± 3% +9.2% 15079 ± 4% softirqs.CPU157.SCHED 13442 ± 2% +13.4% 15248 ± 4% softirqs.CPU160.SCHED 13311 +12.1% 14920 ± 2% softirqs.CPU162.SCHED 13544 ± 3% +8.5% 14695 ± 4% softirqs.CPU163.SCHED 13648 ± 3% +11.2% 15179 ± 2% softirqs.CPU166.SCHED 13404 ± 4% +12.5% 15079 ± 3% softirqs.CPU168.SCHED 13421 ± 6% +16.0% 15568 ± 8% softirqs.CPU169.SCHED 13115 ± 3% +23.1% 16139 ± 10% softirqs.CPU171.SCHED 13424 ± 6% +10.4% 14822 ± 3% softirqs.CPU175.SCHED 13274 ± 3% +13.7% 15087 ± 9% softirqs.CPU185.SCHED 13409 ± 3% +12.3% 15063 ± 3% softirqs.CPU190.SCHED 13181 ± 7% +13.4% 14946 
± 3% softirqs.CPU196.SCHED 13578 ± 3% +10.9% 15061 softirqs.CPU197.SCHED 13323 ± 5% +24.8% 16627 ± 6% softirqs.CPU198.SCHED 14072 ± 2% +12.3% 15798 ± 7% softirqs.CPU199.SCHED 12604 ± 13% +17.9% 14865 softirqs.CPU201.SCHED 13380 ± 4% +14.8% 15356 ± 3% softirqs.CPU203.SCHED 13481 ± 8% +14.2% 15390 ± 3% softirqs.CPU204.SCHED 12921 ± 2% +13.8% 14710 ± 3% softirqs.CPU206.SCHED 13468 +13.0% 15218 ± 2% softirqs.CPU208.SCHED 13253 ± 2% +13.1% 14992 softirqs.CPU209.SCHED 13319 ± 2% +14.3% 15225 ± 7% softirqs.CPU210.SCHED 13673 ± 5% +16.3% 15895 ± 3% softirqs.CPU211.SCHED 13290 +17.0% 15556 ± 5% softirqs.CPU212.SCHED 13455 ± 4% +14.4% 15392 ± 3% softirqs.CPU213.SCHED 13454 ± 4% +14.3% 15377 ± 3% softirqs.CPU215.SCHED 13872 ± 7% +9.7% 15221 ± 5% softirqs.CPU220.SCHED 13555 ± 4% +17.3% 15896 ± 5% softirqs.CPU222.SCHED 13411 ± 4% +20.8% 16197 ± 6% softirqs.CPU223.SCHED 8472 ± 21% -44.8% 4680 ± 3% softirqs.CPU224.RCU 13141 ± 3% +16.2% 15265 ± 7% softirqs.CPU225.SCHED 14084 ± 3% +8.2% 15242 ± 2% softirqs.CPU226.SCHED 13528 ± 4% +11.3% 15063 ± 4% softirqs.CPU228.SCHED 13218 ± 3% +16.3% 15377 ± 4% softirqs.CPU229.SCHED 14031 ± 4% +10.2% 15467 ± 2% softirqs.CPU231.SCHED 13770 ± 3% +14.0% 15700 ± 3% softirqs.CPU232.SCHED 13456 ± 3% +12.3% 15105 ± 3% softirqs.CPU233.SCHED 13137 ± 4% +13.5% 14909 ± 3% softirqs.CPU234.SCHED 13318 ± 2% +14.7% 15280 ± 2% softirqs.CPU235.SCHED 13690 ± 2% +13.7% 15563 ± 7% softirqs.CPU238.SCHED 13771 ± 5% +20.8% 16634 ± 7% softirqs.CPU241.SCHED 13317 ± 7% +19.5% 15919 ± 9% softirqs.CPU243.SCHED 8234 ± 16% -43.9% 4616 ± 5% softirqs.CPU244.RCU 13845 ± 6% +13.0% 15643 ± 3% softirqs.CPU244.SCHED 13179 ± 3% +16.3% 15323 softirqs.CPU246.SCHED 13754 +12.2% 15438 ± 3% softirqs.CPU248.SCHED 13769 ± 4% +10.9% 15276 ± 2% softirqs.CPU252.SCHED 13702 +10.5% 15147 ± 2% softirqs.CPU254.SCHED 13315 ± 2% +12.5% 14980 ± 3% softirqs.CPU255.SCHED 13785 ± 3% +12.9% 15568 ± 5% softirqs.CPU256.SCHED 13307 ± 3% +15.0% 15298 ± 3% softirqs.CPU257.SCHED 13864 ± 3% +10.5% 15313 ± 2% softirqs.CPU259.SCHED 13879 ± 2% +11.4% 15465 softirqs.CPU261.SCHED 13815 +13.6% 15687 ± 5% softirqs.CPU264.SCHED 119574 ± 2% +11.8% 133693 ± 11% softirqs.CPU266.TIMER 13688 +10.9% 15180 ± 6% softirqs.CPU267.SCHED 11716 ± 4% +19.3% 13974 ± 8% softirqs.CPU27.SCHED 13866 ± 3% +13.7% 15765 ± 4% softirqs.CPU271.SCHED 13887 ± 5% +12.5% 15621 softirqs.CPU272.SCHED 13383 ± 3% +19.8% 16031 ± 2% softirqs.CPU274.SCHED 13347 +14.1% 15232 ± 3% softirqs.CPU275.SCHED 12884 ± 2% +21.0% 15593 ± 4% softirqs.CPU276.SCHED 13131 ± 5% +13.4% 14891 ± 5% softirqs.CPU277.SCHED 12891 ± 2% +19.2% 15371 ± 4% softirqs.CPU278.SCHED 13313 ± 4% +13.0% 15049 ± 2% softirqs.CPU279.SCHED 13514 ± 3% +10.2% 14897 ± 2% softirqs.CPU280.SCHED 13501 ± 3% +13.7% 15346 softirqs.CPU281.SCHED 13261 +17.5% 15577 softirqs.CPU282.SCHED 8076 ± 15% -43.7% 4546 ± 5% softirqs.CPU283.RCU 13686 ± 3% +12.6% 15413 ± 2% softirqs.CPU284.SCHED 13439 ± 2% +9.2% 14670 ± 4% softirqs.CPU285.SCHED 8878 ± 9% -35.4% 5735 ± 4% softirqs.CPU35.RCU 11690 ± 2% +13.6% 13274 ± 5% softirqs.CPU40.SCHED 11714 ± 2% +19.3% 13975 ± 13% softirqs.CPU41.SCHED 11763 +12.5% 13239 ± 4% softirqs.CPU45.SCHED 11662 ± 2% +9.4% 12757 ± 3% softirqs.CPU46.SCHED 11805 ± 2% +9.3% 12902 ± 2% softirqs.CPU50.SCHED 12158 ± 3% +12.3% 13655 ± 8% softirqs.CPU55.SCHED 11716 ± 4% +8.8% 12751 ± 3% softirqs.CPU58.SCHED 11922 ± 2% +9.9% 13100 ± 4% softirqs.CPU64.SCHED 9674 ± 17% -41.8% 5625 ± 6% softirqs.CPU66.RCU 11818 +12.0% 13237 softirqs.CPU66.SCHED 124682 ± 7% -6.1% 117088 ± 5% softirqs.CPU66.TIMER 8637 ± 9% -34.0% 5700 ± 7% 
softirqs.CPU70.RCU 11624 ± 2% +11.0% 12901 ± 2% softirqs.CPU70.SCHED 12372 ± 2% +13.2% 14003 ± 3% softirqs.CPU71.SCHED 9949 ± 25% -33.9% 6574 ± 31% softirqs.CPU72.RCU 10392 ± 26% -35.1% 6745 ± 35% softirqs.CPU73.RCU 12766 ± 3% +11.1% 14188 ± 3% softirqs.CPU76.SCHED 12611 ± 2% +18.8% 14984 ± 5% softirqs.CPU78.SCHED 12786 ± 3% +17.9% 15079 ± 7% softirqs.CPU79.SCHED 11947 ± 4% +9.7% 13103 ± 4% softirqs.CPU8.SCHED 13379 ± 7% +11.8% 14962 ± 4% softirqs.CPU83.SCHED 13438 ± 5% +9.7% 14738 ± 2% softirqs.CPU84.SCHED 12768 +19.4% 15241 ± 6% softirqs.CPU88.SCHED 8604 ± 13% -39.3% 5222 ± 3% softirqs.CPU89.RCU 13077 ± 2% +17.1% 15308 ± 7% softirqs.CPU89.SCHED 11887 ± 3% +20.1% 14272 ± 5% softirqs.CPU9.SCHED 12723 ± 3% +11.3% 14165 ± 4% softirqs.CPU90.SCHED 8439 ± 12% -38.9% 5153 ± 4% softirqs.CPU91.RCU 13429 ± 3% +10.3% 14806 ± 2% softirqs.CPU95.SCHED 12852 ± 4% +10.3% 14174 ± 5% softirqs.CPU96.SCHED 13010 ± 2% +14.4% 14888 ± 5% softirqs.CPU97.SCHED 2315644 ± 4% -36.2% 1477200 ± 4% softirqs.RCU 1572 ± 10% +63.9% 2578 ± 39% interrupts.CPU0.NMI:Non-maskable_interrupts 1572 ± 10% +63.9% 2578 ± 39% interrupts.CPU0.PMI:Performance_monitoring_interrupts 252.00 ± 11% -35.2% 163.25 ± 13% interrupts.CPU104.RES:Rescheduling_interrupts 2738 ± 24% +52.4% 4173 ± 19% interrupts.CPU105.NMI:Non-maskable_interrupts 2738 ± 24% +52.4% 4173 ± 19% interrupts.CPU105.PMI:Performance_monitoring_interrupts 245.75 ± 19% -31.0% 169.50 ± 7% interrupts.CPU105.RES:Rescheduling_interrupts 228.75 ± 13% -24.7% 172.25 ± 19% interrupts.CPU106.RES:Rescheduling_interrupts 2243 ± 15% +66.3% 3730 ± 35% interrupts.CPU113.NMI:Non-maskable_interrupts 2243 ± 15% +66.3% 3730 ± 35% interrupts.CPU113.PMI:Performance_monitoring_interrupts 2703 ± 31% +67.0% 4514 ± 33% interrupts.CPU118.NMI:Non-maskable_interrupts 2703 ± 31% +67.0% 4514 ± 33% interrupts.CPU118.PMI:Performance_monitoring_interrupts 2613 ± 25% +42.2% 3715 ± 24% interrupts.CPU121.NMI:Non-maskable_interrupts 2613 ± 25% +42.2% 3715 ± 24% interrupts.CPU121.PMI:Performance_monitoring_interrupts 311.50 ± 23% -47.7% 163.00 ± 9% interrupts.CPU122.RES:Rescheduling_interrupts 266.75 ± 19% -31.6% 182.50 ± 15% interrupts.CPU124.RES:Rescheduling_interrupts 293.75 ± 33% -32.3% 198.75 ± 19% interrupts.CPU125.RES:Rescheduling_interrupts 2601 ± 36% +43.2% 3724 ± 29% interrupts.CPU127.NMI:Non-maskable_interrupts 2601 ± 36% +43.2% 3724 ± 29% interrupts.CPU127.PMI:Performance_monitoring_interrupts 2258 ± 21% +68.2% 3797 ± 29% interrupts.CPU13.NMI:Non-maskable_interrupts 2258 ± 21% +68.2% 3797 ± 29% interrupts.CPU13.PMI:Performance_monitoring_interrupts 3338 ± 29% +54.6% 5160 ± 9% interrupts.CPU139.NMI:Non-maskable_interrupts 3338 ± 29% +54.6% 5160 ± 9% interrupts.CPU139.PMI:Performance_monitoring_interrupts 219.50 ± 27% -23.0% 169.00 ± 21% interrupts.CPU139.RES:Rescheduling_interrupts 290.25 ± 25% -32.5% 196.00 ± 11% interrupts.CPU14.RES:Rescheduling_interrupts 243.50 ± 4% -16.0% 204.50 ± 12% interrupts.CPU140.RES:Rescheduling_interrupts 1797 ± 15% +135.0% 4223 ± 46% interrupts.CPU147.NMI:Non-maskable_interrupts 1797 ± 15% +135.0% 4223 ± 46% interrupts.CPU147.PMI:Performance_monitoring_interrupts 2537 ± 22% +89.6% 4812 ± 28% interrupts.CPU15.NMI:Non-maskable_interrupts 2537 ± 22% +89.6% 4812 ± 28% interrupts.CPU15.PMI:Performance_monitoring_interrupts 292.25 ± 34% -33.9% 193.25 ± 6% interrupts.CPU15.RES:Rescheduling_interrupts 424.25 ± 37% -58.5% 176.25 ± 14% interrupts.CPU158.RES:Rescheduling_interrupts 312.50 ± 42% -54.2% 143.00 ± 18% interrupts.CPU159.RES:Rescheduling_interrupts 725.00 ±118% -75.7% 
176.25 ± 14% interrupts.CPU163.RES:Rescheduling_interrupts 2367 ± 6% +59.9% 3786 ± 24% interrupts.CPU177.NMI:Non-maskable_interrupts 2367 ± 6% +59.9% 3786 ± 24% interrupts.CPU177.PMI:Performance_monitoring_interrupts 239.50 ± 30% -46.6% 128.00 ± 14% interrupts.CPU179.RES:Rescheduling_interrupts 320.75 ± 15% -24.0% 243.75 ± 20% interrupts.CPU20.RES:Rescheduling_interrupts 302.50 ± 17% -47.2% 159.75 ± 8% interrupts.CPU200.RES:Rescheduling_interrupts 2166 ± 5% +92.0% 4157 ± 40% interrupts.CPU207.NMI:Non-maskable_interrupts 2166 ± 5% +92.0% 4157 ± 40% interrupts.CPU207.PMI:Performance_monitoring_interrupts 217.00 ± 11% -34.6% 142.00 ± 12% interrupts.CPU214.RES:Rescheduling_interrupts 2610 ± 36% +47.4% 3848 ± 35% interrupts.CPU215.NMI:Non-maskable_interrupts 2610 ± 36% +47.4% 3848 ± 35% interrupts.CPU215.PMI:Performance_monitoring_interrupts 2046 ± 13% +118.6% 4475 ± 43% interrupts.CPU22.NMI:Non-maskable_interrupts 2046 ± 13% +118.6% 4475 ± 43% interrupts.CPU22.PMI:Performance_monitoring_interrupts 289.50 ± 28% -41.1% 170.50 ± 8% interrupts.CPU22.RES:Rescheduling_interrupts 2232 ± 6% +33.0% 2970 ± 24% interrupts.CPU221.NMI:Non-maskable_interrupts 2232 ± 6% +33.0% 2970 ± 24% interrupts.CPU221.PMI:Performance_monitoring_interrupts 4552 ± 12% -27.6% 3295 ± 15% interrupts.CPU222.NMI:Non-maskable_interrupts 4552 ± 12% -27.6% 3295 ± 15% interrupts.CPU222.PMI:Performance_monitoring_interrupts 2013 ± 15% +80.9% 3641 ± 27% interrupts.CPU226.NMI:Non-maskable_interrupts 2013 ± 15% +80.9% 3641 ± 27% interrupts.CPU226.PMI:Performance_monitoring_interrupts 2575 ± 49% +67.1% 4302 ± 34% interrupts.CPU227.NMI:Non-maskable_interrupts 2575 ± 49% +67.1% 4302 ± 34% interrupts.CPU227.PMI:Performance_monitoring_interrupts 248.00 ± 36% -36.3% 158.00 ± 19% interrupts.CPU228.RES:Rescheduling_interrupts 2441 ± 24% +43.0% 3490 ± 30% interrupts.CPU23.NMI:Non-maskable_interrupts 2441 ± 24% +43.0% 3490 ± 30% interrupts.CPU23.PMI:Performance_monitoring_interrupts 404.25 ± 69% -65.5% 139.50 ± 17% interrupts.CPU236.RES:Rescheduling_interrupts 566.50 ± 40% -73.6% 149.50 ± 31% interrupts.CPU237.RES:Rescheduling_interrupts 243.50 ± 26% -37.1% 153.25 ± 21% interrupts.CPU248.RES:Rescheduling_interrupts 258.25 ± 12% -53.5% 120.00 ± 18% interrupts.CPU249.RES:Rescheduling_interrupts 2888 ± 27% +49.4% 4313 ± 30% interrupts.CPU253.NMI:Non-maskable_interrupts 2888 ± 27% +49.4% 4313 ± 30% interrupts.CPU253.PMI:Performance_monitoring_interrupts 2468 ± 44% +67.3% 4131 ± 37% interrupts.CPU256.NMI:Non-maskable_interrupts 2468 ± 44% +67.3% 4131 ± 37% interrupts.CPU256.PMI:Performance_monitoring_interrupts 425.00 ± 59% -60.3% 168.75 ± 34% interrupts.CPU258.RES:Rescheduling_interrupts 1859 ± 16% +106.3% 3834 ± 44% interrupts.CPU268.NMI:Non-maskable_interrupts 1859 ± 16% +106.3% 3834 ± 44% interrupts.CPU268.PMI:Performance_monitoring_interrupts 2684 ± 28% +61.2% 4326 ± 36% interrupts.CPU269.NMI:Non-maskable_interrupts 2684 ± 28% +61.2% 4326 ± 36% interrupts.CPU269.PMI:Performance_monitoring_interrupts 2171 ± 6% +108.8% 4533 ± 20% interrupts.CPU270.NMI:Non-maskable_interrupts 2171 ± 6% +108.8% 4533 ± 20% interrupts.CPU270.PMI:Performance_monitoring_interrupts 2262 ± 14% +61.8% 3659 ± 37% interrupts.CPU273.NMI:Non-maskable_interrupts 2262 ± 14% +61.8% 3659 ± 37% interrupts.CPU273.PMI:Performance_monitoring_interrupts 2203 ± 11% +50.7% 3320 ± 38% interrupts.CPU279.NMI:Non-maskable_interrupts 2203 ± 11% +50.7% 3320 ± 38% interrupts.CPU279.PMI:Performance_monitoring_interrupts 2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.NMI:Non-maskable_interrupts 
2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.PMI:Performance_monitoring_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.NMI:Non-maskable_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.PMI:Performance_monitoring_interrupts 331.75 ± 32% -39.8% 199.75 ± 17% interrupts.CPU29.RES:Rescheduling_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.NMI:Non-maskable_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.PMI:Performance_monitoring_interrupts 298.50 ± 30% -39.7% 180.00 ± 6% interrupts.CPU34.RES:Rescheduling_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.NMI:Non-maskable_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.PMI:Performance_monitoring_interrupts 270.50 ± 24% -31.1% 186.25 ± 3% interrupts.CPU36.RES:Rescheduling_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.NMI:Non-maskable_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.PMI:Performance_monitoring_interrupts 286.75 ± 36% -32.4% 193.75 ± 7% interrupts.CPU45.RES:Rescheduling_interrupts 259.00 ± 12% -23.6% 197.75 ± 13% interrupts.CPU46.RES:Rescheduling_interrupts 244.00 ± 21% -35.6% 157.25 ± 11% interrupts.CPU47.RES:Rescheduling_interrupts 230.00 ± 7% -21.3% 181.00 ± 11% interrupts.CPU48.RES:Rescheduling_interrupts 281.00 ± 13% -27.4% 204.00 ± 15% interrupts.CPU53.RES:Rescheduling_interrupts 256.75 ± 5% -18.4% 209.50 ± 12% interrupts.CPU54.RES:Rescheduling_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.NMI:Non-maskable_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.PMI:Performance_monitoring_interrupts 316.00 ± 25% -41.4% 185.25 ± 13% interrupts.CPU59.RES:Rescheduling_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.NMI:Non-maskable_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.PMI:Performance_monitoring_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.NMI:Non-maskable_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.PMI:Performance_monitoring_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.NMI:Non-maskable_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.PMI:Performance_monitoring_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.NMI:Non-maskable_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.PMI:Performance_monitoring_interrupts 319.00 ± 40% -44.7% 176.25 ± 9% interrupts.CPU67.RES:Rescheduling_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.NMI:Non-maskable_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.PMI:Performance_monitoring_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.NMI:Non-maskable_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.PMI:Performance_monitoring_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.NMI:Non-maskable_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.PMI:Performance_monitoring_interrupts 426.75 ± 61% -67.7% 138.00 ± 8% interrupts.CPU75.RES:Rescheduling_interrupts 192.50 ± 13% +45.6% 280.25 ± 45% interrupts.CPU76.RES:Rescheduling_interrupts 274.25 ± 34% -42.2% 158.50 ± 34% interrupts.CPU77.RES:Rescheduling_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.NMI:Non-maskable_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.PMI:Performance_monitoring_interrupts 348.50 ± 53% -47.3% 183.75 ± 29% interrupts.CPU80.RES:Rescheduling_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.NMI:Non-maskable_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.PMI:Performance_monitoring_interrupts 2235 ± 10% +117.8% 4867 ± 10% interrupts.CPU90.NMI:Non-maskable_interrupts 2235 ± 10% +117.8% 4867 ± 10% 
interrupts.CPU90.PMI:Performance_monitoring_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.NMI:Non-maskable_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.PMI:Performance_monitoring_interrupts 408.75 ± 58% -56.8% 176.75 ± 25% interrupts.CPU92.RES:Rescheduling_interrupts 399.00 ± 64% -63.6% 145.25 ± 16% interrupts.CPU93.RES:Rescheduling_interrupts 314.75 ± 36% -44.2% 175.75 ± 13% interrupts.CPU94.RES:Rescheduling_interrupts 191.00 ± 15% -29.1% 135.50 ± 9% interrupts.CPU97.RES:Rescheduling_interrupts 94.00 ± 8% +50.0% 141.00 ± 12% interrupts.IWI:IRQ_work_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.NMI:Non-maskable_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.PMI:Performance_monitoring_interrupts 12.75 ± 11% -4.1 8.67 ± 31% perf-profile.calltrace.cycles-pp.do_rw_once 1.02 ± 16% -0.6 0.47 ± 59% perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle 1.10 ± 15% -0.4 0.66 ± 14% perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry 1.05 ± 16% -0.4 0.61 ± 14% perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter 1.58 ± 4% +0.3 1.91 ± 7% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.11 ± 4% +0.5 2.60 ± 7% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.90 ± 5% +0.6 2.45 ± 7% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage 0.65 ± 62% +0.6 1.20 ± 15% perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.60 ± 62% +0.6 1.16 ± 18% perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap 0.95 ± 17% +0.6 1.52 ± 8% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner 0.61 ± 62% +0.6 1.18 ± 18% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group 1.30 ± 9% +0.6 1.92 ± 8% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock 0.19 ±173% +0.7 0.89 ± 20% 
perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu 0.19 ±173% +0.7 0.90 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu 0.00 +0.8 0.77 ± 30% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page 0.00 +0.8 0.78 ± 30% perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page 0.00 +0.8 0.79 ± 29% perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 0.82 ± 67% +0.9 1.72 ± 22% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.84 ± 66% +0.9 1.74 ± 20% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 2.52 ± 6% +0.9 3.44 ± 9% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page 0.83 ± 67% +0.9 1.75 ± 21% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 0.84 ± 66% +0.9 1.77 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 1.64 ± 12% +1.0 2.67 ± 7% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault 1.65 ± 45% +1.3 2.99 ± 18% perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 1.74 ± 13% +1.4 3.16 ± 6% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault 2.56 ± 48% +2.2 4.81 ± 19% perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault 12.64 ± 14% +3.6 16.20 ± 8% perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault 2.97 ± 7% +3.8 6.74 ± 9% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow 19.99 ± 9% +4.1 24.05 ± 6% perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault 1.37 ± 15% -0.5 0.83 ± 13% perf-profile.children.cycles-pp.sched_clock_cpu 1.31 ± 16% -0.5 0.78 ± 13% perf-profile.children.cycles-pp.sched_clock 1.29 ± 16% -0.5 0.77 ± 13% perf-profile.children.cycles-pp.native_sched_clock 1.80 ± 2% -0.3 1.47 ± 10% perf-profile.children.cycles-pp.task_tick_fair 0.73 ± 2% -0.2 0.54 ± 11% perf-profile.children.cycles-pp.update_curr 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.children.cycles-pp.account_process_tick 0.73 ± 10% -0.2 0.58 ± 9% perf-profile.children.cycles-pp.rcu_sched_clock_irq 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.children.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs 0.40 ± 12% -0.1 0.30 ± 14% perf-profile.children.cycles-pp.__next_timer_interrupt 0.47 ± 7% -0.1 0.39 ± 13% perf-profile.children.cycles-pp.update_rq_clock 0.29 ± 12% -0.1 0.21 ± 15% perf-profile.children.cycles-pp.cpuidle_governor_latency_req 0.21 ± 7% -0.1 0.14 ± 12% perf-profile.children.cycles-pp.account_system_index_time 0.38 ± 2% -0.1 0.31 ± 12% perf-profile.children.cycles-pp.timerqueue_add 0.26 ± 11% -0.1 0.20 ± 13% 
perf-profile.children.cycles-pp.find_next_bit 0.23 ± 15% -0.1 0.17 ± 15% perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit 0.14 ± 8% -0.1 0.07 ± 14% perf-profile.children.cycles-pp.account_user_time 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.children.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.irq_work_tick 0.11 ± 13% -0.0 0.07 ± 25% perf-profile.children.cycles-pp.tick_sched_do_timer 0.12 ± 10% -0.0 0.08 ± 15% perf-profile.children.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.children.cycles-pp.raise_softirq 0.12 ± 3% -0.0 0.09 ± 8% perf-profile.children.cycles-pp.write 0.11 ± 13% +0.0 0.14 ± 8% perf-profile.children.cycles-pp.native_write_msr 0.09 ± 9% +0.0 0.11 ± 7% perf-profile.children.cycles-pp.finish_task_switch 0.10 ± 10% +0.0 0.13 ± 5% perf-profile.children.cycles-pp.schedule_idle 0.07 ± 6% +0.0 0.10 ± 12% perf-profile.children.cycles-pp.__read_nocancel 0.04 ± 58% +0.0 0.07 ± 15% perf-profile.children.cycles-pp.__free_pages_ok 0.06 ± 7% +0.0 0.09 ± 13% perf-profile.children.cycles-pp.perf_read 0.07 +0.0 0.11 ± 14% perf-profile.children.cycles-pp.perf_evsel__read_counter 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.cmd_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.__run_perf_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.process_interval 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.read_counters 0.07 ± 22% +0.0 0.11 ± 19% perf-profile.children.cycles-pp.__handle_mm_fault 0.07 ± 19% +0.1 0.13 ± 8% perf-profile.children.cycles-pp.rb_erase 0.03 ±100% +0.1 0.09 ± 9% perf-profile.children.cycles-pp.smp_call_function_single 0.01 ±173% +0.1 0.08 ± 11% perf-profile.children.cycles-pp.perf_event_read 0.00 +0.1 0.07 ± 13% perf-profile.children.cycles-pp.__perf_event_read_value 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.__intel_pmu_enable_all 0.08 ± 17% +0.1 0.15 ± 8% perf-profile.children.cycles-pp.native_apic_msr_eoi_write 0.04 ±103% +0.1 0.13 ± 58% perf-profile.children.cycles-pp.shmem_getpage_gfp 0.38 ± 14% +0.1 0.51 ± 6% perf-profile.children.cycles-pp.run_timer_softirq 0.11 ± 4% +0.3 0.37 ± 32% perf-profile.children.cycles-pp.worker_thread 0.20 ± 5% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.ret_from_fork 0.20 ± 4% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.kthread 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.memcpy_erms 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.drm_fb_helper_dirty_work 0.00 +0.3 0.31 ± 37% perf-profile.children.cycles-pp.process_one_work 0.47 ± 48% +0.4 0.91 ± 19% perf-profile.children.cycles-pp.prep_new_huge_page 0.70 ± 29% +0.5 1.16 ± 18% perf-profile.children.cycles-pp.free_huge_page 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_flush_mmu 0.72 ± 29% +0.5 1.18 ± 18% perf-profile.children.cycles-pp.release_pages 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_finish_mmu 0.76 ± 27% +0.5 1.23 ± 18% perf-profile.children.cycles-pp.exit_mmap 0.77 ± 27% +0.5 1.24 ± 18% perf-profile.children.cycles-pp.mmput 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.__x64_sys_exit_group 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_group_exit 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_exit 1.28 ± 29% +0.5 1.76 ± 9% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler 0.77 ± 28% +0.5 1.26 ± 13% perf-profile.children.cycles-pp.alloc_fresh_huge_page 1.53 ± 15% +0.7 2.26 ± 14% perf-profile.children.cycles-pp.do_syscall_64 1.53 ± 15% +0.7 2.27 ± 14% 
perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 1.13 ± 3% +0.9 2.07 ± 14% perf-profile.children.cycles-pp.interrupt_entry 0.79 ± 9% +1.0 1.76 ± 5% perf-profile.children.cycles-pp.perf_event_task_tick 1.71 ± 39% +1.4 3.08 ± 16% perf-profile.children.cycles-pp.alloc_surplus_huge_page 2.66 ± 42% +2.3 4.94 ± 17% perf-profile.children.cycles-pp.alloc_huge_page 2.89 ± 45% +2.7 5.54 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 3.34 ± 35% +2.7 6.02 ± 17% perf-profile.children.cycles-pp._raw_spin_lock 12.77 ± 14% +3.9 16.63 ± 7% perf-profile.children.cycles-pp.mutex_spin_on_owner 20.12 ± 9% +4.0 24.16 ± 6% perf-profile.children.cycles-pp.hugetlb_cow 15.40 ± 10% -3.6 11.84 ± 28% perf-profile.self.cycles-pp.do_rw_once 4.02 ± 9% -1.3 2.73 ± 30% perf-profile.self.cycles-pp.do_access 2.00 ± 14% -0.6 1.41 ± 13% perf-profile.self.cycles-pp.cpuidle_enter_state 1.26 ± 16% -0.5 0.74 ± 13% perf-profile.self.cycles-pp.native_sched_clock 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.self.cycles-pp.account_process_tick 0.27 ± 19% -0.2 0.12 ± 17% perf-profile.self.cycles-pp.timerqueue_del 0.53 ± 3% -0.1 0.38 ± 11% perf-profile.self.cycles-pp.update_curr 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.self.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs 0.61 ± 4% -0.1 0.51 ± 8% perf-profile.self.cycles-pp.task_tick_fair 0.20 ± 8% -0.1 0.12 ± 14% perf-profile.self.cycles-pp.account_system_index_time 0.23 ± 15% -0.1 0.16 ± 17% perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit 0.25 ± 11% -0.1 0.18 ± 14% perf-profile.self.cycles-pp.find_next_bit 0.10 ± 11% -0.1 0.03 ±100% perf-profile.self.cycles-pp.tick_sched_do_timer 0.29 -0.1 0.23 ± 11% perf-profile.self.cycles-pp.timerqueue_add 0.12 ± 10% -0.1 0.06 ± 17% perf-profile.self.cycles-pp.account_user_time 0.22 ± 15% -0.1 0.16 ± 6% perf-profile.self.cycles-pp.scheduler_tick 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.self.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.self.cycles-pp.irq_work_tick 0.07 ± 13% -0.0 0.03 ±100% perf-profile.self.cycles-pp.update_process_times 0.12 ± 7% -0.0 0.08 ± 15% perf-profile.self.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.self.cycles-pp.raise_softirq 0.12 ± 11% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.tick_nohz_get_sleep_length 0.11 ± 11% +0.0 0.14 ± 6% perf-profile.self.cycles-pp.native_write_msr 0.10 ± 5% +0.1 0.15 ± 8% perf-profile.self.cycles-pp.__remove_hrtimer 0.07 ± 23% +0.1 0.13 ± 8% perf-profile.self.cycles-pp.rb_erase 0.08 ± 17% +0.1 0.15 ± 7% perf-profile.self.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.smp_call_function_single 0.32 ± 17% +0.1 0.42 ± 7% perf-profile.self.cycles-pp.run_timer_softirq 0.22 ± 5% +0.1 0.34 ± 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now 0.45 ± 15% +0.2 0.60 ± 12% perf-profile.self.cycles-pp.rcu_irq_enter 0.31 ± 8% +0.2 0.46 ± 16% perf-profile.self.cycles-pp.irq_enter 0.29 ± 10% +0.2 0.44 ± 16% perf-profile.self.cycles-pp.apic_timer_interrupt 0.71 ± 30% +0.2 0.92 ± 8% perf-profile.self.cycles-pp.perf_mux_hrtimer_handler 0.00 +0.3 0.28 ± 37% perf-profile.self.cycles-pp.memcpy_erms 1.12 ± 3% +0.9 2.02 ± 15% perf-profile.self.cycles-pp.interrupt_entry 0.79 ± 9% +0.9 1.73 ± 5% perf-profile.self.cycles-pp.perf_event_task_tick 2.49 ± 45% +2.1 4.55 ± 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 10.95 ± 15% +2.7 13.61 ± 8% perf-profile.self.cycles-pp.mutex_spin_on_owner
vm-scalability.throughput / vm-scalability.time.minor_page_faults / vm-scalability.workload
[ASCII trend charts omitted; in all three charts the bisect-good samples ([*]) cluster around the parent commit's level, while the bisect-bad samples ([O]) cluster at the lower, regressed level.]
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks, Rong Chen
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
On 29.07.19 at 11:51, kernel test robot wrote:
Greeting,
FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. If it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
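[For illustration only: a minimal userspace model of the two update strategies described above. Every name here (model_bo, model_vmap, ...) is a hypothetical stand-in and none of this is the actual helper or driver code; model_vmap()/model_vunmap() merely mark where the real, much more expensive vmap/vunmap of the BO would happen.]

#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a VRAM-backed buffer object. */
struct model_bo {
        unsigned char *backing; /* models the BO's VRAM pages */
        unsigned char *vaddr;   /* models a cached kernel mapping, NULL if unmapped */
        size_t size;
};

/* Models setting up a kernel mapping; the real vmap is far more expensive. */
static unsigned char *model_vmap(struct model_bo *bo)
{
        return bo->backing;
}

/* Models tearing the mapping down; the real vunmap touches page tables. */
static void model_vunmap(struct model_bo *bo)
{
        (void)bo;
}

/* Roughly what the generic fbdev dirty worker does now: map, copy, unmap. */
static void update_map_copy_unmap(struct model_bo *bo, const unsigned char *shadow,
                                  size_t off, size_t len)
{
        unsigned char *dst = model_vmap(bo);    /* paid on every screen update */

        memcpy(dst + off, shadow + off, len);   /* copy from the shadow FB */
        model_vunmap(bo);                       /* paid on every screen update */
}

/* Roughly what the old mgag200 fbdev code did: reuse a long-lived mapping. */
static void update_cached_mapping(struct model_bo *bo, const unsigned char *shadow,
                                  size_t off, size_t len)
{
        memcpy(bo->vaddr + off, shadow + off, len);
}

int main(void)
{
        static unsigned char vram[1024], shadow[1024];
        struct model_bo bo = { .backing = vram, .size = sizeof(vram) };

        update_map_copy_unmap(&bo, shadow, 0, 64);

        bo.vaddr = model_vmap(&bo);     /* map once, e.g. while being displayed */
        update_cached_mapping(&bo, shadow, 0, 64);
        model_vunmap(&bo);

        return 0;
}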
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
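[For illustration only: a minimal sketch of ref-counted vmap caching, assuming a hypothetical cached_bo structure. This is not the kernel's API, just the general idea of mapping on first use and unmapping with the last user; real code would also need locking and a check for unbalanced vunmap calls.]

#include <stddef.h>

/* Hypothetical BO with a ref-counted, cached kernel mapping. */
struct cached_bo {
        void *backing;                  /* models the BO's memory */
        void *vaddr;                    /* cached mapping, NULL while unmapped */
        unsigned int vmap_use_count;    /* protected by a lock in real code */
};

/* Stand-ins for the driver's real (expensive) map/unmap primitives. */
static void *cached_bo_do_vmap(struct cached_bo *bo)
{
        return bo->backing;
}

static void cached_bo_do_vunmap(struct cached_bo *bo)
{
        (void)bo;
}

/* Map only for the first user; later users reuse the cached address. */
void *cached_bo_vmap(struct cached_bo *bo)
{
        if (bo->vmap_use_count++ == 0)
                bo->vaddr = cached_bo_do_vmap(bo);
        return bo->vaddr;
}

/* Drop the mapping only when the last user is done. */
void cached_bo_vunmap(struct cached_bo *bo)
{
        if (--bo->vmap_use_count == 0) {
                cached_bo_do_vunmap(bo);
                bo->vaddr = NULL;
        }
}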
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed a light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do really. Yes, it's a regression, but vm testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
It's likely the test runs on the console and printfs stuff out while running.
Dave.
On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
On 29.07.19 at 11:51, kernel test robot wrote:
Greeting,
FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:>
commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. If it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed a light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do really. Yes, it's a regression, but vm testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ... -Daniel
Hi, Daniel,
Daniel Vetter daniel@ffwll.ch writes:
On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Am 29.07.19 um 11:51 schrieb kernel test robot: > Greeting, > > FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:> > > commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") > https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. If it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed a light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do really. Yes, it's a regression, but vm testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Best Regards, Huang, Ying
Hi
On 31.07.19 at 11:25, Huang, Ying wrote:
Hi, Daniel,
Daniel Vetter daniel@ffwll.ch writes:
On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote: > Am 29.07.19 um 11:51 schrieb kernel test robot: >> Greeting, >> >> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:> >> >> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master > > Daniel, Noralf, we may have to revert this patch. > > I expected some change in display performance, but not in VM. Since it's > a server chipset, probably no one cares much about display performance. > So that seemed like a good trade-off for re-using shared code. > > Part of the patch set is that the generic fb emulation now maps and > unmaps the fbdev BO when updating the screen. I guess that's the cause > of the performance regression. And it should be visible with other > drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. If it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
> The thing is that we'd need another generic fbdev emulation for ast and > mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed a light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do really. Yes, it's a regression, but vm testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Take a look at commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Best regards Thomas
Best Regards, Huang, Ying
On 2019-07-31 11:25 a.m., Huang, Ying wrote:
Hi, Daniel,
Daniel Vetter daniel@ffwll.ch writes:
On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote: > Am 29.07.19 um 11:51 schrieb kernel test robot: >> Greeting, >> >> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:> >> >> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master > > Daniel, Noralf, we may have to revert this patch. > > I expected some change in display performance, but not in VM. Since it's > a server chipset, probably no one cares much about display performance. > So that seemed like a good trade-off for re-using shared code. > > Part of the patch set is that the generic fb emulation now maps and > unmaps the fbdev BO when updating the screen. I guess that's the cause > of the performance regression. And it should be visible with other > drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out of VRAM when it's not being displayed. If it's not mapped, it can be evicted to make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
> The thing is that we'd need another generic fbdev emulation for ast and > mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed a light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do really. Yes, it's a regression, but vm testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Hi,
On 7/31/19 6:21 PM, Michel Dänzer wrote:
On 2019-07-31 11:25 a.m., Huang, Ying wrote:
Hi, Daniel,
Daniel Vetter daniel@ffwll.ch writes:
On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
Am 30.07.19 um 20:12 schrieb Daniel Vetter: > On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote: >> Am 29.07.19 um 11:51 schrieb kernel test robot: >>> Greeting, >>> >>> FYI, we noticed a -18.8% regression of vm-scalability.median due to commit:> >>> >>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master >> Daniel, Noralf, we may have to revert this patch. >> >> I expected some change in display performance, but not in VM. Since it's >> a server chipset, probably no one cares much about display performance. >> So that seemed like a good trade-off for re-using shared code. >> >> Part of the patch set is that the generic fb emulation now maps and >> unmaps the fbdev BO when updating the screen. I guess that's the cause >> of the performance regression. And it should be visible with other >> drivers as well if they use a shadow FB for fbdev emulation. > For fbcon we should need to do any maps/unamps at all, this is for the > fbdev mmap support only. If the testcase mentioned here tests fbdev > mmap handling it's pretty badly misnamed :-) And as long as you don't > have an fbdev mmap there shouldn't be any impact at all. The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out if it's not being displayed. If not being mapped, it can be evicted and make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (as yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
>> The thing is that we'd need another generic fbdev emulation for ast and >> mgag200 that handles this issue properly. > Yeah I dont think we want to jump the gun here. If you can try to > repro locally and profile where we're wasting cpu time I hope that > should sched a light what's going wrong here. I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do really. Yes, it's a regression, but vm testcases shouldn't run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
Best Regards, Rong Chen
On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>> >>>>commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >>>>https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master >>>Daniel, Noralf, we may have to revert this patch. >>> >>>I expected some change in display performance, but not in VM. Since it's >>>a server chipset, probably no one cares much about display performance. >>>So that seemed like a good trade-off for re-using shared code. >>> >>>Part of the patch set is that the generic fb emulation now maps and >>>unmaps the fbdev BO when updating the screen. I guess that's the cause >>>of the performance regression. And it should be visible with other >>>drivers as well if they use a shadow FB for fbdev emulation. >>For fbcon we should need to do any maps/unamps at all, this is for the >>fbdev mmap support only. If the testcase mentioned here tests fbdev >>mmap handling it's pretty badly misnamed :-) And as long as you don't >>have an fbdev mmap there shouldn't be any impact at all. >The ast and mgag200 have only a few MiB of VRAM, so we have to get the >fbdev BO out if it's not being displayed. If not being mapped, it can be >evicted and make room for X, etc. > >To make this work, the BO's memory is mapped and unmapped in >drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] >That fbdev mapping is established on each screen update, more or less. > From my (yet unverified) understanding, this causes the performance >regression in the VM code. > >The original code in mgag200 used to kmap the fbdev BO while it's being >displayed; [2] and the drawing code only mapped it when necessary (i.e., >not being display). [3] Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
>I think this could be added for VRAM helpers as well, but it's still a >workaround and non-VRAM drivers might also run into such a performance >regression if they use the fbdev's shadow fb. Yeah agreed, fbdev emulation should try to cache the vmap.
>Noralf mentioned that there are plans for other DRM clients besides the >console. They would as well run into similar problems. > >>>The thing is that we'd need another generic fbdev emulation for ast and >>>mgag200 that handles this issue properly. >>Yeah I dont think we want to jump the gun here. If you can try to >>repro locally and profile where we're wasting cpu time I hope that >>should sched a light what's going wrong here. >I don't have much time ATM and I'm not even officially at work until >late Aug. I'd send you the revert and investigate later. I agree that >using generic fbdev emulation would be preferable. Still not sure that's the right thing to do really. Yes it's a regression, but vm testcases shouldn run a single line of fbcon or drm code. So why this is impacted so heavily by a silly drm change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
We did more checking and found that this test machine does use the mgag200 driver.
And we suspect the regression is caused by
commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Author: Thomas Zimmermann tzimmermann@suse.de
Date: Wed Jul 3 09:58:24 2019 +0200
drm/fb-helper: Map DRM client buffer only when required
This patch changes DRM clients to not map the buffer by default. The buffer, like any buffer object, should be mapped and unmapped when needed.
An unmapped buffer object can be evicted to system memory and does not consume video ram until displayed. This allows to use generic fbdev emulation with drivers for low-memory devices, such as ast and mgag200.
This change affects the generic framebuffer console. HW-based consoles map their console buffer once and keep it mapped. Userspace can mmap this buffer into its address space. The shadow-buffered framebuffer console only needs the buffer object to be mapped during updates. While not being updated from the shadow buffer, the buffer object can remain unmapped. Userspace will always mmap the shadow buffer.
which may add more load when fbcon is busy printing out messages.
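[For illustration only: a rough model of the lifetime change described in the quoted commit message, with hypothetical names. Before the commit, the client buffer was mapped once at creation and the mapping kept around; after it, the buffer is created unmapped and each access maps and unmaps it, which is the extra work the shadow-FB console now does per update.]

#include <stdlib.h>

/* Hypothetical client buffer; vaddr models the kernel mapping. */
struct model_client_buffer {
        void *bo_backing;
        void *vaddr;
};

/* Old behaviour: map at creation and keep the mapping. */
struct model_client_buffer *model_create_mapped(void *bo_backing)
{
        struct model_client_buffer *b = calloc(1, sizeof(*b));

        if (!b)
                return NULL;
        b->bo_backing = bo_backing;
        b->vaddr = bo_backing;          /* models a one-time, long-lived vmap */
        return b;
}

/* New behaviour: create unmapped; callers map/unmap around each access. */
struct model_client_buffer *model_create_unmapped(void *bo_backing)
{
        struct model_client_buffer *b = calloc(1, sizeof(*b));

        if (!b)
                return NULL;
        b->bo_backing = bo_backing;     /* no mapping until someone needs one */
        return b;
}

void *model_buffer_vmap(struct model_client_buffer *b)
{
        b->vaddr = b->bo_backing;       /* models a per-access vmap */
        return b->vaddr;
}

void model_buffer_vunmap(struct model_client_buffer *b)
{
        b->vaddr = NULL;                /* models the matching vunmap */
}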
We are doing more tests inside 0day to confirm.
Thanks, Feng
Hi
On 01.08.19 at 10:37, Feng Tang wrote:
On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>> >>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master >>>> Daniel, Noralf, we may have to revert this patch. >>>> >>>> I expected some change in display performance, but not in VM. Since it's >>>> a server chipset, probably no one cares much about display performance. >>>> So that seemed like a good trade-off for re-using shared code. >>>> >>>> Part of the patch set is that the generic fb emulation now maps and >>>> unmaps the fbdev BO when updating the screen. I guess that's the cause >>>> of the performance regression. And it should be visible with other >>>> drivers as well if they use a shadow FB for fbdev emulation. >>> For fbcon we should need to do any maps/unamps at all, this is for the >>> fbdev mmap support only. If the testcase mentioned here tests fbdev >>> mmap handling it's pretty badly misnamed :-) And as long as you don't >>> have an fbdev mmap there shouldn't be any impact at all. >> The ast and mgag200 have only a few MiB of VRAM, so we have to get the >> fbdev BO out if it's not being displayed. If not being mapped, it can be >> evicted and make room for X, etc. >> >> To make this work, the BO's memory is mapped and unmapped in >> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] >> That fbdev mapping is established on each screen update, more or less. >> From my (yet unverified) understanding, this causes the performance >> regression in the VM code. >> >> The original code in mgag200 used to kmap the fbdev BO while it's being >> displayed; [2] and the drawing code only mapped it when necessary (i.e., >> not being display). [3] > Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should > cache this. > >> I think this could be added for VRAM helpers as well, but it's still a >> workaround and non-VRAM drivers might also run into such a performance >> regression if they use the fbdev's shadow fb. > Yeah agreed, fbdev emulation should try to cache the vmap. > >> Noralf mentioned that there are plans for other DRM clients besides the >> console. They would as well run into similar problems. >> >>>> The thing is that we'd need another generic fbdev emulation for ast and >>>> mgag200 that handles this issue properly. >>> Yeah I dont think we want to jump the gun here. If you can try to >>> repro locally and profile where we're wasting cpu time I hope that >>> should sched a light what's going wrong here. >> I don't have much time ATM and I'm not even officially at work until >> late Aug. I'd send you the revert and investigate later. I agree that >> using generic fbdev emulation would be preferable. > Still not sure that's the right thing to do really. Yes it's a > regression, but vm testcases shouldn run a single line of fbcon or drm > code. So why this is impacted so heavily by a silly drm change is very > confusing to me. We might be papering over a deeper and much more > serious issue ... It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
We did more checking and found that this test machine does use the mgag200 driver.
And we suspect the regression is caused by
commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Author: Thomas Zimmermann tzimmermann@suse.de
Date: Wed Jul 3 09:58:24 2019 +0200
Yes, that's the commit. Unfortunately, reverting it would require reverting a handful of other patches as well.
I have a potential fix for the problem. Could you run it and verify that it resolves the problem?
Best regards Thomas
drm/fb-helper: Map DRM client buffer only when required

This patch changes DRM clients to not map the buffer by default. The buffer, like any buffer object, should be mapped and unmapped when needed.

An unmapped buffer object can be evicted to system memory and does not consume video ram until displayed. This allows to use generic fbdev emulation with drivers for low-memory devices, such as ast and mgag200.

This change affects the generic framebuffer console. HW-based consoles map their console buffer once and keep it mapped. Userspace can mmap this buffer into its address space. The shadow-buffered framebuffer console only needs the buffer object to be mapped during updates. While not being updated from the shadow buffer, the buffer object can remain unmapped. Userspace will always mmap the shadow buffer.
which may add more load when fbcon is busy printing out messages.
We are doing more tests inside 0day to confirm.
Thanks, Feng
Hi Thomas,
On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
Hi
On 01.08.19 at 10:37, Feng Tang wrote:
On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>> >>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master >>>>> Daniel, Noralf, we may have to revert this patch. >>>>> >>>>> I expected some change in display performance, but not in VM. Since it's >>>>> a server chipset, probably no one cares much about display performance. >>>>> So that seemed like a good trade-off for re-using shared code. >>>>> >>>>> Part of the patch set is that the generic fb emulation now maps and >>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause >>>>> of the performance regression. And it should be visible with other >>>>> drivers as well if they use a shadow FB for fbdev emulation. >>>> For fbcon we should need to do any maps/unamps at all, this is for the >>>> fbdev mmap support only. If the testcase mentioned here tests fbdev >>>> mmap handling it's pretty badly misnamed :-) And as long as you don't >>>> have an fbdev mmap there shouldn't be any impact at all. >>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the >>> fbdev BO out if it's not being displayed. If not being mapped, it can be >>> evicted and make room for X, etc. >>> >>> To make this work, the BO's memory is mapped and unmapped in >>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] >>> That fbdev mapping is established on each screen update, more or less. >>> From my (yet unverified) understanding, this causes the performance >>> regression in the VM code. >>> >>> The original code in mgag200 used to kmap the fbdev BO while it's being >>> displayed; [2] and the drawing code only mapped it when necessary (i.e., >>> not being display). [3] >> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should >> cache this. >> >>> I think this could be added for VRAM helpers as well, but it's still a >>> workaround and non-VRAM drivers might also run into such a performance >>> regression if they use the fbdev's shadow fb. >> Yeah agreed, fbdev emulation should try to cache the vmap. >> >>> Noralf mentioned that there are plans for other DRM clients besides the >>> console. They would as well run into similar problems. >>> >>>>> The thing is that we'd need another generic fbdev emulation for ast and >>>>> mgag200 that handles this issue properly. >>>> Yeah I dont think we want to jump the gun here. If you can try to >>>> repro locally and profile where we're wasting cpu time I hope that >>>> should sched a light what's going wrong here. >>> I don't have much time ATM and I'm not even officially at work until >>> late Aug. I'd send you the revert and investigate later. I agree that >>> using generic fbdev emulation would be preferable. >> Still not sure that's the right thing to do really. Yes it's a >> regression, but vm testcases shouldn run a single line of fbcon or drm >> code. So why this is impacted so heavily by a silly drm change is very >> confusing to me. We might be papering over a deeper and much more >> serious issue ... > It's a regression, the right thing is to revert first and then work > out the right thing to do. Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
> It's likely the test runs on the console and printfs stuff out while running. But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The regression doesn't seem like it should be related to this commit, but we have retested and confirmed it. It's hard to understand what is happening.
Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
We did more checking and found that this test machine does use the mgag200 driver.
And we suspect the regression is caused by
commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Author: Thomas Zimmermann tzimmermann@suse.de
Date: Wed Jul 3 09:58:24 2019 +0200
Yes, that's the commit. Unfortunately, reverting it would require reverting a handful of other patches as well.
I have a potential fix for the problem. Could you run it and verify that it resolves the problem?
Sure, please send it to us. Rong and I will try it.
Thanks, Feng
Best regards Thomas
drm/fb-helper: Map DRM client buffer only when required

This patch changes DRM clients to not map the buffer by default. The buffer, like any buffer object, should be mapped and unmapped when needed.

An unmapped buffer object can be evicted to system memory and does not consume video ram until displayed. This allows to use generic fbdev emulation with drivers for low-memory devices, such as ast and mgag200.

This change affects the generic framebuffer console. HW-based consoles map their console buffer once and keep it mapped. Userspace can mmap this buffer into its address space. The shadow-buffered framebuffer console only needs the buffer object to be mapped during updates. While not being updated from the shadow buffer, the buffer object can remain unmapped. Userspace will always mmap the shadow buffer.
which may add more load when fbcon is busy printing out messages.
We are doing more tests inside 0day to confirm.
Thanks, Feng
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Hi
On 01.08.19 at 13:25, Feng Tang wrote:
Hi Thomas,
On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
Hi
On 01.08.19 at 10:37, Feng Tang wrote:
On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>> >>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master >>>>>> Daniel, Noralf, we may have to revert this patch. >>>>>> >>>>>> I expected some change in display performance, but not in VM. Since it's >>>>>> a server chipset, probably no one cares much about display performance. >>>>>> So that seemed like a good trade-off for re-using shared code. >>>>>> >>>>>> Part of the patch set is that the generic fb emulation now maps and >>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause >>>>>> of the performance regression. And it should be visible with other >>>>>> drivers as well if they use a shadow FB for fbdev emulation. >>>>> For fbcon we should need to do any maps/unamps at all, this is for the >>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev >>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't >>>>> have an fbdev mmap there shouldn't be any impact at all. >>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the >>>> fbdev BO out if it's not being displayed. If not being mapped, it can be >>>> evicted and make room for X, etc. >>>> >>>> To make this work, the BO's memory is mapped and unmapped in >>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] >>>> That fbdev mapping is established on each screen update, more or less. >>>> From my (yet unverified) understanding, this causes the performance >>>> regression in the VM code. >>>> >>>> The original code in mgag200 used to kmap the fbdev BO while it's being >>>> displayed; [2] and the drawing code only mapped it when necessary (i.e., >>>> not being display). [3] >>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should >>> cache this. >>> >>>> I think this could be added for VRAM helpers as well, but it's still a >>>> workaround and non-VRAM drivers might also run into such a performance >>>> regression if they use the fbdev's shadow fb. >>> Yeah agreed, fbdev emulation should try to cache the vmap. >>> >>>> Noralf mentioned that there are plans for other DRM clients besides the >>>> console. They would as well run into similar problems. >>>> >>>>>> The thing is that we'd need another generic fbdev emulation for ast and >>>>>> mgag200 that handles this issue properly. >>>>> Yeah I dont think we want to jump the gun here. If you can try to >>>>> repro locally and profile where we're wasting cpu time I hope that >>>>> should sched a light what's going wrong here. >>>> I don't have much time ATM and I'm not even officially at work until >>>> late Aug. I'd send you the revert and investigate later. I agree that >>>> using generic fbdev emulation would be preferable. >>> Still not sure that's the right thing to do really. Yes it's a >>> regression, but vm testcases shouldn run a single line of fbcon or drm >>> code. So why this is impacted so heavily by a silly drm change is very >>> confusing to me. We might be papering over a deeper and much more >>> serious issue ... >> It's a regression, the right thing is to revert first and then work >> out the right thing to do. > Sure, but I have no idea whether the testcase is doing something > reasonable. If it's accidentally testing vm scalability of fbdev and > there's no one else doing something this pointless, then it's not a > real bug. Plus I think we're shooting the messenger here. 
> >> It's likely the test runs on the console and printfs stuff out while running. > But why did we not regress the world if a few prints on the console > have such a huge impact? We didn't get an entire stream of mails about > breaking stuff ... The regression seems not related to the commit. But we have retested and confirmed the regression. Hard to understand what happens.
Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
We did more checking and found that this test machine does use the mgag200 driver.
And we suspect the regression is caused by
commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Author: Thomas Zimmermann tzimmermann@suse.de
Date: Wed Jul 3 09:58:24 2019 +0200
Yes, that's the commit. Unfortunately, reverting it would require reverting a handful of other patches as well.
I have a potential fix for the problem. Could you run it and verify that it resolves the problem?
Sure, please send it to us. Rong and I will try it.
Fantastic, thank you! The patch set is available on dri-devel at
https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html
Best regards Thomas
Thanks, Feng
Best regards Thomas
drm/fb-helper: Map DRM client buffer only when required

This patch changes DRM clients to not map the buffer by default. The buffer, like any buffer object, should be mapped and unmapped when needed.

An unmapped buffer object can be evicted to system memory and does not consume video ram until displayed. This allows to use generic fbdev emulation with drivers for low-memory devices, such as ast and mgag200.

This change affects the generic framebuffer console. HW-based consoles map their console buffer once and keep it mapped. Userspace can mmap this buffer into its address space. The shadow-buffered framebuffer console only needs the buffer object to be mapped during updates. While not being updated from the shadow buffer, the buffer object can remain unmapped. Userspace will always mmap the shadow buffer.
which may add more load when fbcon is busy printing out messages.
We are doing more tests inside 0day to confirm.
Thanks, Feng
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Hi,
On 8/1/19 7:58 PM, Thomas Zimmermann wrote:
Hi
On 01.08.19 at 13:25, Feng Tang wrote:
Hi Thomas,
On Thu, Aug 01, 2019 at 11:59:28AM +0200, Thomas Zimmermann wrote:
Hi
On 01.08.19 at 10:37, Feng Tang wrote:
On Thu, Aug 01, 2019 at 02:19:53PM +0800, Rong Chen wrote:
>>>>>>>> commit: 90f479ae51afa45efab97afdde9b94b9660dd3e4 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation") >>>>>>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git master >>>>>>> Daniel, Noralf, we may have to revert this patch. >>>>>>> >>>>>>> I expected some change in display performance, but not in VM. Since it's >>>>>>> a server chipset, probably no one cares much about display performance. >>>>>>> So that seemed like a good trade-off for re-using shared code. >>>>>>> >>>>>>> Part of the patch set is that the generic fb emulation now maps and >>>>>>> unmaps the fbdev BO when updating the screen. I guess that's the cause >>>>>>> of the performance regression. And it should be visible with other >>>>>>> drivers as well if they use a shadow FB for fbdev emulation. >>>>>> For fbcon we should need to do any maps/unamps at all, this is for the >>>>>> fbdev mmap support only. If the testcase mentioned here tests fbdev >>>>>> mmap handling it's pretty badly misnamed :-) And as long as you don't >>>>>> have an fbdev mmap there shouldn't be any impact at all. >>>>> The ast and mgag200 have only a few MiB of VRAM, so we have to get the >>>>> fbdev BO out if it's not being displayed. If not being mapped, it can be >>>>> evicted and make room for X, etc. >>>>> >>>>> To make this work, the BO's memory is mapped and unmapped in >>>>> drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] >>>>> That fbdev mapping is established on each screen update, more or less. >>>>> From my (yet unverified) understanding, this causes the performance >>>>> regression in the VM code. >>>>> >>>>> The original code in mgag200 used to kmap the fbdev BO while it's being >>>>> displayed; [2] and the drawing code only mapped it when necessary (i.e., >>>>> not being display). [3] >>>> Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should >>>> cache this. >>>> >>>>> I think this could be added for VRAM helpers as well, but it's still a >>>>> workaround and non-VRAM drivers might also run into such a performance >>>>> regression if they use the fbdev's shadow fb. >>>> Yeah agreed, fbdev emulation should try to cache the vmap. >>>> >>>>> Noralf mentioned that there are plans for other DRM clients besides the >>>>> console. They would as well run into similar problems. >>>>> >>>>>>> The thing is that we'd need another generic fbdev emulation for ast and >>>>>>> mgag200 that handles this issue properly. >>>>>> Yeah I dont think we want to jump the gun here. If you can try to >>>>>> repro locally and profile where we're wasting cpu time I hope that >>>>>> should sched a light what's going wrong here. >>>>> I don't have much time ATM and I'm not even officially at work until >>>>> late Aug. I'd send you the revert and investigate later. I agree that >>>>> using generic fbdev emulation would be preferable. >>>> Still not sure that's the right thing to do really. Yes it's a >>>> regression, but vm testcases shouldn run a single line of fbcon or drm >>>> code. So why this is impacted so heavily by a silly drm change is very >>>> confusing to me. We might be papering over a deeper and much more >>>> serious issue ... >>> It's a regression, the right thing is to revert first and then work >>> out the right thing to do. >> Sure, but I have no idea whether the testcase is doing something >> reasonable. If it's accidentally testing vm scalability of fbdev and >> there's no one else doing something this pointless, then it's not a >> real bug. 
Plus I think we're shooting the messenger here. >> >>> It's likely the test runs on the console and printfs stuff out while running. >> But why did we not regress the world if a few prints on the console >> have such a huge impact? We didn't get an entire stream of mails about >> breaking stuff ... > The regression seems not related to the commit. But we have retested > and confirmed the regression. Hard to understand what happens. Does the regressed test cause any output on console while it's measuring? If so, it's probably accidentally measuring fbcon/DRM code in addition to the workload it's trying to measure.
Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
We did more checking and found that this test machine does use the mgag200 driver.
And we suspect the regression is caused by
commit cf1ca9aeb930df074bb5bbcde55f935fec04e529
Author: Thomas Zimmermann tzimmermann@suse.de
Date: Wed Jul 3 09:58:24 2019 +0200
Yes, that's the commit. Unfortunately, reverting it would require reverting a handful of other patches as well.
I have a potential fix for the problem. Could you run it and verify that it resolves the problem?
Sure, please send it to us. Rong and I will try it.
Fantastic, thank you! The patch set is available on dri-devel at
https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html
The patch set improves the performance slightly, but the change is not very significant.
$ git log --oneline 8f7ec6bcc7 -5
8f7ec6bcc75a9 drm/mgag200: Map fbdev framebuffer while it's being displayed
abcb1cf24033a drm/ast: Map fbdev framebuffer while it's being displayed
a92f80044c623 drm/vram-helpers: Add kmap ref-counting to GEM VRAM objects
90f479ae51afa drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a7 drm/bochs: Use shadow buffer for bochs framebuffer console

commit:
  f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
  90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")
  8f7ec6bcc7 ("drm/mgag200: Map fbdev framebuffer while it's being displayed")

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  8f7ec6bcc75a996f5c6b39a9cf  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
         %stddev      change         %stddev      change         %stddev
              \          |                \          |                \
           43921        -18%           35884        -17%           36629   vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
           43921        -18%           35884        -17%           36629   GEO-MEAN vm-scalability.median
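[For illustration only: a sketch of the approach named in the "Map fbdev framebuffer while it's being displayed" patches listed above, with hypothetical names; the real patches live in the drivers' display code. The idea is to map the framebuffer when scanout starts and unmap it when scanout stops, so per-update copies reuse a cached mapping while the BO can still be evicted when it is not displayed.]

#include <stddef.h>
#include <string.h>

/* Hypothetical display-pipe state for a VRAM-backed fbdev framebuffer. */
struct model_pipe {
        unsigned char *fb_backing;      /* models the framebuffer BO in VRAM */
        unsigned char *fb_vaddr;        /* mapping held while being displayed */
        size_t fb_size;
};

/* Called when the framebuffer starts being scanned out: map it once. */
void model_pipe_enable(struct model_pipe *pipe)
{
        pipe->fb_vaddr = pipe->fb_backing;      /* models a one-time vmap */
}

/* Called when scanout stops: drop the mapping so the BO can be evicted. */
void model_pipe_disable(struct model_pipe *pipe)
{
        pipe->fb_vaddr = NULL;                  /* models the vunmap */
}

/* Dirty update while displayed: the copy hits the cached mapping directly. */
void model_pipe_update(struct model_pipe *pipe, const unsigned char *shadow,
                       size_t off, size_t len)
{
        memcpy(pipe->fb_vaddr + off, shadow + off, len);
}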
Best Regards, Rong Chen
Best regards Thomas
Thanks, Feng
Best regards Thomas
drm/fb-helper: Map DRM client buffer only when required

This patch changes DRM clients to not map the buffer by default. The buffer, like any buffer object, should be mapped and unmapped when needed.

An unmapped buffer object can be evicted to system memory and does not consume video ram until displayed. This allows to use generic fbdev emulation with drivers for low-memory devices, such as ast and mgag200.

This change affects the generic framebuffer console. HW-based consoles map their console buffer once and keep it mapped. Userspace can mmap this buffer into its address space. The shadow-buffered framebuffer console only needs the buffer object to be mapped during updates. While not being updated from the shadow buffer, the buffer object can remain unmapped. Userspace will always mmap the shadow buffer.
This may add more load when fbcon is busy printing out messages.
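To make the suspected hot path concrete, here is a minimal sketch of the map/copy/unmap sequence described above. All demo_* names are hypothetical stand-ins, not the actual drm_fb_helper or VRAM-helper code; the point is only the shape of what now happens on every console update:

#include <linux/kernel.h>
#include <linux/workqueue.h>

struct demo_bo;                                  /* hypothetical buffer object */
struct demo_rect { int x1, y1, x2, y2; };        /* hypothetical damage rectangle */

void *demo_bo_vmap(struct demo_bo *bo);          /* hypothetical: map BO into kernel space */
void demo_bo_vunmap(struct demo_bo *bo);         /* hypothetical: drop that mapping */
void demo_copy_damage(void *dst, const void *shadow, const struct demo_rect *clip);

struct demo_fbdev {
	struct work_struct dirty_work;
	struct demo_bo *buffer;                  /* BO that the hardware scans out */
	void *shadow;                            /* shadow framebuffer in system memory */
	struct demo_rect damage;                 /* area touched since the last update */
};

static void demo_fbdev_dirty_work(struct work_struct *work)
{
	struct demo_fbdev *fbdev = container_of(work, struct demo_fbdev, dirty_work);
	void *vaddr;

	vaddr = demo_bo_vmap(fbdev->buffer);     /* mapping set up for this update ... */
	if (!vaddr)
		return;

	/* copy only the damaged lines from the shadow FB into the BO */
	demo_copy_damage(vaddr, fbdev->shadow, &fbdev->damage);

	demo_bo_vunmap(fbdev->buffer);           /* ... and torn down again right away */
}

Since ast and mgag200 keep the displayed buffer in a few MiB of VRAM, the mapping step is comparatively expensive, which is what the thread suspects is behind the regression.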
We are doing more tests inside 0day to confirm.
Thanks, Feng
Hi

On 02.08.19 at 09:11, Rong Chen wrote:
> The patch set improves the performance slightly, but the change is not very obvious.
The regression goes from -18% to -17%, if I understand this correctly. This is strange, because the patch set restores the way that the original code worked. The heavy map/unmap calls in the fbdev code are gone. Performance should have been back to normal.
I'd like to prepare a patch set for entirely reverting all changes. Can I send it to you for testing?
Best regards Thomas
Hi

On 02.08.19 at 09:11, Rong Chen wrote:
> The patch set improves the performance slightly, but the change is not very obvious.
Thank you for testing.
There's another thing I'd like to ask: could you run the test without console output on drm-tip (i.e., disable it or pipe it into /dev/null)? I'd like to see how that impacts performance.
Best regards Thomas
Hi

On 01.08.19 at 08:19, Rong Chen wrote:
> Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
I have a patch set for fixing this problem, but I cannot reproduce the issue locally because my machine is not suitable for scalability testing.
If I send you the patches, could you run them on the machine to test whether they solve the problem?
Best regards Thomas
"Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
Best Regards, Rong Chen
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 2019-08-01 8:19 a.m., Rong Chen wrote:
> Sorry, I'm not familiar with DRM. We enabled the console to output logs; please find the log file attached.
>
> "Command line: ... console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw"
I assume the
user :notice: [ xxx.xxxx] xxxxxxxxx bytes / xxxxxxx usecs = xxxxx KB/s
lines are generated by the test?
If so, unless the test is intended to measure console performance, it should be fixed not to generate output to console (while it's measuring).
Hi

On 01.08.19 at 15:30, Michel Dänzer wrote:
> I assume the
>
>     user :notice: [ xxx.xxxx] xxxxxxxxx bytes / xxxxxxx usecs = xxxxx KB/s
>
> lines are generated by the test?
>
> If so, unless the test is intended to measure console performance, it should be fixed not to generate output to console (while it's measuring).
Yes, the test prints quite a lot of text to the console. It shouldn't do that.
Best regards Thomas
Hi
On 31.07.19 at 10:13, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 10:27 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 31 Jul 2019 at 05:00, Daniel Vetter daniel@ffwll.ch wrote:
On Tue, Jul 30, 2019 at 8:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
On 30.07.19 at 20:12, Daniel Vetter wrote:
On Tue, Jul 30, 2019 at 7:50 PM Thomas Zimmermann tzimmermann@suse.de wrote:
Daniel, Noralf, we may have to revert this patch.
I expected some change in display performance, but not in VM. Since it's a server chipset, probably no one cares much about display performance. So that seemed like a good trade-off for re-using shared code.
Part of the patch set is that the generic fb emulation now maps and unmaps the fbdev BO when updating the screen. I guess that's the cause of the performance regression. And it should be visible with other drivers as well if they use a shadow FB for fbdev emulation.
For fbcon we shouldn't need to do any maps/unmaps at all; this is for the fbdev mmap support only. If the testcase mentioned here tests fbdev mmap handling, it's pretty badly misnamed :-) And as long as you don't have an fbdev mmap, there shouldn't be any impact at all.
The ast and mgag200 have only a few MiB of VRAM, so we have to get the fbdev BO out if it's not being displayed. If not being mapped, it can be evicted and make room for X, etc.
To make this work, the BO's memory is mapped and unmapped in drm_fb_helper_dirty_work() before being updated from the shadow FB. [1] That fbdev mapping is established on each screen update, more or less. From my (yet unverified) understanding, this causes the performance regression in the VM code.
The original code in mgag200 used to kmap the fbdev BO while it's being displayed; [2] and the drawing code only mapped it when necessary (i.e., when it's not being displayed). [3]
Hm yeah, this vmap/vunmap is going to be pretty bad. We indeed should cache this.
I think this could be added for VRAM helpers as well, but it's still a workaround and non-VRAM drivers might also run into such a performance regression if they use the fbdev's shadow fb.
Yeah agreed, fbdev emulation should try to cache the vmap.
Noralf mentioned that there are plans for other DRM clients besides the console. They would as well run into similar problems.
The thing is that we'd need another generic fbdev emulation for ast and mgag200 that handles this issue properly.
Yeah, I don't think we want to jump the gun here. If you can try to repro locally and profile where we're wasting CPU time, I hope that should shed a light on what's going wrong here.
I don't have much time ATM and I'm not even officially at work until late Aug. I'd send you the revert and investigate later. I agree that using generic fbdev emulation would be preferable.
Still not sure that's the right thing to do, really. Yes, it's a regression, but VM testcases shouldn't run a single line of fbcon or DRM code. So why this is impacted so heavily by a silly DRM change is very confusing to me. We might be papering over a deeper and much more serious issue ...
It's a regression, the right thing is to revert first and then work out the right thing to do.
Sure, but I have no idea whether the testcase is doing something reasonable. If it's accidentally testing vm scalability of fbdev and there's no one else doing something this pointless, then it's not a real bug. Plus I think we're shooting the messenger here.
It's likely the test runs on the console and printfs stuff out while running.
But why did we not regress the world if a few prints on the console have such a huge impact? We didn't get an entire stream of mails about breaking stuff ...
The vmap/vunmap pair is only executed for fbdev emulation with a shadow FB. And most of those are with shmem helpers, which ref-count the vmap calls internally. My guess is that VRAM helpers are currently the only BOs triggering this problem.
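To illustrate what "ref-count the vmap calls internally" means, here is a rough sketch in the same spirit as the earlier one, again with hypothetical demo_* names rather than the real shmem- or VRAM-helper code: the first user pays for setting up the mapping, later users reuse it, and it is torn down only when the last user calls vunmap.

#include <linux/mutex.h>

struct demo_bo {
	struct mutex lock;
	void *vaddr;                 /* cached kernel mapping, NULL while unmapped */
	unsigned int vmap_use_count; /* number of active vmap users */
};

/* hypothetical backing-store hooks that actually create/destroy the mapping */
void *demo_hw_vmap(struct demo_bo *bo);
void demo_hw_vunmap(struct demo_bo *bo, void *vaddr);

void *demo_bo_vmap(struct demo_bo *bo)
{
	void *vaddr;

	mutex_lock(&bo->lock);
	if (bo->vmap_use_count == 0) {
		bo->vaddr = demo_hw_vmap(bo);    /* the expensive part runs only once */
		if (!bo->vaddr) {
			mutex_unlock(&bo->lock);
			return NULL;
		}
	}
	bo->vmap_use_count++;
	vaddr = bo->vaddr;
	mutex_unlock(&bo->lock);

	return vaddr;
}

void demo_bo_vunmap(struct demo_bo *bo)
{
	mutex_lock(&bo->lock);
	if (--bo->vmap_use_count == 0) {
		demo_hw_vunmap(bo, bo->vaddr);   /* tear down only for the last user */
		bo->vaddr = NULL;
	}
	mutex_unlock(&bo->lock);
}

The trade-off mentioned earlier in the thread still applies: while such a cached mapping exists, the BO effectively stays resident, which is why ast and mgag200 prefer to keep the fbdev buffer mapped only while it is actually being displayed.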
Best regards Thomas
On Wed, Jul 31, 2019 at 12:10:54PM +0200, Thomas Zimmermann wrote:
> The vmap/vunmap pair is only executed for fbdev emulation with a shadow FB. And most of those are with shmem helpers, which ref-count the vmap calls internally. My guess is that VRAM helpers are currently the only BOs triggering this problem.
I meant that surely this vm-scalability testcase isn't the only thing that's being run by 0day on a machine with mgag200. If a few printks to dmesg/console cause such a huge regression, I'd expect everything to regress on that box. But that seems not to be the case. -Daniel
Hi
On 02.08.19 at 11:11, Daniel Vetter wrote:
> I meant that surely this vm-scalability testcase isn't the only thing that's being run by 0day on a machine with mgag200. If a few printks to dmesg/console cause such a huge regression, I'd expect everything to regress on that box. But that seems not to be the case.
True. And according to Rong Chen's feedback, vmap and vunmap have only a small impact. The other difference is that there's now a shadow FB for the console, including the dirty worker with an additional memcpy. mgag200 used to update the console directly in VRAM.
I'd expect to see every driver with shadow-FB console to show bad performance, but that doesn't seem to be the case either.
Best regards Thomas
Hi
I did some further analysis on this problem and found that the blinking cursor affects performance of the vm-scalability test case.
I only have a 4-core machine, so scalability is not really testable. Yet I see the effects of running vm-scalability against drm-tip, a revert of the mgag200 patch, and the vmap fixes that I posted a few days ago.
After reverting the mgag200 patch, running the test as described in the report
bin/lkp run job.yaml
gives results like
2019-08-02 19:34:37 ./case-anon-cow-seq-hugetlb
2019-08-02 19:34:37 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815395225
917319627 bytes / 756534 usecs = 1184110 KB/s
917319627 bytes / 764675 usecs = 1171504 KB/s
917319627 bytes / 766414 usecs = 1168846 KB/s
917319627 bytes / 777990 usecs = 1151454 KB/s
Running the test against current drm-tip gives slightly worse results, such as:

2019-08-03 19:17:06 ./case-anon-cow-seq-hugetlb
2019-08-03 19:17:06 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 871607 usecs = 1027778 KB/s
917318700 bytes / 894173 usecs = 1001840 KB/s
917318700 bytes / 919694 usecs = 974040 KB/s
917318700 bytes / 923341 usecs = 970193 KB/s
The test puts out roughly one result per second. Strangely, sending the output to /dev/null can make the results significantly worse.
bin/lkp run job.yaml > /dev/null
2019-08-03 19:23:04 ./case-anon-cow-seq-hugetlb
2019-08-03 19:23:04 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 1207358 usecs = 741966 KB/s
917318700 bytes / 1210456 usecs = 740067 KB/s
917318700 bytes / 1216572 usecs = 736346 KB/s
917318700 bytes / 1239152 usecs = 722929 KB/s
I realized that there's still a blinking cursor on the screen, which I disabled with
tput civis
or alternatively
echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
Running the test now gives the original or even better results, such as:
bin/lkp run job.yaml > /dev/null
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb
2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 659419 usecs = 1358497 KB/s
917318700 bytes / 659658 usecs = 1358005 KB/s
917318700 bytes / 659916 usecs = 1357474 KB/s
917318700 bytes / 660168 usecs = 1356956 KB/s
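As a rough back-of-the-envelope comparison, taking the first reported rate of the reverted run, the plain drm-tip run, and this last run with cursor blinking disabled:

1027778 / 1184110 ≈ 0.87 (roughly -13% against the reverted kernel)
1358497 / 1184110 ≈ 1.15 (roughly +15% against the reverted kernel)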
Rong, Feng, could you confirm this by disabling the cursor or blinking?
The difference between mgag200's original fbdev support and generic fbdev emulation is generic fbdev's worker task that updates the VRAM buffer from the shadow buffer. mgag200 does this immediately, but relies on drm_can_sleep(), which is deprecated.
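A rough sketch of that difference, once more with hypothetical demo_* names rather than the actual driver code: the old driver-specific emulation drew straight into the mapped VRAM buffer, while the generic emulation draws into the shadow buffer, records the damage, and schedules a worker (like the one sketched earlier) to copy the damaged region into VRAM later.

#include <linux/workqueue.h>

struct demo_draw_op;                     /* hypothetical: one fbcon drawing operation */
struct demo_rect { int x1, y1, x2, y2; };

void demo_draw(void *dst_fb, const struct demo_draw_op *op);
void demo_merge_damage(struct demo_rect *damage, const struct demo_draw_op *op);

struct demo_console {
	void *vram_vaddr;                /* old world: fbdev BO kept mapped while displayed */
	void *shadow;                    /* new world: shadow FB in system memory */
	struct demo_rect damage;
	struct work_struct dirty_work;   /* worker that copies shadow -> VRAM */
};

/* a) old mgag200-style fbdev: update the displayed VRAM buffer immediately */
static void demo_console_draw_direct(struct demo_console *con,
				     const struct demo_draw_op *op)
{
	demo_draw(con->vram_vaddr, op);
}

/* b) generic emulation with shadow FB: cheap write to system memory now,
 *    deferred copy of the damaged region into VRAM from a worker later */
static void demo_console_draw_shadow(struct demo_console *con,
				     const struct demo_draw_op *op)
{
	demo_draw(con->shadow, op);
	demo_merge_damage(&con->damage, op);
	schedule_work(&con->dirty_work);
}

With a blinking cursor, the shadow-FB variant periodically redraws the cursor and schedules that worker even when the workload itself prints nothing, which fits the observation above that disabling cursor blinking brings the numbers back.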
I think that the worker task interferes with the test case. At the same time, the worker has been part of fbdev emulation for a long time, and no performance regressions have been reported so far.
So unless there's a report where this problem happens in a real-world use case, I'd like to keep the code as it is. And apparently there's always the workaround of disabling the cursor blinking.
Best regards Thomas
± 3% softirqs.CPU196.SCHED 13578 ± 3% +10.9% 15061 softirqs.CPU197.SCHED 13323 ± 5% +24.8% 16627 ± 6% softirqs.CPU198.SCHED 14072 ± 2% +12.3% 15798 ± 7% softirqs.CPU199.SCHED 12604 ± 13% +17.9% 14865 softirqs.CPU201.SCHED 13380 ± 4% +14.8% 15356 ± 3% softirqs.CPU203.SCHED 13481 ± 8% +14.2% 15390 ± 3% softirqs.CPU204.SCHED 12921 ± 2% +13.8% 14710 ± 3% softirqs.CPU206.SCHED 13468 +13.0% 15218 ± 2% softirqs.CPU208.SCHED 13253 ± 2% +13.1% 14992 softirqs.CPU209.SCHED 13319 ± 2% +14.3% 15225 ± 7% softirqs.CPU210.SCHED 13673 ± 5% +16.3% 15895 ± 3% softirqs.CPU211.SCHED 13290 +17.0% 15556 ± 5% softirqs.CPU212.SCHED 13455 ± 4% +14.4% 15392 ± 3% softirqs.CPU213.SCHED 13454 ± 4% +14.3% 15377 ± 3% softirqs.CPU215.SCHED 13872 ± 7% +9.7% 15221 ± 5% softirqs.CPU220.SCHED 13555 ± 4% +17.3% 15896 ± 5% softirqs.CPU222.SCHED 13411 ± 4% +20.8% 16197 ± 6% softirqs.CPU223.SCHED 8472 ± 21% -44.8% 4680 ± 3% softirqs.CPU224.RCU 13141 ± 3% +16.2% 15265 ± 7% softirqs.CPU225.SCHED 14084 ± 3% +8.2% 15242 ± 2% softirqs.CPU226.SCHED 13528 ± 4% +11.3% 15063 ± 4% softirqs.CPU228.SCHED 13218 ± 3% +16.3% 15377 ± 4% softirqs.CPU229.SCHED 14031 ± 4% +10.2% 15467 ± 2% softirqs.CPU231.SCHED 13770 ± 3% +14.0% 15700 ± 3% softirqs.CPU232.SCHED 13456 ± 3% +12.3% 15105 ± 3% softirqs.CPU233.SCHED 13137 ± 4% +13.5% 14909 ± 3% softirqs.CPU234.SCHED 13318 ± 2% +14.7% 15280 ± 2% softirqs.CPU235.SCHED 13690 ± 2% +13.7% 15563 ± 7% softirqs.CPU238.SCHED 13771 ± 5% +20.8% 16634 ± 7% softirqs.CPU241.SCHED 13317 ± 7% +19.5% 15919 ± 9% softirqs.CPU243.SCHED 8234 ± 16% -43.9% 4616 ± 5% softirqs.CPU244.RCU 13845 ± 6% +13.0% 15643 ± 3% softirqs.CPU244.SCHED 13179 ± 3% +16.3% 15323 softirqs.CPU246.SCHED 13754 +12.2% 15438 ± 3% softirqs.CPU248.SCHED 13769 ± 4% +10.9% 15276 ± 2% softirqs.CPU252.SCHED 13702 +10.5% 15147 ± 2% softirqs.CPU254.SCHED 13315 ± 2% +12.5% 14980 ± 3% softirqs.CPU255.SCHED 13785 ± 3% +12.9% 15568 ± 5% softirqs.CPU256.SCHED 13307 ± 3% +15.0% 15298 ± 3% softirqs.CPU257.SCHED 13864 ± 3% +10.5% 15313 ± 2% softirqs.CPU259.SCHED 13879 ± 2% +11.4% 15465 softirqs.CPU261.SCHED 13815 +13.6% 15687 ± 5% softirqs.CPU264.SCHED 119574 ± 2% +11.8% 133693 ± 11% softirqs.CPU266.TIMER 13688 +10.9% 15180 ± 6% softirqs.CPU267.SCHED 11716 ± 4% +19.3% 13974 ± 8% softirqs.CPU27.SCHED 13866 ± 3% +13.7% 15765 ± 4% softirqs.CPU271.SCHED 13887 ± 5% +12.5% 15621 softirqs.CPU272.SCHED 13383 ± 3% +19.8% 16031 ± 2% softirqs.CPU274.SCHED 13347 +14.1% 15232 ± 3% softirqs.CPU275.SCHED 12884 ± 2% +21.0% 15593 ± 4% softirqs.CPU276.SCHED 13131 ± 5% +13.4% 14891 ± 5% softirqs.CPU277.SCHED 12891 ± 2% +19.2% 15371 ± 4% softirqs.CPU278.SCHED 13313 ± 4% +13.0% 15049 ± 2% softirqs.CPU279.SCHED 13514 ± 3% +10.2% 14897 ± 2% softirqs.CPU280.SCHED 13501 ± 3% +13.7% 15346 softirqs.CPU281.SCHED 13261 +17.5% 15577 softirqs.CPU282.SCHED 8076 ± 15% -43.7% 4546 ± 5% softirqs.CPU283.RCU 13686 ± 3% +12.6% 15413 ± 2% softirqs.CPU284.SCHED 13439 ± 2% +9.2% 14670 ± 4% softirqs.CPU285.SCHED 8878 ± 9% -35.4% 5735 ± 4% softirqs.CPU35.RCU 11690 ± 2% +13.6% 13274 ± 5% softirqs.CPU40.SCHED 11714 ± 2% +19.3% 13975 ± 13% softirqs.CPU41.SCHED 11763 +12.5% 13239 ± 4% softirqs.CPU45.SCHED 11662 ± 2% +9.4% 12757 ± 3% softirqs.CPU46.SCHED 11805 ± 2% +9.3% 12902 ± 2% softirqs.CPU50.SCHED 12158 ± 3% +12.3% 13655 ± 8% softirqs.CPU55.SCHED 11716 ± 4% +8.8% 12751 ± 3% softirqs.CPU58.SCHED 11922 ± 2% +9.9% 13100 ± 4% softirqs.CPU64.SCHED 9674 ± 17% -41.8% 5625 ± 6% softirqs.CPU66.RCU 11818 +12.0% 13237 softirqs.CPU66.SCHED 124682 ± 7% -6.1% 117088 ± 5% softirqs.CPU66.TIMER 8637 ± 9% -34.0% 5700 ± 7% 
softirqs.CPU70.RCU 11624 ± 2% +11.0% 12901 ± 2% softirqs.CPU70.SCHED 12372 ± 2% +13.2% 14003 ± 3% softirqs.CPU71.SCHED 9949 ± 25% -33.9% 6574 ± 31% softirqs.CPU72.RCU 10392 ± 26% -35.1% 6745 ± 35% softirqs.CPU73.RCU 12766 ± 3% +11.1% 14188 ± 3% softirqs.CPU76.SCHED 12611 ± 2% +18.8% 14984 ± 5% softirqs.CPU78.SCHED 12786 ± 3% +17.9% 15079 ± 7% softirqs.CPU79.SCHED 11947 ± 4% +9.7% 13103 ± 4% softirqs.CPU8.SCHED 13379 ± 7% +11.8% 14962 ± 4% softirqs.CPU83.SCHED 13438 ± 5% +9.7% 14738 ± 2% softirqs.CPU84.SCHED 12768 +19.4% 15241 ± 6% softirqs.CPU88.SCHED 8604 ± 13% -39.3% 5222 ± 3% softirqs.CPU89.RCU 13077 ± 2% +17.1% 15308 ± 7% softirqs.CPU89.SCHED 11887 ± 3% +20.1% 14272 ± 5% softirqs.CPU9.SCHED 12723 ± 3% +11.3% 14165 ± 4% softirqs.CPU90.SCHED 8439 ± 12% -38.9% 5153 ± 4% softirqs.CPU91.RCU 13429 ± 3% +10.3% 14806 ± 2% softirqs.CPU95.SCHED 12852 ± 4% +10.3% 14174 ± 5% softirqs.CPU96.SCHED 13010 ± 2% +14.4% 14888 ± 5% softirqs.CPU97.SCHED 2315644 ± 4% -36.2% 1477200 ± 4% softirqs.RCU 1572 ± 10% +63.9% 2578 ± 39% interrupts.CPU0.NMI:Non-maskable_interrupts 1572 ± 10% +63.9% 2578 ± 39% interrupts.CPU0.PMI:Performance_monitoring_interrupts 252.00 ± 11% -35.2% 163.25 ± 13% interrupts.CPU104.RES:Rescheduling_interrupts 2738 ± 24% +52.4% 4173 ± 19% interrupts.CPU105.NMI:Non-maskable_interrupts 2738 ± 24% +52.4% 4173 ± 19% interrupts.CPU105.PMI:Performance_monitoring_interrupts 245.75 ± 19% -31.0% 169.50 ± 7% interrupts.CPU105.RES:Rescheduling_interrupts 228.75 ± 13% -24.7% 172.25 ± 19% interrupts.CPU106.RES:Rescheduling_interrupts 2243 ± 15% +66.3% 3730 ± 35% interrupts.CPU113.NMI:Non-maskable_interrupts 2243 ± 15% +66.3% 3730 ± 35% interrupts.CPU113.PMI:Performance_monitoring_interrupts 2703 ± 31% +67.0% 4514 ± 33% interrupts.CPU118.NMI:Non-maskable_interrupts 2703 ± 31% +67.0% 4514 ± 33% interrupts.CPU118.PMI:Performance_monitoring_interrupts 2613 ± 25% +42.2% 3715 ± 24% interrupts.CPU121.NMI:Non-maskable_interrupts 2613 ± 25% +42.2% 3715 ± 24% interrupts.CPU121.PMI:Performance_monitoring_interrupts 311.50 ± 23% -47.7% 163.00 ± 9% interrupts.CPU122.RES:Rescheduling_interrupts 266.75 ± 19% -31.6% 182.50 ± 15% interrupts.CPU124.RES:Rescheduling_interrupts 293.75 ± 33% -32.3% 198.75 ± 19% interrupts.CPU125.RES:Rescheduling_interrupts 2601 ± 36% +43.2% 3724 ± 29% interrupts.CPU127.NMI:Non-maskable_interrupts 2601 ± 36% +43.2% 3724 ± 29% interrupts.CPU127.PMI:Performance_monitoring_interrupts 2258 ± 21% +68.2% 3797 ± 29% interrupts.CPU13.NMI:Non-maskable_interrupts 2258 ± 21% +68.2% 3797 ± 29% interrupts.CPU13.PMI:Performance_monitoring_interrupts 3338 ± 29% +54.6% 5160 ± 9% interrupts.CPU139.NMI:Non-maskable_interrupts 3338 ± 29% +54.6% 5160 ± 9% interrupts.CPU139.PMI:Performance_monitoring_interrupts 219.50 ± 27% -23.0% 169.00 ± 21% interrupts.CPU139.RES:Rescheduling_interrupts 290.25 ± 25% -32.5% 196.00 ± 11% interrupts.CPU14.RES:Rescheduling_interrupts 243.50 ± 4% -16.0% 204.50 ± 12% interrupts.CPU140.RES:Rescheduling_interrupts 1797 ± 15% +135.0% 4223 ± 46% interrupts.CPU147.NMI:Non-maskable_interrupts 1797 ± 15% +135.0% 4223 ± 46% interrupts.CPU147.PMI:Performance_monitoring_interrupts 2537 ± 22% +89.6% 4812 ± 28% interrupts.CPU15.NMI:Non-maskable_interrupts 2537 ± 22% +89.6% 4812 ± 28% interrupts.CPU15.PMI:Performance_monitoring_interrupts 292.25 ± 34% -33.9% 193.25 ± 6% interrupts.CPU15.RES:Rescheduling_interrupts 424.25 ± 37% -58.5% 176.25 ± 14% interrupts.CPU158.RES:Rescheduling_interrupts 312.50 ± 42% -54.2% 143.00 ± 18% interrupts.CPU159.RES:Rescheduling_interrupts 725.00 ±118% -75.7% 
176.25 ± 14% interrupts.CPU163.RES:Rescheduling_interrupts 2367 ± 6% +59.9% 3786 ± 24% interrupts.CPU177.NMI:Non-maskable_interrupts 2367 ± 6% +59.9% 3786 ± 24% interrupts.CPU177.PMI:Performance_monitoring_interrupts 239.50 ± 30% -46.6% 128.00 ± 14% interrupts.CPU179.RES:Rescheduling_interrupts 320.75 ± 15% -24.0% 243.75 ± 20% interrupts.CPU20.RES:Rescheduling_interrupts 302.50 ± 17% -47.2% 159.75 ± 8% interrupts.CPU200.RES:Rescheduling_interrupts 2166 ± 5% +92.0% 4157 ± 40% interrupts.CPU207.NMI:Non-maskable_interrupts 2166 ± 5% +92.0% 4157 ± 40% interrupts.CPU207.PMI:Performance_monitoring_interrupts 217.00 ± 11% -34.6% 142.00 ± 12% interrupts.CPU214.RES:Rescheduling_interrupts 2610 ± 36% +47.4% 3848 ± 35% interrupts.CPU215.NMI:Non-maskable_interrupts 2610 ± 36% +47.4% 3848 ± 35% interrupts.CPU215.PMI:Performance_monitoring_interrupts 2046 ± 13% +118.6% 4475 ± 43% interrupts.CPU22.NMI:Non-maskable_interrupts 2046 ± 13% +118.6% 4475 ± 43% interrupts.CPU22.PMI:Performance_monitoring_interrupts 289.50 ± 28% -41.1% 170.50 ± 8% interrupts.CPU22.RES:Rescheduling_interrupts 2232 ± 6% +33.0% 2970 ± 24% interrupts.CPU221.NMI:Non-maskable_interrupts 2232 ± 6% +33.0% 2970 ± 24% interrupts.CPU221.PMI:Performance_monitoring_interrupts 4552 ± 12% -27.6% 3295 ± 15% interrupts.CPU222.NMI:Non-maskable_interrupts 4552 ± 12% -27.6% 3295 ± 15% interrupts.CPU222.PMI:Performance_monitoring_interrupts 2013 ± 15% +80.9% 3641 ± 27% interrupts.CPU226.NMI:Non-maskable_interrupts 2013 ± 15% +80.9% 3641 ± 27% interrupts.CPU226.PMI:Performance_monitoring_interrupts 2575 ± 49% +67.1% 4302 ± 34% interrupts.CPU227.NMI:Non-maskable_interrupts 2575 ± 49% +67.1% 4302 ± 34% interrupts.CPU227.PMI:Performance_monitoring_interrupts 248.00 ± 36% -36.3% 158.00 ± 19% interrupts.CPU228.RES:Rescheduling_interrupts 2441 ± 24% +43.0% 3490 ± 30% interrupts.CPU23.NMI:Non-maskable_interrupts 2441 ± 24% +43.0% 3490 ± 30% interrupts.CPU23.PMI:Performance_monitoring_interrupts 404.25 ± 69% -65.5% 139.50 ± 17% interrupts.CPU236.RES:Rescheduling_interrupts 566.50 ± 40% -73.6% 149.50 ± 31% interrupts.CPU237.RES:Rescheduling_interrupts 243.50 ± 26% -37.1% 153.25 ± 21% interrupts.CPU248.RES:Rescheduling_interrupts 258.25 ± 12% -53.5% 120.00 ± 18% interrupts.CPU249.RES:Rescheduling_interrupts 2888 ± 27% +49.4% 4313 ± 30% interrupts.CPU253.NMI:Non-maskable_interrupts 2888 ± 27% +49.4% 4313 ± 30% interrupts.CPU253.PMI:Performance_monitoring_interrupts 2468 ± 44% +67.3% 4131 ± 37% interrupts.CPU256.NMI:Non-maskable_interrupts 2468 ± 44% +67.3% 4131 ± 37% interrupts.CPU256.PMI:Performance_monitoring_interrupts 425.00 ± 59% -60.3% 168.75 ± 34% interrupts.CPU258.RES:Rescheduling_interrupts 1859 ± 16% +106.3% 3834 ± 44% interrupts.CPU268.NMI:Non-maskable_interrupts 1859 ± 16% +106.3% 3834 ± 44% interrupts.CPU268.PMI:Performance_monitoring_interrupts 2684 ± 28% +61.2% 4326 ± 36% interrupts.CPU269.NMI:Non-maskable_interrupts 2684 ± 28% +61.2% 4326 ± 36% interrupts.CPU269.PMI:Performance_monitoring_interrupts 2171 ± 6% +108.8% 4533 ± 20% interrupts.CPU270.NMI:Non-maskable_interrupts 2171 ± 6% +108.8% 4533 ± 20% interrupts.CPU270.PMI:Performance_monitoring_interrupts 2262 ± 14% +61.8% 3659 ± 37% interrupts.CPU273.NMI:Non-maskable_interrupts 2262 ± 14% +61.8% 3659 ± 37% interrupts.CPU273.PMI:Performance_monitoring_interrupts 2203 ± 11% +50.7% 3320 ± 38% interrupts.CPU279.NMI:Non-maskable_interrupts 2203 ± 11% +50.7% 3320 ± 38% interrupts.CPU279.PMI:Performance_monitoring_interrupts 2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.NMI:Non-maskable_interrupts 
2433 ± 17% +52.9% 3721 ± 25% interrupts.CPU280.PMI:Performance_monitoring_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.NMI:Non-maskable_interrupts 2778 ± 33% +63.1% 4531 ± 36% interrupts.CPU283.PMI:Performance_monitoring_interrupts 331.75 ± 32% -39.8% 199.75 ± 17% interrupts.CPU29.RES:Rescheduling_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.NMI:Non-maskable_interrupts 2178 ± 22% +53.9% 3353 ± 31% interrupts.CPU3.PMI:Performance_monitoring_interrupts 298.50 ± 30% -39.7% 180.00 ± 6% interrupts.CPU34.RES:Rescheduling_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.NMI:Non-maskable_interrupts 2490 ± 3% +58.7% 3953 ± 28% interrupts.CPU35.PMI:Performance_monitoring_interrupts 270.50 ± 24% -31.1% 186.25 ± 3% interrupts.CPU36.RES:Rescheduling_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.NMI:Non-maskable_interrupts 2493 ± 7% +57.0% 3915 ± 27% interrupts.CPU43.PMI:Performance_monitoring_interrupts 286.75 ± 36% -32.4% 193.75 ± 7% interrupts.CPU45.RES:Rescheduling_interrupts 259.00 ± 12% -23.6% 197.75 ± 13% interrupts.CPU46.RES:Rescheduling_interrupts 244.00 ± 21% -35.6% 157.25 ± 11% interrupts.CPU47.RES:Rescheduling_interrupts 230.00 ± 7% -21.3% 181.00 ± 11% interrupts.CPU48.RES:Rescheduling_interrupts 281.00 ± 13% -27.4% 204.00 ± 15% interrupts.CPU53.RES:Rescheduling_interrupts 256.75 ± 5% -18.4% 209.50 ± 12% interrupts.CPU54.RES:Rescheduling_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.NMI:Non-maskable_interrupts 2433 ± 9% +68.4% 4098 ± 35% interrupts.CPU58.PMI:Performance_monitoring_interrupts 316.00 ± 25% -41.4% 185.25 ± 13% interrupts.CPU59.RES:Rescheduling_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.NMI:Non-maskable_interrupts 2703 ± 38% +56.0% 4217 ± 31% interrupts.CPU60.PMI:Performance_monitoring_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.NMI:Non-maskable_interrupts 2425 ± 16% +39.9% 3394 ± 27% interrupts.CPU61.PMI:Performance_monitoring_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.NMI:Non-maskable_interrupts 2388 ± 18% +69.5% 4047 ± 29% interrupts.CPU66.PMI:Performance_monitoring_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.NMI:Non-maskable_interrupts 2322 ± 11% +93.4% 4491 ± 35% interrupts.CPU67.PMI:Performance_monitoring_interrupts 319.00 ± 40% -44.7% 176.25 ± 9% interrupts.CPU67.RES:Rescheduling_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.NMI:Non-maskable_interrupts 2512 ± 8% +28.1% 3219 ± 25% interrupts.CPU70.PMI:Performance_monitoring_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.NMI:Non-maskable_interrupts 2290 ± 39% +78.7% 4094 ± 28% interrupts.CPU74.PMI:Performance_monitoring_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.NMI:Non-maskable_interrupts 2446 ± 40% +94.8% 4764 ± 23% interrupts.CPU75.PMI:Performance_monitoring_interrupts 426.75 ± 61% -67.7% 138.00 ± 8% interrupts.CPU75.RES:Rescheduling_interrupts 192.50 ± 13% +45.6% 280.25 ± 45% interrupts.CPU76.RES:Rescheduling_interrupts 274.25 ± 34% -42.2% 158.50 ± 34% interrupts.CPU77.RES:Rescheduling_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.NMI:Non-maskable_interrupts 2357 ± 9% +73.0% 4078 ± 23% interrupts.CPU78.PMI:Performance_monitoring_interrupts 348.50 ± 53% -47.3% 183.75 ± 29% interrupts.CPU80.RES:Rescheduling_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.NMI:Non-maskable_interrupts 2650 ± 43% +46.2% 3874 ± 36% interrupts.CPU84.PMI:Performance_monitoring_interrupts 2235 ± 10% +117.8% 4867 ± 10% interrupts.CPU90.NMI:Non-maskable_interrupts 2235 ± 10% +117.8% 4867 ± 10% 
interrupts.CPU90.PMI:Performance_monitoring_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.NMI:Non-maskable_interrupts 2606 ± 33% +38.1% 3598 ± 21% interrupts.CPU92.PMI:Performance_monitoring_interrupts 408.75 ± 58% -56.8% 176.75 ± 25% interrupts.CPU92.RES:Rescheduling_interrupts 399.00 ± 64% -63.6% 145.25 ± 16% interrupts.CPU93.RES:Rescheduling_interrupts 314.75 ± 36% -44.2% 175.75 ± 13% interrupts.CPU94.RES:Rescheduling_interrupts 191.00 ± 15% -29.1% 135.50 ± 9% interrupts.CPU97.RES:Rescheduling_interrupts 94.00 ± 8% +50.0% 141.00 ± 12% interrupts.IWI:IRQ_work_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.NMI:Non-maskable_interrupts 841457 ± 7% +16.6% 980751 ± 3% interrupts.PMI:Performance_monitoring_interrupts 12.75 ± 11% -4.1 8.67 ± 31% perf-profile.calltrace.cycles-pp.do_rw_once 1.02 ± 16% -0.6 0.47 ± 59% perf-profile.calltrace.cycles-pp.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle 1.10 ± 15% -0.4 0.66 ± 14% perf-profile.calltrace.cycles-pp.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry 1.05 ± 16% -0.4 0.61 ± 14% perf-profile.calltrace.cycles-pp.native_sched_clock.sched_clock.sched_clock_cpu.cpuidle_enter_state.cpuidle_enter 1.58 ± 4% +0.3 1.91 ± 7% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.calltrace.cycles-pp.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.11 ± 4% +0.5 2.60 ± 7% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.osq_lock.__mutex_lock.hugetlb_fault.handle_mm_fault 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe 0.83 ± 26% +0.5 1.32 ± 18% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe 1.90 ± 5% +0.6 2.45 ± 7% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage 0.65 ± 62% +0.6 1.20 ± 15% perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.60 ± 62% +0.6 1.16 ± 18% perf-profile.calltrace.cycles-pp.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap 0.95 ± 17% +0.6 1.52 ± 8% perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner 0.61 ± 62% +0.6 1.18 ± 18% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.exit_mmap.mmput.do_exit.do_group_exit 0.61 ± 62% +0.6 1.19 ± 19% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.exit_mmap.mmput.do_exit 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.mmput.do_exit.do_group_exit.__x64_sys_exit_group.do_syscall_64 0.64 ± 61% +0.6 1.23 ± 18% perf-profile.calltrace.cycles-pp.exit_mmap.mmput.do_exit.do_group_exit.__x64_sys_exit_group 1.30 ± 9% +0.6 1.92 ± 8% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock 0.19 ±173% +0.7 0.89 ± 20% 
perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu 0.19 ±173% +0.7 0.90 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.free_huge_page.release_pages.tlb_flush_mmu.tlb_finish_mmu 0.00 +0.8 0.77 ± 30% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page 0.00 +0.8 0.78 ± 30% perf-profile.calltrace.cycles-pp._raw_spin_lock.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page 0.00 +0.8 0.79 ± 29% perf-profile.calltrace.cycles-pp.prep_new_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 0.82 ± 67% +0.9 1.72 ± 22% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault 0.84 ± 66% +0.9 1.74 ± 20% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow 2.52 ± 6% +0.9 3.44 ± 9% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page 0.83 ± 67% +0.9 1.75 ± 21% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 0.84 ± 66% +0.9 1.77 ± 20% perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault 1.64 ± 12% +1.0 2.67 ± 7% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault 1.65 ± 45% +1.3 2.99 ± 18% perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault 1.74 ± 13% +1.4 3.16 ± 6% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault 2.56 ± 48% +2.2 4.81 ± 19% perf-profile.calltrace.cycles-pp.alloc_huge_page.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault 12.64 ± 14% +3.6 16.20 ± 8% perf-profile.calltrace.cycles-pp.mutex_spin_on_owner.__mutex_lock.hugetlb_fault.handle_mm_fault.__do_page_fault 2.97 ± 7% +3.8 6.74 ± 9% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.copy_page.copy_subpage.copy_user_huge_page.hugetlb_cow 19.99 ± 9% +4.1 24.05 ± 6% perf-profile.calltrace.cycles-pp.hugetlb_cow.hugetlb_fault.handle_mm_fault.__do_page_fault.do_page_fault 1.37 ± 15% -0.5 0.83 ± 13% perf-profile.children.cycles-pp.sched_clock_cpu 1.31 ± 16% -0.5 0.78 ± 13% perf-profile.children.cycles-pp.sched_clock 1.29 ± 16% -0.5 0.77 ± 13% perf-profile.children.cycles-pp.native_sched_clock 1.80 ± 2% -0.3 1.47 ± 10% perf-profile.children.cycles-pp.task_tick_fair 0.73 ± 2% -0.2 0.54 ± 11% perf-profile.children.cycles-pp.update_curr 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.children.cycles-pp.account_process_tick 0.73 ± 10% -0.2 0.58 ± 9% perf-profile.children.cycles-pp.rcu_sched_clock_irq 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.children.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.children.cycles-pp.rcu_segcblist_ready_cbs 0.40 ± 12% -0.1 0.30 ± 14% perf-profile.children.cycles-pp.__next_timer_interrupt 0.47 ± 7% -0.1 0.39 ± 13% perf-profile.children.cycles-pp.update_rq_clock 0.29 ± 12% -0.1 0.21 ± 15% perf-profile.children.cycles-pp.cpuidle_governor_latency_req 0.21 ± 7% -0.1 0.14 ± 12% perf-profile.children.cycles-pp.account_system_index_time 0.38 ± 2% -0.1 0.31 ± 12% perf-profile.children.cycles-pp.timerqueue_add 0.26 ± 11% -0.1 0.20 ± 13% 
perf-profile.children.cycles-pp.find_next_bit 0.23 ± 15% -0.1 0.17 ± 15% perf-profile.children.cycles-pp.rcu_dynticks_eqs_exit 0.14 ± 8% -0.1 0.07 ± 14% perf-profile.children.cycles-pp.account_user_time 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.children.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.irq_work_tick 0.11 ± 13% -0.0 0.07 ± 25% perf-profile.children.cycles-pp.tick_sched_do_timer 0.12 ± 10% -0.0 0.08 ± 15% perf-profile.children.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.children.cycles-pp.raise_softirq 0.12 ± 3% -0.0 0.09 ± 8% perf-profile.children.cycles-pp.write 0.11 ± 13% +0.0 0.14 ± 8% perf-profile.children.cycles-pp.native_write_msr 0.09 ± 9% +0.0 0.11 ± 7% perf-profile.children.cycles-pp.finish_task_switch 0.10 ± 10% +0.0 0.13 ± 5% perf-profile.children.cycles-pp.schedule_idle 0.07 ± 6% +0.0 0.10 ± 12% perf-profile.children.cycles-pp.__read_nocancel 0.04 ± 58% +0.0 0.07 ± 15% perf-profile.children.cycles-pp.__free_pages_ok 0.06 ± 7% +0.0 0.09 ± 13% perf-profile.children.cycles-pp.perf_read 0.07 +0.0 0.11 ± 14% perf-profile.children.cycles-pp.perf_evsel__read_counter 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.cmd_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.__run_perf_stat 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.process_interval 0.07 +0.0 0.11 ± 13% perf-profile.children.cycles-pp.read_counters 0.07 ± 22% +0.0 0.11 ± 19% perf-profile.children.cycles-pp.__handle_mm_fault 0.07 ± 19% +0.1 0.13 ± 8% perf-profile.children.cycles-pp.rb_erase 0.03 ±100% +0.1 0.09 ± 9% perf-profile.children.cycles-pp.smp_call_function_single 0.01 ±173% +0.1 0.08 ± 11% perf-profile.children.cycles-pp.perf_event_read 0.00 +0.1 0.07 ± 13% perf-profile.children.cycles-pp.__perf_event_read_value 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.__intel_pmu_enable_all 0.08 ± 17% +0.1 0.15 ± 8% perf-profile.children.cycles-pp.native_apic_msr_eoi_write 0.04 ±103% +0.1 0.13 ± 58% perf-profile.children.cycles-pp.shmem_getpage_gfp 0.38 ± 14% +0.1 0.51 ± 6% perf-profile.children.cycles-pp.run_timer_softirq 0.11 ± 4% +0.3 0.37 ± 32% perf-profile.children.cycles-pp.worker_thread 0.20 ± 5% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.ret_from_fork 0.20 ± 4% +0.3 0.48 ± 25% perf-profile.children.cycles-pp.kthread 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.memcpy_erms 0.00 +0.3 0.29 ± 38% perf-profile.children.cycles-pp.drm_fb_helper_dirty_work 0.00 +0.3 0.31 ± 37% perf-profile.children.cycles-pp.process_one_work 0.47 ± 48% +0.4 0.91 ± 19% perf-profile.children.cycles-pp.prep_new_huge_page 0.70 ± 29% +0.5 1.16 ± 18% perf-profile.children.cycles-pp.free_huge_page 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_flush_mmu 0.72 ± 29% +0.5 1.18 ± 18% perf-profile.children.cycles-pp.release_pages 0.73 ± 29% +0.5 1.19 ± 18% perf-profile.children.cycles-pp.tlb_finish_mmu 0.76 ± 27% +0.5 1.23 ± 18% perf-profile.children.cycles-pp.exit_mmap 0.77 ± 27% +0.5 1.24 ± 18% perf-profile.children.cycles-pp.mmput 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.__x64_sys_exit_group 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_group_exit 0.79 ± 26% +0.5 1.27 ± 18% perf-profile.children.cycles-pp.do_exit 1.28 ± 29% +0.5 1.76 ± 9% perf-profile.children.cycles-pp.perf_mux_hrtimer_handler 0.77 ± 28% +0.5 1.26 ± 13% perf-profile.children.cycles-pp.alloc_fresh_huge_page 1.53 ± 15% +0.7 2.26 ± 14% perf-profile.children.cycles-pp.do_syscall_64 1.53 ± 15% +0.7 2.27 ± 14% 
perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe 1.13 ± 3% +0.9 2.07 ± 14% perf-profile.children.cycles-pp.interrupt_entry 0.79 ± 9% +1.0 1.76 ± 5% perf-profile.children.cycles-pp.perf_event_task_tick 1.71 ± 39% +1.4 3.08 ± 16% perf-profile.children.cycles-pp.alloc_surplus_huge_page 2.66 ± 42% +2.3 4.94 ± 17% perf-profile.children.cycles-pp.alloc_huge_page 2.89 ± 45% +2.7 5.54 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 3.34 ± 35% +2.7 6.02 ± 17% perf-profile.children.cycles-pp._raw_spin_lock 12.77 ± 14% +3.9 16.63 ± 7% perf-profile.children.cycles-pp.mutex_spin_on_owner 20.12 ± 9% +4.0 24.16 ± 6% perf-profile.children.cycles-pp.hugetlb_cow 15.40 ± 10% -3.6 11.84 ± 28% perf-profile.self.cycles-pp.do_rw_once 4.02 ± 9% -1.3 2.73 ± 30% perf-profile.self.cycles-pp.do_access 2.00 ± 14% -0.6 1.41 ± 13% perf-profile.self.cycles-pp.cpuidle_enter_state 1.26 ± 16% -0.5 0.74 ± 13% perf-profile.self.cycles-pp.native_sched_clock 0.42 ± 17% -0.2 0.27 ± 16% perf-profile.self.cycles-pp.account_process_tick 0.27 ± 19% -0.2 0.12 ± 17% perf-profile.self.cycles-pp.timerqueue_del 0.53 ± 3% -0.1 0.38 ± 11% perf-profile.self.cycles-pp.update_curr 0.27 ± 6% -0.1 0.14 ± 14% perf-profile.self.cycles-pp.__acct_update_integrals 0.27 ± 18% -0.1 0.16 ± 13% perf-profile.self.cycles-pp.rcu_segcblist_ready_cbs 0.61 ± 4% -0.1 0.51 ± 8% perf-profile.self.cycles-pp.task_tick_fair 0.20 ± 8% -0.1 0.12 ± 14% perf-profile.self.cycles-pp.account_system_index_time 0.23 ± 15% -0.1 0.16 ± 17% perf-profile.self.cycles-pp.rcu_dynticks_eqs_exit 0.25 ± 11% -0.1 0.18 ± 14% perf-profile.self.cycles-pp.find_next_bit 0.10 ± 11% -0.1 0.03 ±100% perf-profile.self.cycles-pp.tick_sched_do_timer 0.29 -0.1 0.23 ± 11% perf-profile.self.cycles-pp.timerqueue_add 0.12 ± 10% -0.1 0.06 ± 17% perf-profile.self.cycles-pp.account_user_time 0.22 ± 15% -0.1 0.16 ± 6% perf-profile.self.cycles-pp.scheduler_tick 0.17 ± 6% -0.0 0.12 ± 10% perf-profile.self.cycles-pp.cpuacct_charge 0.18 ± 20% -0.0 0.13 ± 3% perf-profile.self.cycles-pp.irq_work_tick 0.07 ± 13% -0.0 0.03 ±100% perf-profile.self.cycles-pp.update_process_times 0.12 ± 7% -0.0 0.08 ± 15% perf-profile.self.cycles-pp.get_cpu_device 0.07 ± 11% -0.0 0.04 ± 58% perf-profile.self.cycles-pp.raise_softirq 0.12 ± 11% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.tick_nohz_get_sleep_length 0.11 ± 11% +0.0 0.14 ± 6% perf-profile.self.cycles-pp.native_write_msr 0.10 ± 5% +0.1 0.15 ± 8% perf-profile.self.cycles-pp.__remove_hrtimer 0.07 ± 23% +0.1 0.13 ± 8% perf-profile.self.cycles-pp.rb_erase 0.08 ± 17% +0.1 0.15 ± 7% perf-profile.self.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 0.08 ± 10% perf-profile.self.cycles-pp.smp_call_function_single 0.32 ± 17% +0.1 0.42 ± 7% perf-profile.self.cycles-pp.run_timer_softirq 0.22 ± 5% +0.1 0.34 ± 4% perf-profile.self.cycles-pp.ktime_get_update_offsets_now 0.45 ± 15% +0.2 0.60 ± 12% perf-profile.self.cycles-pp.rcu_irq_enter 0.31 ± 8% +0.2 0.46 ± 16% perf-profile.self.cycles-pp.irq_enter 0.29 ± 10% +0.2 0.44 ± 16% perf-profile.self.cycles-pp.apic_timer_interrupt 0.71 ± 30% +0.2 0.92 ± 8% perf-profile.self.cycles-pp.perf_mux_hrtimer_handler 0.00 +0.3 0.28 ± 37% perf-profile.self.cycles-pp.memcpy_erms 1.12 ± 3% +0.9 2.02 ± 15% perf-profile.self.cycles-pp.interrupt_entry 0.79 ± 9% +0.9 1.73 ± 5% perf-profile.self.cycles-pp.perf_event_task_tick 2.49 ± 45% +2.1 4.55 ± 20% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 10.95 ± 15% +2.7 13.61 ± 8% perf-profile.self.cycles-pp.mutex_spin_on_owner
vm-scalability.throughput
[trend plot of samples per run]

vm-scalability.time.minor_page_faults
[trend plot of samples per run]

vm-scalability.workload
[trend plot of samples per run]

[*] bisect-good sample [O] bisect-bad sample
Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
Thanks, Rong Chen
Hi Thomas,
On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
Hi
I did some further analysis on this problem and found that the blinking cursor affects performance of the vm-scalability test case.
I only have a 4-core machine, so scalability is not really testable. Yet I can see the effects of running vm-scalability against drm-tip, against a revert of the mgag200 patch, and against the vmap fixes that I posted a few days ago.
After reverting the mgag200 patch, running the test as described in the report
bin/lkp run job.yaml
gives results like
2019-08-02 19:34:37 ./case-anon-cow-seq-hugetlb
2019-08-02 19:34:37 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815395225
917319627 bytes / 756534 usecs = 1184110 KB/s
917319627 bytes / 764675 usecs = 1171504 KB/s
917319627 bytes / 766414 usecs = 1168846 KB/s
917319627 bytes / 777990 usecs = 1151454 KB/s
Running the test against current drm-tip gives slightly worse results, such as:
2019-08-03 19:17:06 ./case-anon-cow-seq-hugetlb
2019-08-03 19:17:06 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 871607 usecs = 1027778 KB/s
917318700 bytes / 894173 usecs = 1001840 KB/s
917318700 bytes / 919694 usecs = 974040 KB/s
917318700 bytes / 923341 usecs = 970193 KB/s
The test puts out roughly one result per second. Strangely, sending the output to /dev/null can make the results significantly worse.
bin/lkp run job.yaml > /dev/null
2019-08-03 19:23:04 ./case-anon-cow-seq-hugetlb
2019-08-03 19:23:04 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 1207358 usecs = 741966 KB/s
917318700 bytes / 1210456 usecs = 740067 KB/s
917318700 bytes / 1216572 usecs = 736346 KB/s
917318700 bytes / 1239152 usecs = 722929 KB/s
I realized that there's still a blinking cursor on the screen, which I disabled with
tput civis
or alternatively
echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
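The same sysfs knob can also be flipped programmatically if that is more convenient in a test harness. A minimal sketch in C; the path is the one quoted above, everything else (and the lack of real error handling) is purely illustrative:

/* Disable fbcon cursor blinking by writing '0' to the sysfs attribute
 * mentioned above. Requires permission to write that file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/devices/virtual/graphics/fbcon/cursor_blink";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "0", 1) != 1) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}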
Running the test now gives the original or even better results, such as:
bin/lkp run job.yaml > /dev/null
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb
2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 659419 usecs = 1358497 KB/s
917318700 bytes / 659658 usecs = 1358005 KB/s
917318700 bytes / 659916 usecs = 1357474 KB/s
917318700 bytes / 660168 usecs = 1356956 KB/s
Rong, Feng, could you confirm this by disabling the cursor or blinking?
Glad to know this method restored the drop. Rong is running the case.
Meanwhile, I have another finding: I noticed your patch changed the bpp from 24 to 32, so I wrote a patch to change it back to 24 and ran the case over the weekend; the -18% regression was reduced to about -5%. Could this be related?
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
  01e75fea0d5 mgag200: restore the depth back to 24
f1f8555dfb9a70a2  90f479ae51afa45efab97afdde9  01e75fea0d5ff39d3e588c20ec5
----------------  ---------------------------  ---------------------------
     43921 ± 2%      -18.3%      35884             -4.8%      41826        vm-scalability.median
  14889337           -17.5%   12291029             -4.1%   14278574        vm-scalability.throughput
commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Aug 2 15:09:19 2019 +0800
mgag200: restore the depth back to 24
Signed-off-by: Feng Tang <feng.tang@intel.com>
diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
index a977333..ac8f6c9 100644
--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
 		dev->mode_config.preferred_depth = 16;
 	else
-		dev->mode_config.preferred_depth = 32;
+		dev->mode_config.preferred_depth = 24;
 	dev->mode_config.prefer_shadow = 1;

 	r = mgag200_modeset_init(mdev);
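For a rough sense of why the depth change could matter: the amount of data copied into VRAM per console update scales with the bytes per pixel of the framebuffer format. The back-of-the-envelope calculation below assumes a hypothetical 1024x768 console and a packed 3-byte-per-pixel layout for depth 24; neither assumption comes from the report, it is only meant to show the order of magnitude:

/* Back-of-the-envelope comparison of per-update copy volume for a
 * hypothetical 1024x768 console at 24 vs. 32 bits per pixel.
 * Illustrative only; the real format and dirty-region size depend on
 * the driver and the mode in use. */
#include <stdio.h>

int main(void)
{
	const unsigned long width = 1024, height = 768;
	const unsigned long bytes_24bpp = width * height * 3;
	const unsigned long bytes_32bpp = width * height * 4;

	printf("24 bpp: %lu bytes per full-screen update\n", bytes_24bpp);
	printf("32 bpp: %lu bytes per full-screen update\n", bytes_32bpp);
	printf("increase: %.1f%%\n",
	       100.0 * (bytes_32bpp - bytes_24bpp) / bytes_24bpp);
	return 0;
}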
Thanks, Feng
The difference between mgag200's original fbdev support and the generic fbdev emulation is the generic code's worker task, which updates the VRAM buffer from the shadow buffer. mgag200's original code did this immediately, but relied on drm_can_sleep(), which is deprecated.
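To make the difference concrete, here is a minimal sketch of the two update strategies, assuming the copy goes from a system-memory shadow buffer into mapped VRAM. The struct and function names (fake_fbdev, fake_dirty_worker, and so on) are made up for illustration; this is not the actual DRM fbdev helper code:

#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

/* Hypothetical per-device state; not the real drm_fb_helper structures. */
struct fake_fbdev {
	void *shadow;              /* system-memory shadow framebuffer */
	void __iomem *vram;        /* mapped device framebuffer */
	size_t size;               /* bytes to copy per update */
	struct work_struct dirty_work;
};

/* Generic-emulation style: console writes only touch the shadow buffer
 * and schedule a worker; the copy to VRAM runs later in process context. */
static void fake_dirty_worker(struct work_struct *work)
{
	struct fake_fbdev *fbdev =
		container_of(work, struct fake_fbdev, dirty_work);

	memcpy_toio(fbdev->vram, fbdev->shadow, fbdev->size);
}

static void deferred_update(struct fake_fbdev *fbdev)
{
	/* INIT_WORK(&fbdev->dirty_work, fake_dirty_worker) is assumed to
	 * have been done at setup time. */
	schedule_work(&fbdev->dirty_work);
}

/* Old mgag200-style behaviour, roughly: copy right away from the console
 * write path (the original code gated this on drm_can_sleep()). */
static void immediate_update(struct fake_fbdev *fbdev)
{
	memcpy_toio(fbdev->vram, fbdev->shadow, fbdev->size);
}

The only point of the sketch is the scheduling difference: with the deferred variant, every console update, including a cursor blink, also schedules a worker, which may run on another CPU while the benchmark is running.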
I think that the worker task interferes with the test case; at the same time, the worker has been part of fbdev emulation since forever and no performance regressions have been reported so far.
So unless there's a report where this problem happens in a real-world use case, I'd like to keep the code as it is. And apparently there's always the workaround of disabling the cursor blinking.
Best regards Thomas
Hi,
On 8/5/19 3:02 PM, Feng Tang wrote:
Hi Thomas,
On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
Rong, Feng, could you confirm this by disabling the cursor or blinking?
Glad to know this method restored the drop. Rong is running the case.
I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for both commits, and the regression has no obvious change.
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  testcase/testparams/testbox
----------------  --------------------------  ---------------------------
   %stddev          change  %stddev
     43394            -20%    34575 ± 3%      vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     43393            -20%    34575           GEO-MEAN vm-scalability.median
Best Regards, Rong Chen
Hi
On 05.08.19 at 09:28, Rong Chen wrote:
I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for both commits, and the regression has no obvious change.
Ah, I see. Thank you for testing. There are two questions that come to my mind: did you send the regular output to /dev/null? And what happens if you disable the cursor with 'tput civis'?
If there is absolutely nothing changing on the screen, I don't see how the regression could persist.
Best regards Thomas
Hi,
On 8/5/2019 6:25 PM, Thomas Zimmermann wrote:
Ah, I see. Thank you for testing. There are two questions that come to my mind: did you send the regular output to /dev/null? And what happens if you disable the cursor with 'tput civis'?
I didn't send the output to /dev/null because we need to collect data from it. Actually, we run the benchmark as a background process; do we need to disable the cursor and test again?
Best Regards, Rong Chen
If there is absolutely nothing changing on the screen, I don't see how the regression could persist.
Best regards Thomas
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ---------------- -------------------------- --------------------------- %stddev change %stddev \ | \ 43394 -20% 34575 ± 3% vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43393 -20% 34575 GEO-MEAN vm-scalability.median
Best Regards, Rong Chen
While I have another finds, as I noticed your patch changed the bpp from 24 to 32, I had a patch to change it back to 24, and run the case in the weekend, the -18% regrssion was reduced to about -5%. Could this be related?
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation 01e75fea0d5 mgag200: restore the depth back to 24
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
43921 ± 2% -18.3% 35884 -4.8% 41826 vm-scalability.median 14889337 -17.5% 12291029 -4.1% 14278574 vm-scalability.throughput commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74 Author: Feng Tang feng.tang@intel.com Date: Fri Aug 2 15:09:19 2019 +0800
mgag200: restore the depth back to 24 Signed-off-by: Feng Tang feng.tang@intel.com
diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c index a977333..ac8f6c9 100644 --- a/drivers/gpu/drm/mgag200/mgag200_main.c +++ b/drivers/gpu/drm/mgag200/mgag200_main.c @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags) if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024)) dev->mode_config.preferred_depth = 16; else - dev->mode_config.preferred_depth = 32; + dev->mode_config.preferred_depth = 24; dev->mode_config.prefer_shadow = 1; r = mgag200_modeset_init(mdev);
Thanks, Feng
The difference between mgag200's original fbdev support and the generic fbdev emulation is the generic code's worker task, which updates the VRAM buffer from the shadow buffer. mgag200 does this immediately instead, but relies on drm_can_sleep(), which is deprecated.
I think that the worker task interferes with the test case, as the worker has been in fbdev emulation since forever and no performance regressions have been reported so far.
So unless there's a report where this problem happens in a real-world use case, I'd like to keep the code as it is. And apparently there's always the workaround of disabling cursor blinking.
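For illustration, here is a rough sketch of the two update strategies being compared. This is not the actual driver code: struct my_fbdev and blit_shadow_to_vram() are made-up placeholders, and only drm_can_sleep() and schedule_work() are real kernel interfaces.

/* Old mgag200-style path: blit from the shadow buffer to VRAM right away,
 * but only when sleeping/locking is allowed; updates coming from atomic
 * context are effectively dropped. */
static void old_style_dirty(struct my_fbdev *fb)
{
	if (!drm_can_sleep())
		return;			/* atomic context: skip this update */
	blit_shadow_to_vram(fb);	/* placeholder for the immediate blit */
}

/* Generic fbdev emulation: always hand the update to a worker, which later
 * maps the BO and copies the shadow buffer into VRAM. */
static void generic_dirty(struct my_fbdev *fb)
{
	schedule_work(&fb->dirty_work);
}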
Best regards Thomas
Hi Rong
On 06.08.19 at 14:59, Chen, Rong A wrote:
Hi,
On 8/5/2019 6:25 PM, Thomas Zimmermann wrote:
Hi
On 05.08.19 at 09:28, Rong Chen wrote:
Hi,
On 8/5/19 3:02 PM, Feng Tang wrote:
Hi Thomas,
On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
Hi
I did some further analysis on this problem and found that the blinking cursor affects performance of the vm-scalability test case.
I only have a 4-core machine, so scalability is not really testable. Yet I see the effects of running vm-scalability against drm-tip, a revert of the mgag200 patch, and the vmap fixes that I posted a few days ago.
After reverting the mgag200 patch, running the test as described in the report
bin/lkp run job.yaml
gives results like
2019-08-02 19:34:37 ./case-anon-cow-seq-hugetlb
2019-08-02 19:34:37 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815395225
917319627 bytes / 756534 usecs = 1184110 KB/s
917319627 bytes / 764675 usecs = 1171504 KB/s
917319627 bytes / 766414 usecs = 1168846 KB/s
917319627 bytes / 777990 usecs = 1151454 KB/s
Running the test against current drm-tip gives slightly worse results, such as.
2019-08-03 19:17:06 ./case-anon-cow-seq-hugetlb
2019-08-03 19:17:06 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 871607 usecs = 1027778 KB/s
917318700 bytes / 894173 usecs = 1001840 KB/s
917318700 bytes / 919694 usecs = 974040 KB/s
917318700 bytes / 923341 usecs = 970193 KB/s
The test puts out roughly one result per second. Strangely sending the output to /dev/null can make results significantly worse.
bin/lkp run job.yaml > /dev/null
2019-08-03 19:23:04 ./case-anon-cow-seq-hugetlb
2019-08-03 19:23:04 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 1207358 usecs = 741966 KB/s
917318700 bytes / 1210456 usecs = 740067 KB/s
917318700 bytes / 1216572 usecs = 736346 KB/s
917318700 bytes / 1239152 usecs = 722929 KB/s
I realized that there's still a blinking cursor on the screen, which I disabled with
tput civis
or alternatively
echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
Running the test now gives the original or even better results, such as
bin/lkp run job.yaml > /dev/null
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb
2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 659419 usecs = 1358497 KB/s
917318700 bytes / 659658 usecs = 1358005 KB/s
917318700 bytes / 659916 usecs = 1357474 KB/s
917318700 bytes / 660168 usecs = 1356956 KB/s
Rong, Feng, could you confirm this by disabling the cursor or blinking?
Glad to know this method restored the drop. Rong is running the case.
I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for both commits, and the regression has no obvious change.
Ah, I see. Thank you for testing. There are two questions that come to my mind: did you send the regular output to /dev/null? And what happens if you disable the cursor with 'tput civis'?
I didn't send the output to /dev/null because we need to collect data from the output.
You can send it to any file, as long as it doesn't show up on the console. I also found the latest results in the file result/vm-scalability.
Actually, we run the benchmark as a background process; do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
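To make the data flow concrete, below is a small stand-alone model of the shadow-to-VRAM copy that the worker performs (one memcpy per scanline of the damaged area). The buffer layout and the full-screen damage rectangle are made up for the example; the real kernel helper, drm_fb_helper_dirty_blit_real(), is discussed later in this thread.

/* Illustrative userspace model of the per-scanline copy from the shadow
 * buffer into VRAM. The sizes below (1024x768, 32 bpp, 4096-byte pitch,
 * ~3 MiB total) are only an example. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct fb {
	uint8_t *mem;   /* start of the buffer */
	size_t pitch;   /* bytes per scanline */
};

/* Copy a damage rectangle, one scanline at a time. */
static void blit_rect(struct fb *dst, const struct fb *src,
		      unsigned int x, unsigned int y,
		      unsigned int width, unsigned int height,
		      unsigned int cpp /* bytes per pixel */)
{
	size_t offset = y * src->pitch + x * cpp;
	size_t len = (size_t)width * cpp;

	for (unsigned int line = 0; line < height; line++) {
		memcpy(dst->mem + offset, src->mem + offset, len);
		offset += src->pitch;	/* both buffers share the pitch here */
	}
}

int main(void)
{
	struct fb shadow = { malloc(4096 * 768), 4096 };
	struct fb vram   = { malloc(4096 * 768), 4096 };

	if (!shadow.mem || !vram.mem)
		return 1;

	/* Full-screen damage: what a console scroll roughly amounts to. */
	blit_rect(&vram, &shadow, 0, 0, 1024, 768, 4);

	free(shadow.mem);
	free(vram.mem);
	return 0;
}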
Best regards Thomas
Hi,
On 8/7/19 6:42 PM, Thomas Zimmermann wrote:
Hi Rong
On 06.08.19 at 14:59, Chen, Rong A wrote:
Hi,
On 8/5/2019 6:25 PM, Thomas Zimmermann wrote:
Hi
On 05.08.19 at 09:28, Rong Chen wrote:
Hi,
On 8/5/19 3:02 PM, Feng Tang wrote:
Hi Thomas,
On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
Hi
I did some further analysis on this problem and found that the blinking cursor affects performance of the vm-scalability test case.
I only have a 4-core machine, so scalability is not really testable. Yet I see the effects of running vm-scalability against drm-tip, a revert of the mgag200 patch, and the vmap fixes that I posted a few days ago.
After reverting the mgag200 patch, running the test as described in the report
bin/lkp run job.yaml
gives results like
2019-08-02 19:34:37 ./case-anon-cow-seq-hugetlb
2019-08-02 19:34:37 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815395225
917319627 bytes / 756534 usecs = 1184110 KB/s
917319627 bytes / 764675 usecs = 1171504 KB/s
917319627 bytes / 766414 usecs = 1168846 KB/s
917319627 bytes / 777990 usecs = 1151454 KB/s
Running the test against current drm-tip gives slightly worse results, such as.
2019-08-03 19:17:06 ./case-anon-cow-seq-hugetlb
2019-08-03 19:17:06 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 871607 usecs = 1027778 KB/s
917318700 bytes / 894173 usecs = 1001840 KB/s
917318700 bytes / 919694 usecs = 974040 KB/s
917318700 bytes / 923341 usecs = 970193 KB/s
The test puts out roughly one result per second. Strangely sending the output to /dev/null can make results significantly worse.
bin/lkp run job.yaml > /dev/null
2019-08-03 19:23:04 ./case-anon-cow-seq-hugetlb
2019-08-03 19:23:04 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 1207358 usecs = 741966 KB/s
917318700 bytes / 1210456 usecs = 740067 KB/s
917318700 bytes / 1216572 usecs = 736346 KB/s
917318700 bytes / 1239152 usecs = 722929 KB/s
I realized that there's still a blinking cursor on the screen, which I disabled with
tput civis
or alternatively
echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
Running the test now gives the original or even better results, such as
bin/lkp run job.yaml > /dev/null
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb
2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 659419 usecs = 1358497 KB/s
917318700 bytes / 659658 usecs = 1358005 KB/s
917318700 bytes / 659916 usecs = 1357474 KB/s
917318700 bytes / 660168 usecs = 1356956 KB/s
Rong, Feng, could you confirm this by disabling the cursor or blinking?
Glad to know this method restored the drop. Rong is running the case.
I set "echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink" for both commits, and the regression has no obvious change.
Ah, I see. Thank you for testing. There are two questions that come to my mind: did you send the regular output to /dev/null? And what happens if you disable the cursor with 'tput civis'?
I didn't send the output to /dev/null because we need to collect data from the output,
You can send it to any file, as long as it doesn't show up on the console. I also found the latest results in the file result/vm-scalability.
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  testcase/testparams/testbox
----------------  --------------------------  ---------------------------
     %stddev          change      %stddev
         \              |             \
     43785                        44481       vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     43785                        44481       GEO-MEAN vm-scalability.median
Best Regards, Rong Chen
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ---------------- -------------------------- --------------------------- %stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
1. Disabling cursor blinking doesn't cure the regression.
2. Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -167,7 +167,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 		dev->mode_config.preferred_depth = 16;
 	else
 		dev->mode_config.preferred_depth = 32;
-	dev->mode_config.prefer_shadow = 1;
+	dev->mode_config.prefer_shadow = 0;
And from the perf data, one obvious difference is that the good case doesn't call drm_fb_helper_dirty_work(), while the bad case does.
Thanks, Feng
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ---------------- -------------------------- --------------------------- %stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
- Disabling cursor blinking doesn't cure the regression.
- Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further breakdown of the time consumed by the new code.
drm_fb_helper_dirty_work() calls sequentially:
1. drm_client_buffer_vmap (290 us)
2. drm_fb_helper_dirty_blit_real (19240 us)
3. helper->fb->funcs->dirty() ---> NULL for the mgag200 driver
4. drm_client_buffer_vunmap (215 us)
The average run time is listed after the function names.
From it, we can see that drm_fb_helper_dirty_blit_real() takes too long (about 20 ms for each run). I guess this is the root cause of the regression, as the original code doesn't use this dirty worker.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Thanks, Feng
Hi Thomas,
On Tue, Aug 13, 2019 at 05:36:16PM +0800, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ---------------- -------------------------- --------------------------- %stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
- Disabling cursor blinking doesn't cure the regression.
- Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially
- drm_client_buffer_vmap (290 us)
- drm_fb_helper_dirty_blit_real (19240 us)
- helper->fb->funcs->dirty() ---> NULL for mgag200 driver
- drm_client_buffer_vunmap (215 us)
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Any comments on this? thanks
- Feng
Hi
I was traveling and couldn't reply earlier. Sorry for taking so long.
On 13.08.19 at 11:36, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ---------------- -------------------------- --------------------------- %stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
- Disabling cursor blinking doesn't cure the regression.
- Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially
- drm_client_buffer_vmap (290 us)
- drm_fb_helper_dirty_blit_real (19240 us)
- helper->fb->funcs->dirty() ---> NULL for mgag200 driver
- drm_client_buffer_vunmap (215 us)
It's somewhat different to what I observed, but maybe I just couldn't reproduce the problem correctly.
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
True, the original code uses a temporary buffer, but updates the display immediately.
My guess is that this could be a caching problem. The worker runs on a different CPU, which doesn't have the shadow buffer in cache.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Unfortunately not. Without the shadow buffer, the console's display buffer permanently resides in video memory. It consumes a significant amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave enough room for anything else.
The best option is to not print to the console.
Best regards Thomas
On Fri, 23 Aug 2019 at 03:25, Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
I was traveling and couldn't reply earlier. Sorry for taking so long.
On 13.08.19 at 11:36, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox
%stddev change %stddev \ | \ 43785 44481
vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
- Disabling cursor blinking doesn't cure the regression.
- Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially
- drm_client_buffer_vmap (290 us)
- drm_fb_helper_dirty_blit_real (19240 us)
- helper->fb->funcs->dirty() ---> NULL for mgag200 driver
- drm_client_buffer_vunmap (215 us)
It's somewhat different to what I observed, but maybe I just couldn't reproduce the problem correctly.
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
True, the original code uses a temporary buffer, but updates the display immediately.
My guess is that this could be a caching problem. The worker runs on a different CPU, which doesn't have the shadow buffer in cache.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Unfortunately not. Without the shadow buffer, the console's display buffer permanently resides in video memory. It consumes a significant amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave enough room for anything else.
The best option is to not print to the console.
Wait a second, I thought the driver did an eviction of the scanned-out object on modeset. This was a deliberate design decision made when writing those drivers; has this been removed in favour of GEM and generic code paths?
Dave.
Hi
On 22.08.19 at 22:02, Dave Airlie wrote:
On Fri, 23 Aug 2019 at 03:25, Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
I was traveling and couldn't reply earlier. Sorry for taking so long.
On 13.08.19 at 11:36, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
> Actually we run the benchmark as a background process, do we need to > disable the cursor and test again? There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox
%stddev change %stddev \ | \ 43785 44481
vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
- Disabling cursor blinking doesn't cure the regression.
- Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially
- drm_client_buffer_vmap (290 us)
- drm_fb_helper_dirty_blit_real (19240 us)
- helper->fb->funcs->dirty() ---> NULL for mgag200 driver
- drm_client_buffer_vunmap (215 us)
It's somewhat different to what I observed, but maybe I just couldn't reproduce the problem correctly.
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
True, the original code uses a temporary buffer, but updates the display immediately.
My guess is that this could be a caching problem. The worker runs on a different CPU, which doesn't have the shadow buffer in cache.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Unfortunately not. Without the shadow buffer, the console's display buffer permanently resides in video memory. It consumes a significant amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave enough room for anything else.
The best option is to not print to the console.
Wait a second, I thought the driver did an eviction of the scanned-out object on modeset. This was a deliberate design decision made when writing those drivers; has this been removed in favour of GEM and generic code paths?
Yes. We added back this feature for testing in [1]. It was only an improvement of ~1% compared to the original report. I wouldn't mind landing this patch set, but it probably doesn't make a difference either.
Best regards Thomas
[1] https://lists.freedesktop.org/archives/dri-devel/2019-August/228950.html
Hi Thomas,
On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
Hi
I was traveling and couldn't reply earlier. Sorry for taking so long.
No problem! I guessed so :)
On 13.08.19 at 11:36, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
Actually we run the benchmark as a background process, do we need to disable the cursor and test again?
There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ---------------- -------------------------- --------------------------- %stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
- Disabling cursor blinking doesn't cure the regression.
- Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially
- drm_client_buffer_vmap (290 us)
- drm_fb_helper_dirty_blit_real (19240 us)
- helper->fb->funcs->dirty() ---> NULL for mgag200 driver
- drm_client_buffer_vunmap (215 us)
It's somewhat different to what I observed, but maybe I just couldn't reproduce the problem correctly.
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
True, the original code uses a temporary buffer, but updates the display immediately.
My guess is that this could be a caching problem. The worker runs on a different CPU, which doesn't have the shadow buffer in cache.
Yes, that's my thought too. I profiled the working set size: for most calls to drm_fb_helper_dirty_blit_real(), it updates a 4096x768 buffer (3 MB), and as it is called 30~40 times per second, it surely affects the cache.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Unfortunately not. Without the shadow buffer, the console's display buffer permanently resides in video memory. It consumes a significant amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave enough room for anything else.
The best option is to not print to the console.
Do we have other options here?
My thought is that this is clearly a regression: the old driver works fine, while the new version in linux-next doesn't. Also, for a framebuffer console, writing dozens of lines of messages to it is not a rare use case. We have many test platforms (servers/desktops/laptops) with different kinds of GFX hardware, and this model has worked fine for many years :)
Thanks, Feng
Hi Feng
On 24.08.19 at 07:16, Feng Tang wrote:
Hi Thomas,
On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
Hi
I was traveling and couldn't reply earlier. Sorry for taking so long.
No problem! I guessed so :)
On 13.08.19 at 11:36, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
> Actually we run the benchmark as a background process, do > we need to disable the cursor and test again? There's a worker thread that updates the display from the shadow buffer. The blinking cursor periodically triggers the worker thread, but the actual update is just the size of one character.
The point of the test without output is to see if the regression comes from the buffer update (i.e., the memcpy from shadow buffer to VRAM), or from the worker thread. If the regression goes away after disabling the blinking cursor, then the worker thread is the problem. If it already goes away if there's simply no output from the test, the screen update is the problem. On my machine I have to disable the blinking cursor, so I think the worker causes the performance drop.
We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ----------------
%stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
1. Disabling cursor blinking doesn't cure the regression.
2. Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially 1. drm_client_buffer_vmap (290 us) 2. drm_fb_helper_dirty_blit_real (19240 us) 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver 4. drm_client_buffer_vunmap (215 us)
It's somewhat different to what I observed, but maybe I just couldn't reproduce the problem correctly.
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
True, the original code uses a temporary buffer, but updates the display immediately.
My guess is that this could be a caching problem. The worker runs on a different CPU, which doesn't have the shadow buffer in cache.
Yes, that's my thought too. I profiled the working set size: for most calls to drm_fb_helper_dirty_blit_real(), it updates a 4096x768 buffer (3 MB), and as it is called 30~40 times per second, it surely affects the cache.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Unfortunately not. Without the shadow buffer, the console's display buffer permanently resides in video memory. It consumes a significant amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave enough room for anything else.
The best option is to not print to the console.
Do we have other options here?
I attached two patches. Both show an improvement in my setup at least. Could you please test them independently from each other and report back?
prefetch.patch prefetches the shadow buffer two scanlines ahead during the blit function. The idea is to have the scanlines in cache when they are supposed to go to hardware.
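For illustration only, the prefetching idea could look roughly like the following plain-C sketch. This is not the attached prefetch.patch itself; the function and its parameters are generic placeholders, and it uses the GCC/Clang __builtin_prefetch builtin, where kernel code would use the prefetch() helper from <linux/prefetch.h>.

#include <stddef.h>
#include <string.h>

/* Sketch of the technique only: copy scanlines while hinting the CPU to
 * pull the line two rows ahead into cache. */
static void blit_with_prefetch(void *dst, const void *src,
			       size_t pitch, unsigned int lines, size_t len)
{
	const char *s = src;
	char *d = dst;

	for (unsigned int i = 0; i < lines; i++) {
		if (i + 2 < lines)
			__builtin_prefetch(s + 2 * pitch); /* two scanlines ahead */
		memcpy(d, s, len);
		s += pitch;
		d += pitch;
	}
}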
schedule.patch schedules the dirty worker on the current CPU core (i.e., the one that did the drawing to the shadow buffer). Hopefully the shadow buffer remains in cache meanwhile.
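The second idea could take roughly the following shape; this is a guess at the change for illustration, not the attached schedule.patch. schedule_work_on(), get_cpu()/put_cpu() and drm_fb_helper's dirty_work member are real kernel symbols; the wrapper function is made up.

#include <linux/smp.h>
#include <linux/workqueue.h>
#include <drm/drm_fb_helper.h>

static void queue_dirty_work_on_local_cpu(struct drm_fb_helper *helper)
{
	int cpu = get_cpu();	/* the CPU that just dirtied the shadow buffer */

	/* Instead of schedule_work(&helper->dirty_work): keep the worker on
	 * the same core so the shadow buffer is still warm in its caches. */
	schedule_work_on(cpu, &helper->dirty_work);
	put_cpu();
}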
Best regards Thomas
Hi Thomas,
On 8/26/2019 6:50 PM, Thomas Zimmermann wrote:
Hi Feng
On 24.08.19 at 07:16, Feng Tang wrote:
Hi Thomas,
On Thu, Aug 22, 2019 at 07:25:11PM +0200, Thomas Zimmermann wrote:
Hi
I was traveling and couldn't reply earlier. Sorry for taking so long.
No problem! I guessed so :)
On 13.08.19 at 11:36, Feng Tang wrote:
Hi Thomas,
On Mon, Aug 12, 2019 at 03:25:45PM +0800, Feng Tang wrote:
Hi Thomas,
On Fri, Aug 09, 2019 at 04:12:29PM +0800, Rong Chen wrote:
Hi,
>> Actually we run the benchmark as a background process, do >> we need to disable the cursor and test again? > There's a worker thread that updates the display from the > shadow buffer. The blinking cursor periodically triggers > the worker thread, but the actual update is just the size > of one character. > > The point of the test without output is to see if the > regression comes from the buffer update (i.e., the memcpy > from shadow buffer to VRAM), or from the worker thread. If > the regression goes away after disabling the blinking > cursor, then the worker thread is the problem. If it > already goes away if there's simply no output from the > test, the screen update is the problem. On my machine I > have to disable the blinking cursor, so I think the worker > causes the performance drop. We disabled redirecting stdout/stderr to /dev/kmsg, and the regression is gone.
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde testcase/testparams/testbox ----------------
%stddev change %stddev \ | \ 43785 44481 vm-scalability/300s-8T-anon-cow-seq-hugetlb/lkp-knm01 43785 44481 GEO-MEAN vm-scalability.median
Till now, from Rong's tests:
1. Disabling cursor blinking doesn't cure the regression.
2. Disabling printing of test results to the console can work around the regression.
Also, if we set prefer_shadow to 0, the regression is also gone.
We also did some further break down for the time consumed by the new code.
The drm_fb_helper_dirty_work() calls sequentially 1. drm_client_buffer_vmap (290 us) 2. drm_fb_helper_dirty_blit_real (19240 us) 3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver 4. drm_client_buffer_vunmap (215 us)
It's somewhat different to what I observed, but maybe I just couldn't reproduce the problem correctly.
The average run time is listed after the function names.
From it, we can see drm_fb_helper_dirty_blit_real() takes too long time (about 20ms for each run). I guess this is the root cause of this regression, as the original code doesn't use this dirty worker.
True, the original code uses a temporary buffer, but updates the display immediately.
My guess is that this could be a caching problem. The worker runs on a different CPU, which doesn't have the shadow buffer in cache.
Yes, that's my thought too. I profiled the working set size: for most calls to drm_fb_helper_dirty_blit_real(), it updates a 4096x768 buffer (3 MB), and as it is called 30~40 times per second, it surely affects the cache.
As said in the last email, setting prefer_shadow to 0 can avoid the regression. Could it be an option?
Unfortunately not. Without the shadow buffer, the console's display buffer permanently resides in video memory. It consumes a significant amount of that memory (say 8 MiB out of 16 MiB). That doesn't leave enough room for anything else.
The best option is to not print to the console.
Do we have other options here?
I attached two patches. Both show an improvement in my setup at least. Could you please test them independently from each other and report back?
prefetch.patch prefetches the shadow buffer two scanlines ahead during the blit function. The idea is to have the scanlines in cache when they are supposed to go to hardware.
schedule.patch schedules the dirty worker on the current CPU core (i.e., the one that did the drawing to the shadow buffer). Hopefully the shadow buffer remains in cache meanwhile.
Best regards Thomas
Both patches have little impact on the performance from our side.
prefetch.patch:

commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
  77459f56994 prefetch shadow buffer two lines ahead of blit offset

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  77459f56994ab87ee5459920b3  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
     %stddev          change      %stddev         change      %stddev
         \              |             \              |            \
     42912             -15%       36517             -17%       35515       vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%       36517             -17%       35515       GEO-MEAN vm-scalability.median
schedule.patch:

commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
  ccc5f095c61 schedule dirty worker on local core

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  ccc5f095c61ff6eded0f0ab1b7  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
     %stddev          change      %stddev         change      %stddev
         \              |             \              |            \
     42912             -15%       36517             -15%       36556 ± 4%  vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%       36517             -15%       36556       GEO-MEAN vm-scalability.median
Best Regards, Rong Chen
Hi
On 27.08.19 at 14:33, Chen, Rong A wrote:
Both patches have little impact on the performance from our side.
Thanks for testing. Too bad they don't solve the issue.
There's another patch attached. Could you please test this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display far fewer frames than the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
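A minimal sketch of what such a gate could look like in the generic damage path is below. It only illustrates the idea described here (skip the worker when called from atomic context), not the actual attached patch; drm_can_sleep() and schedule_work() are real kernel APIs, while the wrapper function is schematic.

#include <drm/drm_fb_helper.h>
#include <drm/drm_util.h>
#include <linux/workqueue.h>

static void fbdev_mark_dirty(struct drm_fb_helper *helper)
{
	/*
	 * Mimic the old mgag200 behaviour: if we were called from atomic
	 * context (e.g. a printk from interrupt context hitting fbcon),
	 * skip the update instead of queuing the worker. This trades
	 * dropped frames for speed.
	 */
	if (!drm_can_sleep())
		return;

	schedule_work(&helper->dirty_work);
}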
Best regards Thomas
Hi Thomas,
On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
Hi
On 27.08.19 at 14:33, Chen, Rong A wrote:
Both patches have little impact on the performance from our side.
Thanks for testing. Too bad they don't solve the issue.
There's another patch attached. Could you please test this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display far fewer frames than the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
It's cool, the patch "usecansleep.patch" can fix the issue.
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
  b976b04c2bc only schedule worker from non-atomic context

f1f8555dfb9a70a2  90f479ae51afa45efab97afdde  b976b04c2bcf33148d6c7bc1a2  testcase/testparams/testbox
----------------  --------------------------  --------------------------  ---------------------------
     %stddev          change      %stddev         change      %stddev
         \              |             \              |            \
     42912             -15%       36517                        44093       vm-scalability/performance-300s-8T-anon-cow-seq-hugetlb/lkp-knm01
     42912             -15%       36517                        44093       GEO-MEAN vm-scalability.median
Best Regards, Rong Chen
Hi
On 28.08.19 at 11:37, Rong Chen wrote:
Hi Thomas,
On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
Hi
On 27.08.19 at 14:33, Chen, Rong A wrote:
Both patches have little impact on the performance from our side.
Thanks for testing. Too bad they don't solve the issue.
There's another patch attached. Could you please test this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display far fewer frames than the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
It's cool, the patch "usecansleep.patch" can fix the issue.
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Best regards Thomas
Hi Thomas,
On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
Hi
Am 28.08.19 um 11:37 schrieb Rong Chen:
Hi Thomas,
On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
Hi
Am 27.08.19 um 14:33 schrieb Chen, Rong A:
Both patches have little impact on the performance from our side.
Thanks for testing. Too bad they don't solve the issue.
There's another patch attached. Could you please test this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display far fewer frames than the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
It's cool, the patch "usecansleep.patch" can fix the issue.
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info. So the original code skips time-consuming work inside atomic context on purpose. Is there any room to optimise it? If two scheduled update workers are handled at almost the same time, can one be skipped?
Thanks, Feng
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info. So the original code skips time-consuming work inside atomic context on purpose. Is there any room to optimise it? If two scheduled update workers are handled at almost the same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance has started will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
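For reference, this coalescing falls out of the workqueue API itself; here is a small schematic sketch. The return-value semantics of schedule_work() are real (false means the work item was already pending), while the wrapper function is made up for illustration.

#include <drm/drm_fb_helper.h>
#include <linux/workqueue.h>

static void queue_update(struct drm_fb_helper *helper)
{
	if (!schedule_work(&helper->dirty_work)) {
		/* Already pending: the queued run will also pick up this
		 * damage, so there is nothing more to do here. */
	}
}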
Best regards Thomas
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info. So the original code skips time-consuming work inside atomic context on purpose. Is there any room to optimise it? If two scheduled update workers are handled at almost the same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance has started will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most frequent fbcon update from atomic context is the blinking cursor. If you disable that one, you should be back to the old performance level, I think, since just writing to dmesg happens from process context, so that shouldn't change.
https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinki...
Bunch of tricks, but tbh I haven't tested them.
In any case, I still strongly advise that you don't print anything to dmesg or fbcon while benchmarking, because dmesg/printf are anything but fast, especially if a gpu driver is involved. There are some efforts to make the dmesg/printk side less painful (untangling the console_lock from printk), but fundamentally printing to the gpu from the kernel through dmesg/fbcon won't be cheap. It's just not something we optimize beyond "make sure it works for emergencies". -Daniel
Best regards Thomas
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then the old driver should also have done most of its updates in non-atomic context?
One other thing: I profiled that updating a 3MB shadow buffer takes 20 ms, which translates to 150 MB/s of bandwidth. Could it be related to the cache settings of the DRM shadow buffer? Say, did the original code use a cacheable buffer?
https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinki...
Bunch of tricks, but tbh I haven't tested them.
Thomas has suggested disabling the cursor with: echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
We tried that, but there was no change in the performance data.
Thanks, Feng
In any case, I still strongly advice you don't print anything to dmesg or fbcon while benchmarking, because dmesg/printf are anything but fast, especially if a gpu driver is involved. There's some efforts to make the dmesg/printk side less painful (untangling the console_lock from printk), but fundamentally printing to the gpu from the kernel through dmesg/fbcon won't be cheap. It's just not something we optimize beyond "make sure it works for emergencies". -Daniel
Best regards Thomas
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Hi
On 04.09.19 at 10:35, Feng Tang wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinki...
Bunch of tricks, but tbh I haven't tested them.
Thomas has suggested to disable curson by echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
We tried that way, and no change for the performance data.
There are several ways of disabling the cursor. On my test system, I entered
tput civis
before the test and got better performance. Did you try this as well?
Best regards Thomas
Thanks, Feng
In any case, I still strongly advice you don't print anything to dmesg or fbcon while benchmarking, because dmesg/printf are anything but fast, especially if a gpu driver is involved. There's some efforts to make the dmesg/printk side less painful (untangling the console_lock from printk), but fundamentally printing to the gpu from the kernel through dmesg/fbcon won't be cheap. It's just not something we optimize beyond "make sure it works for emergencies". -Daniel
Best regards Thomas
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Hi Thomas,
On 9/4/2019 4:43 PM, Thomas Zimmermann wrote:
Hi
On 04.09.19 at 10:35, Feng Tang wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinki...
Bunch of tricks, but tbh I haven't tested them.
Thomas has suggested to disable curson by echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
We tried that way, and no change for the performance data.
There are several ways of disabling the cursor. On my test system, I entered
tput civis
before the test and got better performance. Did you try this as well?
There's no obvious change on our system.
Best Regards, Rong Chen
Best regards Thomas
Thanks, Feng
In any case, I still strongly advice you don't print anything to dmesg or fbcon while benchmarking, because dmesg/printf are anything but fast, especially if a gpu driver is involved. There's some efforts to make the dmesg/printk side less painful (untangling the console_lock from printk), but fundamentally printing to the gpu from the kernel through dmesg/fbcon won't be cheap. It's just not something we optimize beyond "make sure it works for emergencies". -Daniel
Best regards Thomas
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Wed, Sep 4, 2019 at 10:35 AM Feng Tang feng.tang@intel.com wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
Hm, that would indicate the write-combining got broken somewhere. This should definitely be faster. Also we shouldn't transfer the whole thing, except when scrolling ...
https://unix.stackexchange.com/questions/3759/how-to-stop-cursor-from-blinki...
Bunch of tricks, but tbh I haven't tested them.
Thomas has suggested to disable curson by echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
We tried that way, and no change for the performance data.
Huh, if there are other atomic contexts for fbcon updates then I'm not aware of them ... and if it's all the updates, then you wouldn't see a whole lot on your screen, with either the old or the new fbdev support in mgag200. I'm a bit confused ... -Daniel
Thanks, Feng
In any case, I still strongly advice you don't print anything to dmesg or fbcon while benchmarking, because dmesg/printf are anything but fast, especially if a gpu driver is involved. There's some efforts to make the dmesg/printk side less painful (untangling the console_lock from printk), but fundamentally printing to the gpu from the kernel through dmesg/fbcon won't be cheap. It's just not something we optimize beyond "make sure it works for emergencies". -Daniel
Best regards Thomas
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Wed, 4 Sep 2019 at 19:17, Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Sep 4, 2019 at 10:35 AM Feng Tang feng.tang@intel.com wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
Hm, that would indicate the write-combining got broken somewhere. This should definitely be faster. Also we shouldn't transfer the hole thing, except when scrolling ...
First rule of fbcon usage, you are always effectively scrolling.
Also these devices might be on a PCIE 1x piece of wet string, not sure if the numbers reflect that.
Dave.
On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 4 Sep 2019 at 19:17, Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Sep 4, 2019 at 10:35 AM Feng Tang feng.tang@intel.com wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
Hm, that would indicate the write-combining got broken somewhere. This should definitely be faster. Also we shouldn't transfer the hole thing, except when scrolling ...
First rule of fbcon usage, you are always effectively scrolling.
Also these devices might be on a PCIE 1x piece of wet string, not sure if the numbers reflect that.
pcie 1x 1.0 is 250MB/s (2.5 GT/s with 8b/10b encoding), so yeah, with a bit of inefficiency and overhead it's not entirely out of the question that 150MB/s is actually the hw limit. If it's really pcie 1x 1.0, no idea where to check that. Also it might be worth double-checking that the gpu pci bar is listed as wc in debugfs/x86/pat_memtype_list. -Daniel
Hi Vetter,
On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 4 Sep 2019 at 19:17, Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Sep 4, 2019 at 10:35 AM Feng Tang feng.tang@intel.com wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
Am 04.09.19 um 08:27 schrieb Feng Tang: >> Thank you for testing. But don't get too excited, because the patch >> simulates a bug that was present in the original mgag200 code. A >> significant number of frames are simply skipped. That is apparently the >> reason why it's faster. > > Thanks for the detailed info, so the original code skips time-consuming > work inside atomic context on purpose. Is there any space to optmise it? > If 2 scheduled update worker are handled at almost same time, can one be > skipped?
To my knowledge, there's only one instance of the worker. Re-scheduling the worker before a previous instance started, will not create a second instance. The worker's instance will complete all pending updates. So in some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
Hm, that would indicate the write-combining got broken somewhere. This should definitely be faster. Also we shouldn't transfer the hole thing, except when scrolling ...
First rule of fbcon usage, you are always effectively scrolling.
Also these devices might be on a PCIE 1x piece of wet string, not sure if the numbers reflect that.
pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and overhead not entirely out of the question that 150MB/s is actually the hw limit. If it's really pcie 1x 1.0, no idea where to check that. Also might be worth to double-check that the gpu pci bar is listed as wc in debugfs/x86/pat_memtype_list.
Here is a dump of the device info and the pat_memtype_list, taken while the machine is running another 0day task:
controller info ================= 03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller]) Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1) Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 16 NUMA node: 0 Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M] Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K] Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M] Expansion ROM at 000c0000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000 Kernel driver in use: mgag200 Kernel modules: mgag200
Related pat setting
===================
uncached-minus @ 0xc0000000-0xc0001000
uncached-minus @ 0xc0000000-0xd0000000
uncached-minus @ 0xc0008000-0xc0009000
uncached-minus @ 0xc0009000-0xc000a000
uncached-minus @ 0xc0010000-0xc0011000
uncached-minus @ 0xc0011000-0xc0012000
uncached-minus @ 0xc0012000-0xc0013000
uncached-minus @ 0xc0013000-0xc0014000
uncached-minus @ 0xc0018000-0xc0019000
uncached-minus @ 0xc0019000-0xc001a000
uncached-minus @ 0xc001a000-0xc001b000
write-combining @ 0xd0000000-0xd0300000
write-combining @ 0xd0000000-0xd1000000
uncached-minus @ 0xd1800000-0xd1804000
uncached-minus @ 0xd1900000-0xd1980000
uncached-minus @ 0xd1980000-0xd1981000
uncached-minus @ 0xd1a00000-0xd1a80000
uncached-minus @ 0xd1a80000-0xd1a81000
uncached-minus @ 0xd1f10000-0xd1f11000
uncached-minus @ 0xd1f11000-0xd1f12000
uncached-minus @ 0xd1f12000-0xd1f13000
Host bridge info ================ 00:00.0 Host bridge: Intel Corporation Device 7853 Subsystem: Intel Corporation Device 0000 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 0 NUMA node: 0 Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [e0] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?> Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?> Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?> Capabilities: [250 v1] #19 Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?> Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
Thanks, Feng
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Sep 5, 2019 at 8:58 AM Feng Tang feng.tang@intel.com wrote:
Hi Vetter,
On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 4 Sep 2019 at 19:17, Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Sep 4, 2019 at 10:35 AM Feng Tang feng.tang@intel.com wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote: > > Hi > > Am 04.09.19 um 08:27 schrieb Feng Tang: > >> Thank you for testing. But don't get too excited, because the patch > >> simulates a bug that was present in the original mgag200 code. A > >> significant number of frames are simply skipped. That is apparently the > >> reason why it's faster. > > > > Thanks for the detailed info, so the original code skips time-consuming > > work inside atomic context on purpose. Is there any space to optmise it? > > If 2 scheduled update worker are handled at almost same time, can one be > > skipped? > > To my knowledge, there's only one instance of the worker. Re-scheduling > the worker before a previous instance started, will not create a second > instance. The worker's instance will complete all pending updates. So in > some way, skipping workers already happens.
So I think that the most often fbcon update from atomic context is the blinking cursor. If you disable that one you should be back to the old performance level I think, since just writing to dmesg is from process context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
Hm, that would indicate the write-combining got broken somewhere. This should definitely be faster. Also we shouldn't transfer the hole thing, except when scrolling ...
First rule of fbcon usage, you are always effectively scrolling.
Also these devices might be on a PCIE 1x piece of wet string, not sure if the numbers reflect that.
pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and overhead not entirely out of the question that 150MB/s is actually the hw limit. If it's really pcie 1x 1.0, no idea where to check that. Also might be worth to double-check that the gpu pci bar is listed as wc in debugfs/x86/pat_memtype_list.
Here is some dump of the device info and the pat_memtype_list, while it is running other 0day task:
Looks all good, I guess Dave is right with this probably only being a real slow, real old pcie link, plus maybe some inefficiencies in the mapping. Your 150MB/s, was that just the copy, or did you include all the setup/map/unmap/teardown too in your measurement in the trace? -Daniel
controller info
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller]) Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1) Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 16 NUMA node: 0 Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M] Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K] Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M] Expansion ROM at 000c0000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000 Kernel driver in use: mgag200 Kernel modules: mgag200
Related pat setting
uncached-minus @ 0xc0000000-0xc0001000 uncached-minus @ 0xc0000000-0xd0000000 uncached-minus @ 0xc0008000-0xc0009000 uncached-minus @ 0xc0009000-0xc000a000 uncached-minus @ 0xc0010000-0xc0011000 uncached-minus @ 0xc0011000-0xc0012000 uncached-minus @ 0xc0012000-0xc0013000 uncached-minus @ 0xc0013000-0xc0014000 uncached-minus @ 0xc0018000-0xc0019000 uncached-minus @ 0xc0019000-0xc001a000 uncached-minus @ 0xc001a000-0xc001b000 write-combining @ 0xd0000000-0xd0300000 write-combining @ 0xd0000000-0xd1000000 uncached-minus @ 0xd1800000-0xd1804000 uncached-minus @ 0xd1900000-0xd1980000 uncached-minus @ 0xd1980000-0xd1981000 uncached-minus @ 0xd1a00000-0xd1a80000 uncached-minus @ 0xd1a80000-0xd1a81000 uncached-minus @ 0xd1f10000-0xd1f11000 uncached-minus @ 0xd1f11000-0xd1f12000 uncached-minus @ 0xd1f12000-0xd1f13000
Host bridge info
00:00.0 Host bridge: Intel Corporation Device 7853 Subsystem: Intel Corporation Device 0000 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 0 NUMA node: 0 Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [e0] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?> Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?> Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?> Capabilities: [250 v1] #19 Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?> Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
Thanks, Feng
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Sep 05, 2019 at 06:37:47PM +0800, Daniel Vetter wrote:
On Thu, Sep 5, 2019 at 8:58 AM Feng Tang feng.tang@intel.com wrote:
Hi Vetter,
On Wed, Sep 04, 2019 at 01:20:29PM +0200, Daniel Vetter wrote:
On Wed, Sep 4, 2019 at 1:15 PM Dave Airlie airlied@gmail.com wrote:
On Wed, 4 Sep 2019 at 19:17, Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Sep 4, 2019 at 10:35 AM Feng Tang feng.tang@intel.com wrote:
Hi Daniel,
On Wed, Sep 04, 2019 at 10:11:11AM +0200, Daniel Vetter wrote: > On Wed, Sep 4, 2019 at 8:53 AM Thomas Zimmermann tzimmermann@suse.de wrote: > > > > Hi > > > > Am 04.09.19 um 08:27 schrieb Feng Tang: > > >> Thank you for testing. But don't get too excited, because the patch > > >> simulates a bug that was present in the original mgag200 code. A > > >> significant number of frames are simply skipped. That is apparently the > > >> reason why it's faster. > > > > > > Thanks for the detailed info, so the original code skips time-consuming > > > work inside atomic context on purpose. Is there any space to optmise it? > > > If 2 scheduled update worker are handled at almost same time, can one be > > > skipped? > > > > To my knowledge, there's only one instance of the worker. Re-scheduling > > the worker before a previous instance started, will not create a second > > instance. The worker's instance will complete all pending updates. So in > > some way, skipping workers already happens. > > So I think that the most often fbcon update from atomic context is the > blinking cursor. If you disable that one you should be back to the old > performance level I think, since just writing to dmesg is from process > context, so shouldn't change.
Hmm, then for the old driver, it should also do the most update in non-atomic context?
One other thing is, I profiled that updating a 3MB shadow buffer needs 20 ms, which transfer to 150 MB/s bandwidth. Could it be related with the cache setting of DRM shadow buffer? say the orginal code use a cachable buffer?
Hm, that would indicate the write-combining got broken somewhere. This should definitely be faster. Also we shouldn't transfer the hole thing, except when scrolling ...
First rule of fbcon usage, you are always effectively scrolling.
Also these devices might be on a PCIE 1x piece of wet string, not sure if the numbers reflect that.
pcie 1x 1.0 is 250MB/s, so yeah with a bit of inefficiency and overhead not entirely out of the question that 150MB/s is actually the hw limit. If it's really pcie 1x 1.0, no idea where to check that. Also might be worth to double-check that the gpu pci bar is listed as wc in debugfs/x86/pat_memtype_list.
Here is some dump of the device info and the pat_memtype_list, while it is running other 0day task:
Looks all good, I guess Dave is right with this probably only being a real slow, real old pcie link, plus maybe some inefficiencies in the mapping. Your 150MB/s, was that just the copy, or did you include all the setup/map/unmap/teardown too in your measurement in the trace?
The following is the breakdown; the 19240 us is the memory copy time.
The drm_fb_helper_dirty_work() calls sequentially:
1. drm_client_buffer_vmap (290 us)
2. drm_fb_helper_dirty_blit_real (19240 us)
3. helper->fb->funcs->dirty() ---> NULL for mgag200 driver
4. drm_client_buffer_vunmap (215 us)
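Put differently, the deferred worker roughly performs these steps (a simplified sketch based on the call sequence above, annotated with the measured costs; the signatures only approximate the v5.3-era helpers and are not verbatim kernel code):

#include <linux/err.h>
#include <drm/drm_client.h>
#include <drm/drm_fb_helper.h>
#include <drm/drm_framebuffer.h>

static void dirty_work_sketch(struct drm_fb_helper *helper,
			      struct drm_clip_rect *clip)
{
	void *vaddr;

	vaddr = drm_client_buffer_vmap(helper->buffer);		/* ~290 us   */
	if (IS_ERR(vaddr))
		return;

	/* copy the damaged area from the shadow buffer into VRAM */
	drm_fb_helper_dirty_blit_real(helper, clip);		/* ~19240 us */

	if (helper->fb->funcs->dirty)				/* NULL for mgag200 */
		helper->fb->funcs->dirty(helper->fb, NULL, 0, 0, clip, 1);

	drm_client_buffer_vunmap(helper->buffer);		/* ~215 us   */
}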
Thanks, Feng
-Daniel
controller info
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 05) (prog-if 00 [VGA controller]) Subsystem: Intel Corporation MGA G200e [Pilot] ServerEngines (SEP1) Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 16 NUMA node: 0 Region 0: Memory at d0000000 (32-bit, prefetchable) [size=16M] Region 1: Memory at d1800000 (32-bit, non-prefetchable) [size=16K] Region 2: Memory at d1000000 (32-bit, non-prefetchable) [size=8M] Expansion ROM at 000c0000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [e4] Express (v1) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000 Kernel driver in use: mgag200 Kernel modules: mgag200
Related pat setting
uncached-minus @ 0xc0000000-0xc0001000 uncached-minus @ 0xc0000000-0xd0000000 uncached-minus @ 0xc0008000-0xc0009000 uncached-minus @ 0xc0009000-0xc000a000 uncached-minus @ 0xc0010000-0xc0011000 uncached-minus @ 0xc0011000-0xc0012000 uncached-minus @ 0xc0012000-0xc0013000 uncached-minus @ 0xc0013000-0xc0014000 uncached-minus @ 0xc0018000-0xc0019000 uncached-minus @ 0xc0019000-0xc001a000 uncached-minus @ 0xc001a000-0xc001b000 write-combining @ 0xd0000000-0xd0300000 write-combining @ 0xd0000000-0xd1000000 uncached-minus @ 0xd1800000-0xd1804000 uncached-minus @ 0xd1900000-0xd1980000 uncached-minus @ 0xd1980000-0xd1981000 uncached-minus @ 0xd1a00000-0xd1a80000 uncached-minus @ 0xd1a80000-0xd1a81000 uncached-minus @ 0xd1f10000-0xd1f11000 uncached-minus @ 0xd1f11000-0xd1f12000 uncached-minus @ 0xd1f12000-0xd1f13000
Host bridge info
00:00.0 Host bridge: Intel Corporation Device 7853 Subsystem: Intel Corporation Device 0000 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 0 NUMA node: 0 Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L1, Exit Latency L0s <512ns, L1 <4us ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt- RootCtl: ErrCorrectable+ ErrNon-Fatal+ ErrFatal+ PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range BCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [e0] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [100 v1] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?> Capabilities: [144 v1] Vendor Specific Information: ID=0004 Rev=1 Len=03c <?> Capabilities: [1d0 v1] Vendor Specific Information: ID=0003 Rev=1 Len=00a <?> Capabilities: [250 v1] #19 Capabilities: [280 v1] Vendor Specific Information: ID=0005 Rev=3 Len=018 <?> Capabilities: [298 v1] Vendor Specific Information: ID=0007 Rev=0 Len=024 <?>
Thanks, Feng
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Hi Thomas,
On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
Hi
On 28.08.19 at 11:37, Rong Chen wrote:
Hi Thomas,
On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
Hi
On 27.08.19 at 14:33, Chen, Rong A wrote:
Both patches have little impact on the performance from our side.
Thanks for testing. Too bad they doesn't solve the issue.
There's another patch attached. Could you please tests this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display much less frames that the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
It's cool, the patch "usecansleep.patch" can fix the issue.
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
We discussed ideas on IRC and decided that screen updates could be synchronized with vblank intervals. This may give some rate limiting to the output.
If you like, you could try the patch set at [1]. It adds the respective code to console and mgag200.
Best regards Thomas
[1] https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
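For what it's worth, here is a very small sketch of how the rate-limiting idea mentioned above could look in principle (this is not the patch set from [1]; the 16 ms interval and the function name are assumptions, while schedule_delayed_work() and msecs_to_jiffies() are the real kernel APIs used):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Hypothetical rate limiter: instead of copying the shadow buffer for
 * every damage report, defer the copy by roughly one refresh interval,
 * so at most one update per interval reaches the hardware. */
#define REFRESH_INTERVAL_MS	16	/* ~60 Hz, standing in for the vblank period */

static void report_damage_rate_limited(struct delayed_work *dirty_work)
{
	/* A no-op if the work is already queued, so a burst of damage
	 * collapses into a single deferred update. */
	schedule_delayed_work(dirty_work, msecs_to_jiffies(REFRESH_INTERVAL_MS));
}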
Thanks, Feng
Best regards Thomas
Hi Thomas,
On Mon, Sep 09, 2019 at 04:12:37PM +0200, Thomas Zimmermann wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Hi Thomas,
On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
Hi
On 28.08.19 at 11:37, Rong Chen wrote:
Hi Thomas,
On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
Hi
On 27.08.19 at 14:33, Chen, Rong A wrote:
Both patches have little impact on the performance from our side.
Thanks for testing. Too bad they doesn't solve the issue.
There's another patch attached. Could you please tests this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display much less frames that the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
It's cool, the patch "usecansleep.patch" can fix the issue.
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
We discussed ideas on IRC and decided that screen updates could be synchronized with vblank intervals. This may give some rate limiting to the output.
If you like, you could try the patch set at [1]. It adds the respective code to console and mgag200.
I just tried the 2 patches; there is no obvious change (compared to the 18.8% regression), both in the overall benchmark and in the micro-profiling.
    90f479ae51afa45e  04a0983095feaee022cdd65e3e4
    ----------------  ---------------------------
       37236 ± 3%       +2.5%      38167 ± 3%   vm-scalability.median
        0.15 ± 24%     -25.1%       0.11 ± 23%  vm-scalability.median_stddev
        0.15 ± 23%     -25.1%       0.11 ± 22%  vm-scalability.stddev
    12767318 ± 4%       +2.5%   13089177 ± 3%   vm-scalability.throughput
Thanks, Feng
Best regards Thomas
[1] https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
Hi
On 16.09.19 at 11:06, Feng Tang wrote:
Hi Thomas,
On Mon, Sep 09, 2019 at 04:12:37PM +0200, Thomas Zimmermann wrote:
Hi
On 04.09.19 at 08:27, Feng Tang wrote:
Hi Thomas,
On Wed, Aug 28, 2019 at 12:51:40PM +0200, Thomas Zimmermann wrote:
Hi
On 28.08.19 at 11:37, Rong Chen wrote:
Hi Thomas,
On 8/28/19 1:16 AM, Thomas Zimmermann wrote:
Hi
On 27.08.19 at 14:33, Chen, Rong A wrote: > Both patches have little impact on the performance from our side. Thanks for testing. Too bad they don't solve the issue.
There's another patch attached. Could you please tests this as well? Thanks a lot!
The patch comes from Daniel Vetter after discussing the problem on IRC. The idea of the patch is that the old mgag200 code might display much less frames that the generic code, because mgag200 only prints from non-atomic context. If we simulate this with the generic code, we should see roughly the original performance.
It's cool, the patch "usecansleep.patch" can fix the issue.
Thank you for testing. But don't get too excited, because the patch simulates a bug that was present in the original mgag200 code. A significant number of frames are simply skipped. That is apparently the reason why it's faster.
Thanks for the detailed info, so the original code skips time-consuming work inside atomic context on purpose. Is there any space to optmise it? If 2 scheduled update worker are handled at almost same time, can one be skipped?
We discussed ideas on IRC and decided that screen updates could be synchronized with vblank intervals. This may give some rate limiting to the output.
If you like, you could try the patch set at [1]. It adds the respective code to console and mgag200.
I just tried the 2 patches, no obvious change (comparing to the 18.8% regression), both in overall benchmark and micro-profiling.
    90f479ae51afa45e  04a0983095feaee022cdd65e3e4
    ----------------  ---------------------------
       37236 ± 3%       +2.5%      38167 ± 3%   vm-scalability.median
        0.15 ± 24%     -25.1%       0.11 ± 23%  vm-scalability.median_stddev
        0.15 ± 23%     -25.1%       0.11 ± 22%  vm-scalability.stddev
    12767318 ± 4%       +2.5%   13089177 ± 3%   vm-scalability.throughput
Thank you for testing. I wish we'd seen at least some improvement.
Best regards Thomas
Thanks, Feng
Best regards Thomas
[1] https://lists.freedesktop.org/archives/dri-devel/2019-September/234850.html
Thanks, Feng
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
Hi
On 05.08.19 at 09:02, Feng Tang wrote:
Hi Thomas,
On Sun, Aug 04, 2019 at 08:39:19PM +0200, Thomas Zimmermann wrote:
Hi
I did some further analysis on this problem and found that the blinking cursor affects performance of the vm-scalability test case.
I only have a 4-core machine, so scalability is not really testable. Yet I see the effects of running vm-scalability against drm-tip, against a revert of the mgag200 patch, and against the vmap fixes that I posted a few days ago.
After reverting the mgag200 patch, running the test as described in the report
bin/lkp run job.yaml
gives results like
2019-08-02 19:34:37 ./case-anon-cow-seq-hugetlb
2019-08-02 19:34:37 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815395225
917319627 bytes / 756534 usecs = 1184110 KB/s
917319627 bytes / 764675 usecs = 1171504 KB/s
917319627 bytes / 766414 usecs = 1168846 KB/s
917319627 bytes / 777990 usecs = 1151454 KB/s
Running the test against current drm-tip gives slightly worse results, such as.
2019-08-03 19:17:06 ./case-anon-cow-seq-hugetlb
2019-08-03 19:17:06 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 871607 usecs = 1027778 KB/s
917318700 bytes / 894173 usecs = 1001840 KB/s
917318700 bytes / 919694 usecs = 974040 KB/s
917318700 bytes / 923341 usecs = 970193 KB/s
The test puts out roughly one result per second. Strangely sending the output to /dev/null can make results significantly worse.
bin/lkp run job.yaml > /dev/null
2019-08-03 19:23:04 ./case-anon-cow-seq-hugetlb
2019-08-03 19:23:04 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 1207358 usecs = 741966 KB/s
917318700 bytes / 1210456 usecs = 740067 KB/s
917318700 bytes / 1216572 usecs = 736346 KB/s
917318700 bytes / 1239152 usecs = 722929 KB/s
I realized that there's still a blinking cursor on the screen, which I disabled with
tput civis
or alternatively
echo 0 > /sys/devices/virtual/graphics/fbcon/cursor_blink
Running the test now gives the original or even better results, such as
bin/lkp run job.yaml > /dev/null
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb
2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 659419 usecs = 1358497 KB/s
917318700 bytes / 659658 usecs = 1358005 KB/s
917318700 bytes / 659916 usecs = 1357474 KB/s
917318700 bytes / 660168 usecs = 1356956 KB/s
Rong, Feng, could you confirm this by disabling the cursor or blinking?
Glad to know this method recovers the lost performance. Rong is running the case.
Meanwhile, I have another finding: I noticed your patch changed the bpp from 24 to 32, so I made a patch to change it back to 24 and ran the case over the weekend; the -18% regression was reduced to about -5%. Could this be related?
In the original code, the fbdev console already ran with 32 bpp [1] and 16 bpp was selected for low-end devices. [2][3] The patch only set the same values for userspace; nothing changed for the console.
Best regards Thomas
[1] https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag20... [2] https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag20... [3] https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag20...
commit:
  f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console
  90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
  01e75fea0d5 mgag200: restore the depth back to 24
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
43921 ± 2% -18.3% 35884 -4.8% 41826 vm-scalability.median
14889337 -17.5% 12291029 -4.1% 14278574 vm-scalability.throughput
commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74
Author: Feng Tang <feng.tang@intel.com>
Date:   Fri Aug 2 15:09:19 2019 +0800

    mgag200: restore the depth back to 24

    Signed-off-by: Feng Tang <feng.tang@intel.com>
diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
index a977333..ac8f6c9 100644
--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
 		dev->mode_config.preferred_depth = 16;
 	else
-		dev->mode_config.preferred_depth = 32;
+		dev->mode_config.preferred_depth = 24;
 	dev->mode_config.prefer_shadow = 1;

 	r = mgag200_modeset_init(mdev);
Thanks, Feng
The difference between mgag200's original fbdev support and generic fbdev emulation is generic fbdev's worker task that updates the VRAM buffer from the shadow buffer. mgag200 does this immediately, but relies on drm_can_sleep(), which is deprecated.
I think that the worker task only interferes with this particular test case, as the worker has been part of fbdev emulation since forever and no performance regressions have been reported so far.
So unless there's a report where this problem happens in a real-world use case, I'd like to keep the code as it is. And apparently there's always the workaround of disabling the cursor blinking.
Best regards Thomas
Hi Thomas,
On Mon, Aug 05, 2019 at 12:22:11PM +0200, Thomas Zimmermann wrote:
[snip]
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb 2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406 917318700 bytes / 659419 usecs = 1358497 KB/s 917318700 bytes / 659658 usecs = 1358005 KB/s 917318700 bytes / 659916 usecs = 1357474 KB/s 917318700 bytes / 660168 usecs = 1356956 KB/s
Rong, Feng, could you confirm this by disabling the cursor or blinking?
Glad to know this method restored the drop. Rong is running the case.
While I have another finds, as I noticed your patch changed the bpp from 24 to 32, I had a patch to change it back to 24, and run the case in the weekend, the -18% regrssion was reduced to about -5%. Could this be related?
In the original code, the fbdev console already ran with 32 bpp [1] and 16 bpp was selected for low-end devices. [2][3] The patch only set the same values for userspace; nothing changed for the console.
I did the experiment because I checked the commit
90f479ae51afa4 drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation
in which there is code:
diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c
index b10f726..a977333 100644
--- a/drivers/gpu/drm/mgag200/mgag200_main.c
+++ b/drivers/gpu/drm/mgag200/mgag200_main.c
@@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags)
 	if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024))
 		dev->mode_config.preferred_depth = 16;
 	else
-		dev->mode_config.preferred_depth = 24;
+		dev->mode_config.preferred_depth = 32;
 	dev->mode_config.prefer_shadow = 1;
My debug patch basically restored this part.
Thanks, Feng
Best regards Thomas
[1] https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag20... [2] https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag20... [3] https://cgit.freedesktop.org/drm/drm-tip/tree/drivers/gpu/drm/mgag200/mgag20...
commit: f1f8555dfb9 drm/bochs: Use shadow buffer for bochs framebuffer console 90f479ae51a drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation 01e75fea0d5 mgag200: restore the depth back to 24
f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9 01e75fea0d5ff39d3e588c20ec5
43921 ± 2% -18.3% 35884 -4.8% 41826 vm-scalability.median
14889337 -17.5% 12291029 -4.1% 14278574 vm-scalability.throughput
commit 01e75fea0d5ff39d3e588c20ec52e7a4e6588a74 Author: Feng Tang feng.tang@intel.com Date: Fri Aug 2 15:09:19 2019 +0800
mgag200: restore the depth back to 24 Signed-off-by: Feng Tang <feng.tang@intel.com>
diff --git a/drivers/gpu/drm/mgag200/mgag200_main.c b/drivers/gpu/drm/mgag200/mgag200_main.c index a977333..ac8f6c9 100644 --- a/drivers/gpu/drm/mgag200/mgag200_main.c +++ b/drivers/gpu/drm/mgag200/mgag200_main.c @@ -162,7 +162,7 @@ int mgag200_driver_load(struct drm_device *dev, unsigned long flags) if (IS_G200_SE(mdev) && mdev->mc.vram_size < (2048*1024)) dev->mode_config.preferred_depth = 16; else
dev->mode_config.preferred_depth = 32;
dev->mode_config.preferred_depth = 24;> dev->mode_config.prefer_shadow = 1;
r = mgag200_modeset_init(mdev);
Thanks, Feng
The difference between mgag200's original fbdev support and generic fbdev emulation is generic fbdev's worker task that updates the VRAM buffer from the shadow buffer. mgag200 does this immediately, but relies on drm_can_sleep(), which is deprecated.
I think that the worker task interferes with the test case, as the worker has been in fbdev emulation since forever and no performance regressions have been reported so far.
So unless there's a report where this problem happens in a real-world use case, I'd like to keep code as it is. And apparently there's always the workaround of disabling the cursor blinking.
Best regards Thomas
-- Thomas Zimmermann Graphics Driver Developer SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)
Hi Feng,
do you still have the test setup that produced the performance penalty?
If so, could you give the patchset at [1] a try? I think I've fixed the remaining issues from earlier versions, and I'd like to see whether it actually improves performance.
Best regards Thomas
[1] https://lists.freedesktop.org/archives/dri-devel/2019-December/247771.html
Am 05.08.19 um 14:52 schrieb Feng Tang:
Hi Thomas,
On Mon, Aug 05, 2019 at 12:22:11PM +0200, Thomas Zimmermann wrote:
[snip]
2019-08-03 19:29:17 ./case-anon-cow-seq-hugetlb
2019-08-03 19:29:17 ./usemem --runtime 300 -n 4 --prealloc --prefault -O -U 815394406
917318700 bytes / 659419 usecs = 1358497 KB/s
917318700 bytes / 659658 usecs = 1358005 KB/s
917318700 bytes / 659916 usecs = 1357474 KB/s
917318700 bytes / 660168 usecs = 1356956 KB/s
Hi Thomas,
The previous throughput was reduced from 43955 to 35691, and there is a little increase in next-20200106, but there is no obvious change after the patchset:
commit:
  f1f8555dfb ("drm/bochs: Use shadow buffer for bochs framebuffer console")
  90f479ae51 ("drm/mgag200: Replace struct mga_fbdev with generic framebuffer emulation")

f1f8555dfb9a70a2 90f479ae51afa45efab97afdde9
---------------- ---------------------------
         %stddev      %change          %stddev
       43955 ± 2%      -18.8%            35691        vm-scalability.median
commit:
  9eb1b48ca4 ("Add linux-next specific files for 20200106")
  5f20199bac ("drm/fb-helper: Synchronize dirty worker with vblank")

   next-20200106 5f20199bac9b2de71fd2158b90
---------------- --------------------------
         %stddev       change          %stddev
       38550    38744       38549    38744        vm-scalability.median
Best Regards, Rong Chen
Hi
Am 08.01.20 um 03:25 schrieb Rong Chen:
Hi Thomas,
The previous throughput was reduced from 43955 to 35691, and there is a little increase in next-20200106, but there is no obvious change after the patchset:
OK, I would have hoped for some improvements. Anyway, thanks for testing.
Best regards Thomas