After several days uptime with a 3.16 kernel (generally running Thunderbird, emacs, kernel builds, several Chrome tabs on multiple desktop workspaces) I've been seeing some really extreme slowdowns.
Mostly the slowdowns are associated with gpu-related tasks, like opening new emacs windows, switching workspaces, laughing at internet gifs, etc. Because this x86_64 desktop is nouveau-based, I didn't pursue it right away -- 3.15 is the first time suspend has worked reliably.
This week I started looking into what the slowdown was and discovered it's happening during dma allocation through swiotlb (the cpus can do intel iommu but I don't use it because it's not the default for most users).
I'm still working on a bisection but each step takes 8+ hours to validate and even then I'm no longer sure I still have the 'bad' commit in the bisection. [edit: yup, I started over]
I just discovered a smattering of these in my logs and only on 3.16-rc+ kernels:

Sep 25 07:57:59 thor kernel: [28786.001300] alloc_contig_range test_pages_isolated(2bf560, 2bf562) failed
This dual-Xeon box has 10GB and sysrq Show Memory isn't showing heavy fragmentation [1].
Besides Mel's page allocator changes in 3.16, another suspect commit is:
commit b13b1d2d8692b437203de7a404c6b809d2cc4d99
Author: Shaohua Li <shli@kernel.org>
Date:   Tue Apr 8 15:58:09 2014 +0800
x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB
Specifically, this statement:
It could cause incorrect page aging and the (mistaken) reclaim of hot pages, but the chance of that should be relatively low.
I'm wondering if this could cause worst-case behavior with TTM? I'm testing a revert of this on mainline 3.16-final now, with no results yet.
Thoughts?
Regards, Peter Hurley
[1] SysRq : Show Memory
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
CPU 4: hi: 0, btch: 1 usd: 0
CPU 5: hi: 0, btch: 1 usd: 0
CPU 6: hi: 0, btch: 1 usd: 0
CPU 7: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 18
CPU 1: hi: 186, btch: 31 usd: 82
CPU 2: hi: 186, btch: 31 usd: 46
CPU 3: hi: 186, btch: 31 usd: 30
CPU 4: hi: 186, btch: 31 usd: 18
CPU 5: hi: 186, btch: 31 usd: 43
CPU 6: hi: 186, btch: 31 usd: 157
CPU 7: hi: 186, btch: 31 usd: 26
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 25
CPU 1: hi: 186, btch: 31 usd: 33
CPU 2: hi: 186, btch: 31 usd: 28
CPU 3: hi: 186, btch: 31 usd: 46
CPU 4: hi: 186, btch: 31 usd: 23
CPU 5: hi: 186, btch: 31 usd: 8
CPU 6: hi: 186, btch: 31 usd: 112
CPU 7: hi: 186, btch: 31 usd: 18
active_anon:382833 inactive_anon:12103 isolated_anon:0
active_file:1156997 inactive_file:733988 isolated_file:0
unevictable:15 dirty:35833 writeback:0 unstable:0
free:129383 slab_reclaimable:95038 slab_unreclaimable:11095
mapped:81924 shmem:12509 pagetables:9039 bounce:0
free_cma:0
Node 0 DMA free:15860kB min:104kB low:128kB high:156kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15960kB managed:15876kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2974 9980 9980
Node 0 DMA32 free:166712kB min:20108kB low:25132kB high:30160kB active_anon:475548kB inactive_anon:15204kB active_file:1368716kB inactive_file:865832kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3127336kB managed:3048188kB mlocked:0kB dirty:38228kB writeback:0kB mapped:94340kB shmem:15436kB slab_reclaimable:116424kB slab_unreclaimable:12756kB kernel_stack:2512kB pagetables:11532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 7006 7006
Node 0 Normal free:334960kB min:47368kB low:59208kB high:71052kB active_anon:1055784kB inactive_anon:33208kB active_file:3259272kB inactive_file:2070120kB unevictable:60kB isolated(anon):0kB isolated(file):0kB present:7340032kB managed:7174484kB mlocked:60kB dirty:105104kB writeback:0kB mapped:233356kB shmem:34600kB slab_reclaimable:263728kB slab_unreclaimable:31608kB kernel_stack:7344kB pagetables:24624kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15860kB
Node 0 DMA32: 209*4kB (UEM) 394*8kB (UEM) 303*16kB (UEM) 60*32kB (UEM) 314*64kB (UEM) 117*128kB (UEM) 9*256kB (EM) 3*512kB (UEM) 2*1024kB (EM) 2*2048kB (UM) 27*4096kB (MR) = 166404kB
Node 0 Normal: 17*4kB (UE) 460*8kB (UEM) 747*16kB (UM) 130*32kB (UEM) 521*64kB (UM) 184*128kB (UEM) 70*256kB (UM) 22*512kB (UM) 11*1024kB (UM) 2*2048kB (EM) 52*4096kB (MR) = 334292kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
1903443 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 10996456kB
Total swap = 10996456kB
2620832 pages RAM
0 pages HighMem/MovableOnly
41387 pages reserved
0 pages hwpoisoned
Hey,
On 25-09-14 20:55, Peter Hurley wrote:
After several days uptime with a 3.16 kernel (generally running Thunderbird, emacs, kernel builds, several Chrome tabs on multiple desktop workspaces) I've been seeing some really extreme slowdowns.
Mostly the slowdowns are associated with gpu-related tasks, like opening new emacs windows, switching workspaces, laughing at internet gifs, etc. Because this x86_64 desktop is nouveau-based, I didn't pursue it right away -- 3.15 is the first time suspend has worked reliably.
This week I started looking into what the slowdown was and discovered it's happening during dma allocation through swiotlb (the cpus can do intel iommu but I don't use it because it's not the default for most users).
I'm still working on a bisection but each step takes 8+ hours to validate and even then I'm no longer sure I still have the 'bad' commit in the bisection. [edit: yup, I started over]
I just discovered a smattering of these in my logs and only on 3.16-rc+ kernels: Sep 25 07:57:59 thor kernel: [28786.001300] alloc_contig_range test_pages_isolated(2bf560, 2bf562) failed
This dual-Xeon box has 10GB and sysrq Show Memory isn't showing heavy fragmentation [1].
Besides Mel's page allocator changes in 3.16, another suspect commit is:
Maybe related, but I've been seeing page corruption in nouveau as well, with 3.15.9:
http://paste.debian.net/122800/
I think it might be an even older bug, because I've been using nouveau on my desktop and it hasn't been stable for the past few releases. I'm also lazy about updating my kernel, though I still do it from time to time.
The lookup and nvapeek warnings/crashes are not important btw, I was testing some nouveau things. The linker trap probably is. After the second BUG Xorg was no longer able to recover.
But this was after various suspend/resume cycles. I suspect I've hit some corruption on radeon too (on a somewhat more recent kernel) when fiddling with vgaswitcheroo, ending up with a really massive amount of log spam there.
Unfortunately I haven't been able to find out what caused it yet, nor am I sure what debug options I should set in the kernel to debug this.
~Maarten
On Thu, 25 Sep 2014 14:55:02 -0400 Peter Hurley peter@hurleysoftware.com wrote:
After several days uptime with a 3.16 kernel (generally running Thunderbird, emacs, kernel builds, several Chrome tabs on multiple desktop workspaces) I've been seeing some really extreme slowdowns.
Mostly the slowdowns are associated with gpu-related tasks, like opening new emacs windows, switching workspaces, laughing at internet gifs, etc. Because this x86_64 desktop is nouveau-based, I didn't pursue it right away -- 3.15 is the first time suspend has worked reliably.
This week I started looking into what the slowdown was and discovered it's happening during dma allocation through swiotlb (the cpus can do intel iommu but I don't use it because it's not the default for most users).
I'm still working on a bisection but each step takes 8+ hours to validate and even then I'm no longer sure I still have the 'bad' commit in the bisection. [edit: yup, I started over]
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
/Thomas
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
/Thomas
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
BR, -R
/Thomas
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
On 09/26/2014 02:28 PM, Rob Clark wrote:
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
But can the DMA subsystem or more specifically dma_alloc_coherent() really guarantee such things? Isn't it better for such devices to use CMA directly?
/Thomas
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
BR, -R
/Thomas
On Fri, Sep 26, 2014 at 8:34 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 02:28 PM, Rob Clark wrote:
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
But can the DMA subsystem or more specifically dma_alloc_coherent() really guarantee such things? Isn't it better for such devices to use CMA directly?
Well, I was thinking more specifically about a use-case that was mentioned several times during the early CMA discussions, about video decoders/encoders which needed Y and UV split across memory banks to achieve sufficient bandwidth. I assume they must use CMA directly for this (since they'd need multiple pools per device), but not really 100% sure about that.
So perhaps, yeah, if you shunt order 0 allocations away from CMA at the DMA layer, maybe it is ok. If there actually is a valid use-case for CMA on sane hardware, then maybe this is the better way, and let the insane hw folks hack around it.
(plus, well, the use-case I was mentioning isn't really about order 0 allocations anyway)
BR, -R
/Thomas
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
BR, -R
/Thomas
On 09/26/2014 08:28 AM, Rob Clark wrote:
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
CMA has its uses on x86. For example, CMA is used to allocate 1GB huge pages.
There may also be people with devices that do not scatter-gather, and need a large physically contiguous buffer, though there should be relatively few of those on x86.
I suspect it makes most sense to do DMA allocations up to PAGE_ORDER through the normal allocator on x86, invoking CMA only for larger allocations.
--
All rights reversed
[ +cc Leann Ogasawara, Marek Szyprowski, Kyungmin Park, Arnd Bergmann ]
On 09/26/2014 08:40 AM, Rik van Riel wrote:
On 09/26/2014 08:28 AM, Rob Clark wrote:
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
CMA has its uses on x86. For example, CMA is used to allocate 1GB huge pages.
There may also be people with devices that do not scatter-gather, and need a large physically contiguous buffer, though there should be relatively few of those on x86.
I suspect it makes most sense to do DMA allocations up to PAGE_ORDER through the normal allocator on x86, invoking CMA only for larger allocations.
The code that uses CMA to satisfy DMA allocations on x86 is specific to the x86 arch and was added in 2011 as a means of _testing_ CMA in KVM:
commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
Author: Marek Szyprowski <m.szyprowski@samsung.com>
Date:   Thu Dec 29 13:09:51 2011 +0100
X86: integrate CMA with DMA-mapping subsystem
This patch adds support for CMA to dma-mapping subsystem for x86 architecture that uses common pci-dma/pci-nommu implementation. This allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
(no x86 maintainer acks?).
Unfortunately, this code is enabled whenever CMA is enabled, rather than as a separate test configuration.
So, while enabling CMA may have other purposes on x86, using it for x86 swiotlb and nommu dma allocations is not one of them.
And Ubuntu should not be enabling CONFIG_DMA_CMA for their i386 and amd64 configurations, as this is trying to drive _all_ dma mapping allocations through a _very_ small window (which is killing GPU performance).
Regards, Peter Hurley
On Fri, Sep 26, 2014 at 7:10 AM, Peter Hurley peter@hurleysoftware.com wrote:
[ +cc Leann Ogasawara, Marek Szyprowski, Kyungmin Park, Arnd Bergmann ]
On 09/26/2014 08:40 AM, Rik van Riel wrote:
On 09/26/2014 08:28 AM, Rob Clark wrote:
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
CMA has its uses on x86. For example, CMA is used to allocate 1GB huge pages.
There may also be people with devices that do not scatter-gather, and need a large physically contiguous buffer, though there should be relatively few of those on x86.
I suspect it makes most sense to do DMA allocations up to PAGE_ORDER through the normal allocator on x86, invoking CMA only for larger allocations.
The code that uses CMA to satisfy DMA allocations on x86 is specific to the x86 arch and was added in 2011 as a means of _testing_ CMA in KVM:
commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
Author: Marek Szyprowski <m.szyprowski@samsung.com>
Date:   Thu Dec 29 13:09:51 2011 +0100
X86: integrate CMA with DMA-mapping subsystem

This patch adds support for CMA to dma-mapping subsystem for x86 architecture that uses common pci-dma/pci-nommu implementation. This allows to test CMA on KVM/QEMU and a lot of common x86 boxes.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
(no x86 maintainer acks?).
Unfortunately, this code is enabled whenever CMA is enabled, rather than as a separate test configuration.
So, while enabling CMA may have other purposes on x86, using it for x86 swiotlb and nommu dma allocations is not one of them.
And Ubuntu should not be enabling CONFIG_DMA_CMA for their i386 and amd64 configurations, as this is trying to drive _all_ dma mapping allocations through a _very_ small window (which is killing GPU performance).
Thanks for the note, Peter. We do have this disabled for our upcoming Ubuntu 14.10 release. It is, however, still enabled in the previous 14.04 release. We have been tracking this in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1362261, but users able to reproduce the performance impact on 14.10 were unable to reproduce it on 14.04, which is why we hadn't yet disabled it there.
Thanks, Leann
On 09/26/2014 11:12 AM, Leann Ogasawara wrote:
On Fri, Sep 26, 2014 at 7:10 AM, Peter Hurley peter@hurleysoftware.com wrote:
[ +cc Leann Ogasawara, Marek Szyprowski, Kyungmin Park, Arnd Bergmann ]
On 09/26/2014 08:40 AM, Rik van Riel wrote:
On 09/26/2014 08:28 AM, Rob Clark wrote:
On Fri, Sep 26, 2014 at 6:45 AM, Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 12:40 PM, Chuck Ebbert wrote:
On Fri, 26 Sep 2014 09:15:57 +0200 Thomas Hellstrom thellstrom@vmware.com wrote:
On 09/26/2014 01:52 AM, Peter Hurley wrote:
On 09/25/2014 03:35 PM, Chuck Ebbert wrote:
There are six ttm patches queued for 3.16.4:
drm-ttm-choose-a-pool-to-shrink-correctly-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-handling-of-ttm_pl_flag_topdown-v2.patch
drm-ttm-fix-possible-division-by-0-in-ttm_dma_pool_shrink_scan.patch
drm-ttm-fix-possible-stack-overflow-by-recursive-shrinker-calls.patch
drm-ttm-pass-gfp-flags-in-order-to-avoid-deadlock.patch
drm-ttm-use-mutex_trylock-to-avoid-deadlock-inside-shrinker-functions.patch
Thanks for info, Chuck.
Unfortunately, none of these fix TTM dma allocation doing CMA dma allocation, which is the root problem.
Regards, Peter Hurley
The problem is not really in TTM but in CMA. There was a guy offering to fix this in the CMA code, but I guess he didn't, probably because he didn't receive any feedback.
Yeah, the "solution" to this problem seems to be "don't enable CMA on x86". Maybe it should even be disabled in the config system.
Or, as previously suggested, don't use CMA for order 0 (single page) allocations....
On devices that actually need CMA pools to arrange for memory to be in certain ranges, I think you probably do want to have order 0 pages come from the CMA pool.
Seems like disabling CMA on x86 (where it should be unneeded) is the better way, IMO
CMA has its uses on x86. For example, CMA is used to allocate 1GB huge pages.
There may also be people with devices that do not scatter-gather, and need a large physically contiguous buffer, though there should be relatively few of those on x86.
I suspect it makes most sense to do DMA allocations up to PAGE_ORDER through the normal allocator on x86, invoking CMA only for larger allocations.
The code that uses CMA to satisfy DMA allocations on x86 is specific to the x86 arch and was added in 2011 as a means of _testing_ CMA in KVM:
commit 0a2b9a6ea93650b8a00f9fd5ee8fdd25671e2df6
Author: Marek Szyprowski <m.szyprowski@samsung.com>
Date:   Thu Dec 29 13:09:51 2011 +0100
X86: integrate CMA with DMA-mapping subsystem

This patch adds support for CMA to dma-mapping subsystem for x86 architecture that uses common pci-dma/pci-nommu implementation. This allows to test CMA on KVM/QEMU and a lot of common x86 boxes.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
(no x86 maintainer acks?).
Unfortunately, this code is enabled whenever CMA is enabled, rather than as a separate test configuration.
So, while enabling CMA may have other purposes on x86, using it for x86 swiotlb and nommu dma allocations is not one of them.
And Ubuntu should not be enabling CONFIG_DMA_CMA for their i386 and amd64 configurations, as this is trying to drive _all_ dma mapping allocations through a _very_ small window (which is killing GPU performance).
Thanks for the note Peter. We do have this disabled for our upcoming Ubuntu 14.10 release. It is however still enabled in the previous 14.04 release. We have been tracking this in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1362261 but users able to reproduce performance impacts in 14.10 were unable to reproduce in 14.04 which is why we hadn't yet disabled it there.
Leann,
Thanks for that important clue.
The missing piece specific to 3.16+ is this patch series, which impacts every iommu config:
Akinobu Mita (5):
  x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled
  x86: enable DMA CMA with swiotlb
  intel-iommu: integrate DMA CMA
  memblock: introduce memblock_alloc_range()
  cma: add placement specifier for "cma=" kernel parameter
These patches take the pre-existing nommu CMA test configuration and hook it up to all the x86 iommus, effectively reducing 10GB of DMA-able memory to 64MB, and route every dma allocation through an allocator that's not nearly as effective as the page allocator.
All to enable DMA allocation below 4GB, which is already supported by passing the GFP_DMA32 flag to dma_alloc_coherent().
Regards, Peter Hurley
On Thu, Sep 25, 2014 at 2:55 PM, Peter Hurley peter@hurleysoftware.com wrote:
You may also be seeing this: https://lkml.org/lkml/2014/8/8/445
Alex
On 09/25/2014 04:33 PM, Alex Deucher wrote:
On Thu, Sep 25, 2014 at 2:55 PM, Peter Hurley peter@hurleysoftware.com wrote:
You may also be seeing this: https://lkml.org/lkml/2014/8/8/445
Thanks Alex. That is indeed the problem.
I'm still reading the email thread to find out where the patches that fix this are. Although it doesn't make much sense to me that nouveau sets up a 1GB GART and then uses TTM, which tries to shove all the DMA through a 16MB CMA window (which turns out to be the base Ubuntu config).
Regards, Peter Hurley
On 25-09-14 at 23:10, Peter Hurley wrote:
Thanks Alex. That is indeed the problem.
Still reading the email thread to find out where the patches are that fix this. Although it doesn't make much sense to me that nouveau sets up a 1GB GART and then uses TTM which is trying to shove all the DMA through a 16MB CMA window (which turns out to be the base Ubuntu config).
Regards, Peter Hurley
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1362261
CMA's already disabled on x86 in most recent ubuntu kernels. :-)
~Maarten
On 09/25/2014 02:55 PM, Peter Hurley wrote:
I just discovered a smattering of these in my logs and only on 3.16-rc+ kernels: Sep 25 07:57:59 thor kernel: [28786.001300] alloc_contig_range test_pages_isolated(2bf560, 2bf562) failed
This dual-Xeon box has 10GB and sysrq Show Memory isn't showing heavy fragmentation [1].
It's swapping, which is crazy because there's 7+GB of file cache [1] which should be dropped before swapping.
The alloc_contig_range() failure precedes the swapping but not immediately (44 mins. earlier).
How I reproduce this is to simply do a full distro kernel build. Skipping the TLB flush is not the problem; the results below are from 3.16-final with that commit reverted.
The slowdown is really obvious because workspace switching redraw takes multiple seconds to complete (all-cpu perf record of that below [2])
Regards, Peter Hurley
[1] SysRq : Show Memory
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
CPU    4: hi:    0, btch:   1 usd:   0
CPU    5: hi:    0, btch:   1 usd:   0
CPU    6: hi:    0, btch:   1 usd:   0
CPU    7: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  71
CPU    1: hi:  186, btch:  31 usd: 166
CPU    2: hi:  186, btch:  31 usd: 183
CPU    3: hi:  186, btch:  31 usd: 109
CPU    4: hi:  186, btch:  31 usd: 106
CPU    5: hi:  186, btch:  31 usd: 161
CPU    6: hi:  186, btch:  31 usd: 120
CPU    7: hi:  186, btch:  31 usd:  54
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 159
CPU    1: hi:  186, btch:  31 usd:  66
CPU    2: hi:  186, btch:  31 usd: 178
CPU    3: hi:  186, btch:  31 usd: 173
CPU    4: hi:  186, btch:  31 usd:  91
CPU    5: hi:  186, btch:  31 usd:  57
CPU    6: hi:  186, btch:  31 usd:  58
CPU    7: hi:  186, btch:  31 usd: 158
active_anon:170368 inactive_anon:173964 isolated_anon:0
 active_file:982209 inactive_file:973911 isolated_file:0
 unevictable:15 dirty:15 writeback:1 unstable:0
 free:96067 slab_reclaimable:107401 slab_unreclaimable:12572
 mapped:58271 shmem:10857 pagetables:9898 bounce:0
 free_cma:18
Node 0 DMA free:15860kB min:104kB low:128kB high:156kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15960kB managed:15876kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2974 9980 9980
Node 0 DMA32 free:117740kB min:20108kB low:25132kB high:30160kB active_anon:205232kB inactive_anon:196308kB active_file:1186764kB inactive_file:1173760kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3127336kB managed:3048212kB mlocked:0kB dirty:24kB writeback:4kB mapped:71600kB shmem:8776kB slab_reclaimable:129132kB slab_unreclaimable:13468kB kernel_stack:2864kB pagetables:11536kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 7006 7006
Node 0 Normal free:250668kB min:47368kB low:59208kB high:71052kB active_anon:476240kB inactive_anon:499548kB active_file:2742072kB inactive_file:2721884kB unevictable:60kB isolated(anon):0kB isolated(file):0kB present:7340032kB managed:7174484kB mlocked:60kB dirty:36kB writeback:0kB mapped:161484kB shmem:34652kB slab_reclaimable:300472kB slab_unreclaimable:36804kB kernel_stack:7232kB pagetables:28056kB unstable:0kB bounce:0kB free_cma:72kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15860kB
Node 0 DMA32: 4099*4kB (UEM) 4372*8kB (UEM) 668*16kB (UEM) 294*32kB (UEM) 47*64kB (UEM) 24*128kB (UEM) 19*256kB (UM) 5*512kB (UM) 0*1024kB 6*2048kB (M) 5*4096kB (M) = 117740kB
Node 0 Normal: 22224*4kB (UEMC) 8120*8kB (UEMC) 1594*16kB (UEMC) 301*32kB (UEMC) 154*64kB (UMC) 106*128kB (UEMC) 86*256kB (UMC) 13*512kB (UEMC) 3*1024kB (M) 3*2048kB (M) 1*4096kB (R) = 254400kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
1967616 total pagecache pages
637 pages in swap cache
Swap cache stats: add 9593, delete 8956, find 2162/2929
Free swap  = 10966116kB
Total swap = 10996456kB
2620832 pages RAM
0 pages HighMem/MovableOnly
41387 pages reserved
0 pages hwpoisoned
[2] 'perf record -a -g' while workspace switching
Samples: 27K of event 'cycles', Event count (approx.): 15793770075
+  87.11%  0.00%  Xorg  [kernel.kallsyms]  [k] tracesys
+  86.81%  0.00%  Xorg  [unknown]          [k] 0x00007fca2df34e77
+  86.77%  0.00%  Xorg  [kernel.kallsyms]  [k] sys_ioctl
+  86.77%  0.00%  Xorg  [kernel.kallsyms]  [k] do_vfs_ioctl
+  86.77%  0.00%  Xorg  [nouveau]          [k] nouveau_drm_ioctl
+  86.77%  0.00%  Xorg  [drm]              [k] drm_ioctl
+  86.66%  0.00%  Xorg  [nouveau]          [k] nouveau_gem_ioctl_new
+  86.50%  0.00%  Xorg  [nouveau]          [k] nouveau_gem_new
+  86.49%  0.00%  Xorg  [nouveau]          [k] nouveau_bo_new
+  86.48%  0.00%  Xorg  [ttm]              [k] ttm_bo_init
+  86.47%  0.00%  Xorg  [ttm]              [k] ttm_bo_validate
+  86.46%  0.00%  Xorg  [ttm]              [k] ttm_bo_handle_move_mem
+  86.45%  0.00%  Xorg  [ttm]              [k] ttm_tt_bind
+  86.45%  0.00%  Xorg  [nouveau]          [k] nouveau_ttm_tt_populate
+  86.45%  0.00%  Xorg  [ttm]              [k] ttm_dma_populate
+  86.43%  0.01%  Xorg  [ttm]              [k] ttm_dma_pool_alloc_new_pages
+  86.42%  0.00%  Xorg  [kernel.kallsyms]  [k] x86_swiotlb_alloc_coherent
+  86.37%  0.00%  Xorg  [kernel.kallsyms]  [k] dma_generic_alloc_coherent
+  86.19%  0.00%  Xorg  [unknown]          [k] 0x0000000000c00000
+  85.82%  0.31%  Xorg  [kernel.kallsyms]  [k] dma_alloc_from_contiguous
+  84.21%  1.05%  Xorg  [kernel.kallsyms]  [k] alloc_contig_range
+  46.56% 46.56%  Xorg  [kernel.kallsyms]  [k] move_freepages
+  46.53%  0.29%  Xorg  [kernel.kallsyms]  [k] move_freepages_block
+  39.78%  0.13%  Xorg  [kernel.kallsyms]  [k] start_isolate_page_range
+  39.22%  0.40%  Xorg  [kernel.kallsyms]  [k] set_migratetype_isolate
+  27.26%  0.17%  Xorg  [kernel.kallsyms]  [k] undo_isolate_page_range
+  26.62%  0.33%  Xorg  [kernel.kallsyms]  [k] unset_migratetype_isolate
+  15.54%  7.93%  Xorg  [kernel.kallsyms]  [k] drain_all_pages