Re: AMD graphics performance regression in 4.15 and later

23 Apr 2018


On 2018-04-20 09:40 PM, Felix Kuehling wrote:
...
On 2018-04-20 10:47 AM, Michel Dänzer wrote:
...
On 2018-04-11 11:37 AM, Christian König wrote:
...
On 11.04.2018 at 06:00, Gabriel C wrote:
...
2018-04-09 11:42 GMT+02:00 Christian König
ckoenig.leichtzumerken@gmail.com:
...
On 07.04.2018 at 00:00, Jean-Marc Valin wrote:
...
Hi Christian,
Thanks for the info. FYI, I've also opened a Firefox bug for that at:
https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
Feel free to comment since you have a better understanding of what's
going on.
One last question: right now I'm running 4.15.0 with the "offending"
patch reverted. Is that safe to run, or are there possible bad
interactions with other changes?
That should work without problems.
But I just had another idea as well: if you want, you could still test
the new code path which will be used in 4.17.
While Firefox may do some strange things, this is not only about Firefox.
With your patches my EPYC box is unusable with 4.15++ kernels.
The whole desktop is acting weird. This one is using
a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
The box is 2 * EPYC 7281 with 128 GB ECC RAM.
Also, a 14C Xeon box with a HD7700 is broken the same way.
The hardware is irrelevant for this. We need to know what software stack
you use on top of it.
E.g. desktop environment/Mesa and DDX version etc...
...
Everything breaks in X: scrolling, moving windows, flickering etc.
Reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and
648bc3574716400acc06f99915815f80d9563783
from a 4.15 kernel makes things work again.
...
Backporting all the detection logic is too invasive, but you could just go
into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcefully use the other
code path.
Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
Well, you really can't be serious about these suggestions? Are you?
Telling people to #if 0 random code is not a solution.
That is for testing and not a permanent solution.
...
You broke existing working userland with your patches, so at least
please fix that for 4.16.
I can help testing code for 4.17/++ if you wish but that is a
*different* story.
Please test Alex's amd-staging-drm-next branch from
git://people.freedesktop.org/~agd5f/linux.
I think we're still missing something here.
I'm currently running 4.16.2 + the DRM subsystem changes which are going
into 4.17 (so I have the changes Christian is referring to) with a
Kaveri APU, and I'm seeing similar symptoms as described by Jean-Marc.
Some observations:
Firefox, Thunderbird, or worst, gnome-shell, can freeze for up to on the
order of a minute, during which the kernel is spending most of one
core's cycles inside alloc_pages (__alloc_pages_nodemask to be more
precise), called from ttm_alloc_new_pages.
Philip debugged a similar problem with a KFD memory stress test about
two weeks ago, where the kernel was seemingly stuck in an infinite loop
trying to allocate huge pages. I'm pasting his analysis for the record:
...
[...] it uses huge_flags GFP_TRANSHUGE to call alloc_pages(), this
seems a corner case inside __alloc_pages_slowpath(), it never exits
but goes to retry path every time. It can reclaim pages and
did_some_progress (as a result, no_progress_loops is reset to 0 every
loop, never reach MAX_RECLAIM_RETRIES) but cannot finish huge page
allocations under this specific memory pressure.
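For readers following along, the retry accounting described in that analysis can be sketched as a toy userspace model. This is not kernel code: toy_slowpath_attempts() and safety_cap are hypothetical names invented for illustration, the huge page allocation is simply assumed to fail on every attempt, and MAX_RECLAIM_RETRIES is assumed to be 16 as in mainline kernels of that period.

```c
#include <stdbool.h>

/* Assumed to match the kernel's retry limit in mainline of that period. */
#define MAX_RECLAIM_RETRIES 16

/*
 * Toy model (hypothetical helper, not kernel code) of the accounting
 * described above. The huge page allocation is assumed to fail on every
 * attempt. Returns the number of attempts made before the loop gives up,
 * or safety_cap if it never gives up on its own.
 */
int toy_slowpath_attempts(bool reclaim_makes_progress, int safety_cap)
{
    int no_progress_loops = 0;
    int attempts = 0;

    while (attempts < safety_cap) {
        attempts++;

        if (reclaim_makes_progress)
            no_progress_loops = 0;  /* any progress resets the counter */
        else
            no_progress_loops++;

        /* The only exit condition in this model: too many fruitless loops. */
        if (no_progress_loops > MAX_RECLAIM_RETRIES)
            break;
    }
    return attempts;
}
```

In this model, a caller whose reclaim never makes progress gives up after a bounded number of attempts, while a caller whose reclaim always frees something (but never enough for a huge page) only stops at the artificial safety_cap. That is the "never exits" behaviour the analysis describes.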
As a workaround to unblock our release branch testing we removed
transparent huge page allocation from ttm_get_pages. We're seeing this
as far back as 4.13 on our release branch.
Thanks for sharing this. In the future, please raise issues like this on
the public mailing lists from the beginning.
...
If we're really talking about the same problem, I don't think it's
caused by recent page allocator changes, but rather exposed by recent
TTM changes.
It sounds related, but probably not exactly the same problem. I already
had the TTM code using GFP_TRANSHUGE before I ran into the issue. Also,
__alloc_pages_slowpath eventually succeeds for me, it can just take up
to about a minute.
I'm currently testing using (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY)
instead of GFP_TRANSHUGE in TTM.
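To illustrate why that flag change matters, here is a minimal userspace sketch of how the two masks relate. The TOY_* bit values and the toy_may_direct_reclaim() helper are placeholders made up for this example; the real definitions live in include/linux/gfp.h. The one relationship assumed from the kernel headers of that era is that GFP_TRANSHUGE is GFP_TRANSHUGE_LIGHT plus __GFP_DIRECT_RECLAIM, and that the LIGHT variant has the reclaim bits masked out.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration only, not the kernel's. */
#define TOY_DIRECT_RECLAIM 0x1u
#define TOY_KSWAPD_RECLAIM 0x2u
#define TOY_NORETRY        0x4u
#define TOY_OTHER_THP_BITS 0x8u  /* stands in for all remaining THP bits */

#define TOY_RECLAIM (TOY_DIRECT_RECLAIM | TOY_KSWAPD_RECLAIM)

/* LIGHT: the reclaim bits are masked out. */
#define TOY_TRANSHUGE_LIGHT ((TOY_OTHER_THP_BITS | TOY_RECLAIM) & ~TOY_RECLAIM)

/* Full variant: LIGHT plus permission to enter direct reclaim. */
#define TOY_TRANSHUGE (TOY_TRANSHUGE_LIGHT | TOY_DIRECT_RECLAIM)

/* True if a request with these flags may block in direct reclaim. */
bool toy_may_direct_reclaim(unsigned int flags)
{
    return (flags & TOY_DIRECT_RECLAIM) != 0;
}

/* True if the request asks the allocator not to retry. */
bool toy_is_noretry(unsigned int flags)
{
    return (flags & TOY_NORETRY) != 0;
}
```

Under these assumptions, TOY_TRANSHUGE may enter direct reclaim, whereas (TOY_TRANSHUGE_LIGHT | TOY_NORETRY) may not and additionally asks the allocator not to retry, so a failed huge page attempt should return quickly and leave the caller free to fall back to smaller pages.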
-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer
