https://bugzilla.kernel.org/show_bug.cgi?id=200387
Bug ID: 200387
Summary: amdgpu uses unusually high memory
Product: Drivers
Version: 2.5
Kernel Version: 4.16.18, 4.17.3, 4.18.0-rc2
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: felix@feldspaten.org
Regression: No
Created attachment 277107 --> https://bugzilla.kernel.org/attachment.cgi?id=277107&action=edit Config file for 4.17.3.
Hi there,
I'm experiencing out-of-memory issues while running Cities: Skylines with the amdgpu driver. Starting a new game causes a complete system freeze on any kernel that uses the amdgpu driver, whereas a rather old kernel using the amdgpu-pro driver works. The memory in question is system main memory, not GPU memory.
System details:
I'm running Ubuntu Mate 16.04 with a custom-built 4.17.3 kernel (config attached)
AMD FX-8350
32 GB RAM
Radeon RX470
Sample main memory usage:
Kernel 4.4 with amdgpu-pro driver - RAM usage after 1 minute: 2.4 GB
Kernel 4.17.3 with amdgpu driver - RAM usage after 1 minute: 13 GB
Kernel 4.16.18 with amdgpu driver - RAM usage after 1 minute: 13 GB
Kernel 4.18.0-rc2 with amdgpu driver - RAM usage after 1 minute: 13 GB
I get similar results when running Stardew Valley (a factor-of-two difference, clearly measurable).
Find attached the config file for the 4.17.3 kernel. The other kernels have been built using this config file and the default suggestions for any unconfigured parameter.
Greetings, Felix
Alex Deucher (alexdeucher@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com
--- Comment #1 from Alex Deucher (alexdeucher@gmail.com) --- Please attach your dmesg output and xorg log if using X.
--- Comment #2 from Michel Dänzer (michel@daenzer.net) --- Also, please attach the output of
free
and of top after pressing shift-M, both captured while RAM usage is high.
--- Comment #3 from phoenix (felix@feldspaten.org) --- Yeah, I'll post the requested files today after I get home.
--- Comment #4 from phoenix (felix@feldspaten.org) --- Created attachment 277121 --> https://bugzilla.kernel.org/attachment.cgi?id=277121&action=edit dmesg Kernel 4.17.3
--- Comment #5 from phoenix (felix@feldspaten.org) --- Created attachment 277123 --> https://bugzilla.kernel.org/attachment.cgi?id=277123&action=edit dmesg on Kernel 4.4.0
--- Comment #6 from phoenix (felix@feldspaten.org) --- Created attachment 277125 --> https://bugzilla.kernel.org/attachment.cgi?id=277125&action=edit Xorg Log on 4.17.3
--- Comment #7 from phoenix (felix@feldspaten.org) --- Created attachment 277127 --> https://bugzilla.kernel.org/attachment.cgi?id=277127&action=edit Xorg log on 4.4.0
--- Comment #8 from phoenix (felix@feldspaten.org) --- Created attachment 277129 --> https://bugzilla.kernel.org/attachment.cgi?id=277129&action=edit Free and stats of the two Kernels
Contains free and the /proc/$ID/stat and /proc/$ID/statm output of the two Kernel versions
--- Comment #9 from phoenix (felix@feldspaten.org) --- Created attachment 277131 --> https://bugzilla.kernel.org/attachment.cgi?id=277131&action=edit Output of top of the problematic process on the two Kernels
Truncated output of top of the problematic process on the two kernels
--- Comment #10 from phoenix (felix@feldspaten.org) --- I uploaded all the requested files. Interestingly the output of top and statm of the process has comparable values except for the data stack (see file stats)
Virtual, resident and shared memory are comparable.
If you need any further data don't hesitate to ask. Thank you
Christian König (christian.koenig@amd.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |christian.koenig@amd.com
--- Comment #11 from Christian König (christian.koenig@amd.com) --- You could also try to compile your kernel with kmemleak enabled.
--- Comment #12 from phoenix (felix@feldspaten.org) --- I'm rebuilding the kernel and checking a possible memory leak with kmemleak.
--- Comment #13 from phoenix (felix@feldspaten.org) --- Having some problems setting up kmemleak at the moment. I'll test and check tomorrow
--- Comment #14 from Michel Dänzer (michel@daenzer.net) --- Another possibility would be narrowing down where between 4.4 and 4.16 this started happening, and eventually bisecting.
--- Comment #15 from phoenix (felix@feldspaten.org) --- Created attachment 277135 --> https://bugzilla.kernel.org/attachment.cgi?id=277135&action=edit kmemleak output of Cities.x64
I was finally able to create a kmemleak output and cropped it to the relevant output coming from the affected program.
I hope this is helpful.
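For anyone repeating this step, cropping a kmemleak report to one program can be scripted. The following is only a sketch: it assumes each record begins with an "unreferenced object" line and carries a `comm "<name>"` field, as in the report format described in the kernel's kmemleak documentation. The function name `filter_kmemleak` and the sample text are made up for illustration and are not taken from the attachment.

```python
def filter_kmemleak(report, comm):
    """Return only the kmemleak records whose 'comm' field equals `comm`."""
    records, current = [], []
    for line in report.splitlines():
        # Assumption: every record starts with an "unreferenced object" line.
        if line.startswith("unreferenced object"):
            if current:
                records.append(current)
            current = [line]
        elif current:
            current.append(line)
    if current:
        records.append(current)

    wanted = 'comm "%s"' % comm
    kept = [rec for rec in records if any(wanted in entry for entry in rec)]
    return "\n".join("\n".join(rec) for rec in kept)


# Invented sample input (placeholder addresses and symbols, not real data).
sample = '''unreferenced object 0xffff880000000001 (size 64):
  comm "Cities.x64", pid 1234, jiffies 4294900000 (age 10.000s)
  backtrace:
    [<0000000000000000>] example_alloc+0x0/0x0
unreferenced object 0xffff880000000002 (size 128):
  comm "Xorg", pid 900, jiffies 4294800000 (age 110.000s)
  backtrace:
    [<0000000000000000>] example_alloc+0x0/0x0
'''

print(filter_kmemleak(sample, "Cities.x64"))
```

On a kernel built with CONFIG_DEBUG_KMEMLEAK=y, the real report would be read (as root) from /sys/kernel/debug/kmemleak after writing "scan" to that same file.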
--- Comment #16 from Christian König (christian.koenig@amd.com) --- (In reply to phoenix from comment #15)
> I hope this is helpful.
Unfortunately not really, the only thing in there is a known issue with the IOVA cache.
Can you try to bisect as Michel suggested?
--- Comment #17 from phoenix (felix@feldspaten.org) --- Sure, I'm going to work through the different kernel versions, but that is gonna take me some time (I have to do this in my spare time).
I'll post my progress and findings, when available.
--- Comment #18 from Michel Dänzer (michel@daenzer.net) --- Does the memory usage go back down when you quit the game? Or when you restart X? Or never?
--- Comment #19 from phoenix (felix@feldspaten.org) --- The memory usage goes down immediately once the game quits. No X restart necessary.
--- Comment #20 from Michel Dänzer (michel@daenzer.net) --- In that case, the output of running the game in
valgrind --leak-check=full
might be interesting.
--- Comment #21 from phoenix (felix@feldspaten.org) --- Yep, I'll have a look this evening. Maybe I can reproduce the issue with another program as well, to rule out a problem specific to a single userland program.
--- Comment #22 from phoenix (felix@feldspaten.org) --- Apparently it's not that easy to attach valgrind to a Steam game, so I'm following the suggested approach of trying different kernel versions.
Interestingly, I could observe similar behaviour in Stardew Valley but not in Kerbal Space Program, as the following statm output shows:
## /proc/$ID/statm for Stardew Valley (Similar problem see the data segment)
# statm for 4917 on 4.17.3
978381 424915 23927 849 0 449695 0
# statm for 4370 on 4.4.0
979917 418188 23774 849 0 874146 0

## /proc/$ID/statm for Kerbal Space Program (Problem does not occur)
# statm of 5419 on 4.4.0
532753 381415 19974 7863 0 446822 0
# statm of on 4.17.3
529142 389210 19754 7863 0 441862 0
I'm investigating using different Kernel versions and maybe I'm able to write a simple OpenGL program that triggers the problem.
--- Comment #23 from Michel Dänzer (michel@daenzer.net) --- (In reply to phoenix from comment #22)
Did you swap these numbers? The only significant difference is the data size (second to last number), but the 4.4 number is bigger by ~400MB.
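For reference, the statm fields are counted in pages rather than bytes (see proc(5)), so the size of the gap depends on the page size. A minimal sketch of the conversion, assuming 4 KiB pages (the usual x86-64 default, not stated anywhere in this report); `parse_statm` is a hypothetical helper name:

```python
# Field order as documented in proc(5) for /proc/[pid]/statm; values are pages.
STATM_FIELDS = ("size", "resident", "shared", "text", "lib", "data", "dt")


def parse_statm(line, page_size=4096):
    """Convert one statm line into a dict of byte counts.

    page_size=4096 is an assumption; on the machine itself,
    os.sysconf("SC_PAGE_SIZE") would give the real value.
    """
    pages = [int(tok) for tok in line.split()]
    if len(pages) != len(STATM_FIELDS):
        raise ValueError("expected 7 statm fields, got %d" % len(pages))
    return {name: count * page_size for name, count in zip(STATM_FIELDS, pages)}


# Stardew Valley values quoted in comment #22.
on_4_17 = parse_statm("978381 424915 23927 849 0 449695 0")
on_4_4 = parse_statm("979917 418188 23774 849 0 874146 0")

data_gap_pages = 874146 - 449695
data_gap_mib = (on_4_4["data"] - on_4_17["data"]) / 2**20
print(data_gap_pages, "pages =", round(data_gap_mib), "MiB")
# prints: 424451 pages = 1658 MiB
```

Under that page-size assumption the data segment on 4.4 is larger by roughly 1.6 GiB; the figure of about 400 is the difference in thousands of pages.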
--- Comment #24 from phoenix (felix@feldspaten.org) --- Hi Michel,
weirdly not. I just double-checked them, and in Stardew Valley the 4.4 number really is the one that is 400 MB bigger. For now I'm gonna work through the kernel versions first, before we end up working on two things at the same time.
--- Comment #25 from phoenix (felix@feldspaten.org) --- Created attachment 277153 --> https://bugzilla.kernel.org/attachment.cgi?id=277153&action=edit Memory usage measurements for different programs using Kernel 4.9.111 and 4.15.0-24
Ok, I've tested the issue using Kernel 4.15.0-24-generic (Shipped with Ubuntu Mate) using the amdgpu driver and a 4.9.111 Kernel using the amdgpu-pro driver (17.40).
Sadly, building the amdgpu-pro driver for kernel linux-4.14.53 failed, so I couldn't test that one.
The issue also occurs with the 4.15.0-24-generic kernel, while the 4.9.111 kernel has significantly lower main memory requirements when running Cities: Skylines.
Also, I found out that neither the output of mstat nor proc shows significant differences in the processes between the kernel versions. So as of now the only accessible metric for measuring the memory usage is the output of 'free'.
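Since 'free' is the only metric that shows the difference, the same numbers can be sampled directly from /proc/meminfo, which is where free gets them. The sketch below makes one assumption that should be checked against the installed procps version: "used" is computed as total minus free, buffers and cache, with cache counted as Cached + SReclaimable. The function names and the sample numbers are illustrative only and are not taken from MemUsage.txt.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def used_kib(info):
    """Approximate the 'used' column of free, in kB.

    Assumption: cache = Cached + SReclaimable; the exact formula
    differs between procps versions.
    """
    cache = info.get("Cached", 0) + info.get("SReclaimable", 0)
    return info["MemTotal"] - info["MemFree"] - info.get("Buffers", 0) - cache


# Illustrative input with made-up values.
sample = """MemTotal:        1000 kB
MemFree:          400 kB
Buffers:           50 kB
Cached:           200 kB
SReclaimable:      50 kB
"""

print(used_kib(parse_meminfo(sample)))  # prints 300
```

On a live system the text would come from reading /proc/meminfo once before starting the game and again after one minute, so that both kernels are compared at the same point.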
In addition, I could observe the same memory issue (but without a system freeze) in Civilization: Beyond Earth using the above-mentioned kernel versions. That program is a more suitable test case than a rather low-resource program like Stardew Valley.
Find attached the text file MemUsage.txt with my current measurements.
Attaching Valgrind to a Steam game is non-trivial. Do you still think that it would give us meaningful insights? I can work it out, but I fear that it will soon go beyond my available time. Still, I can give it a shot.
--- Comment #26 from Michel Dänzer (michel@daenzer.net) --- (In reply to phoenix from comment #25)
BTW, ideally you should only test with the kernel's own amdgpu driver, not with amdgpu-pro, because the latter uses its own copies of core DRM and even some core kernel code, and has other modifications compared to the stock driver.
--- Comment #27 from Christian König (christian.koenig@amd.com) --- (In reply to Michel Dänzer from comment #26)
To be even more precise, I'm not sure that this is actually a kernel problem, or just caused by some mix-up between the amdgpu-pro driver and the upstream driver.
So testing on a clean install could yield some more results.
--- Comment #28 from phoenix (felix@feldspaten.org) --- Hi Michel, Hi Christian,
that makes sense, I'll test it in a clean environment. Sorry, I should have done that in the first place :-/
--- Comment #29 from phoenix (felix@feldspaten.org) --- I'm a bit busy at the moment, hope that I will find time on the weekend to further investigate!
phoenix (felix@feldspaten.org) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |OBSOLETE
--- Comment #30 from phoenix (felix@feldspaten.org) --- Finally had time to investigate.
The bug doesn't appear on a fresh install of Ubuntu 16.04 using the 4.17.3 kernel with the configuration posted above. So apparently Christian was right and it was a weird mix-up between the amdgpu-pro and the upstream driver.
I'm marking the bug as Resolved -> Obsolete, because it was indeed just a zombie, a relic of an ancient installation :-) I should have checked on a fresh install in the first place.
Anyway, thank you very much for the support and help, and I wish you a pleasant rest of your Sunday (or a good start into the week).
Greetings, Felix
dri-devel@lists.freedesktop.org