Re: GEM memory DOS (WAS Re: [PATCH 3/3] drm/ttm: under memory pressure minimize the size of memory pool)

13 Aug 2014


      On Wed, Aug 13, 2014 at 12:30:45PM -0400, Alex Deucher wrote:
...
On Wed, Aug 13, 2014 at 12:24 PM, Daniel Vetter daniel@ffwll.ch wrote:
...
On Wed, Aug 13, 2014 at 05:13:56PM +0200, Thomas Hellstrom wrote:
...
On 08/13/2014 03:01 PM, Daniel Vetter wrote:
...
On Wed, Aug 13, 2014 at 02:35:52PM +0200, Thomas Hellstrom wrote:
...
On 08/13/2014 12:42 PM, Daniel Vetter wrote:
...
On Wed, Aug 13, 2014 at 11:06:25AM +0200, Thomas Hellstrom wrote:
> On 08/13/2014 05:52 AM, Jérôme Glisse wrote:
>> From: Jérôme Glisse jglisse@redhat.com
>>
>> When experiencing memory pressure we want to minimize pool size so that
>> memory we just shrunk is not added back again right away.
>>
>> This will divide by 2 the maximum pool size for each device each time
>> the pool has to shrink. The limit is bumped again if the next allocation
>> happens more than one second after the last shrink. The one second delay is
>> obviously an arbitrary choice.
> Jérôme,
>
> I don't like this patch. It adds extra complexity and its usefulness is
> highly questionable.
> There are a number of caches in the system, and if all of them added
> some sort of voluntary shrink heuristics like this, we'd end up with
> impossible-to-debug unpredictable performance issues.
>
> We should let the memory subsystem decide when to reclaim pages from
> caches and what caches to reclaim them from.
Yeah, artificially limiting your cache from growing when your shrinker
gets called will just break the equal-memory pressure the core mm uses to
rebalance between all caches when workload changes. In i915 we let
everything grow without artificial bounds and only rely upon the shrinker
callbacks to ensure we don't consume more than our fair share of available
memory overall.
-Daniel
Now that you bring up i915 memory usage, Daniel,
I can't refrain from bringing up the old user-space unreclaimable kernel
memory issue, for which gem open is a good example ;) Each time
user-space opens a gem handle, some un-reclaimable kernel memory is
allocated, for which there is no accounting, so theoretically I think a
user can bring a system to unusability this way.
Typically there are various limits on unreclaimable objects like this,
like open file descriptors, and IIRC the kernel even has an internal
limit on the number of struct files you initialize, based on the
available system memory, so dma-buf / prime should already have some
sort of protection.
Oh yeah, we have zero cgroups limits or similar stuff for gem allocations,
so there's not really a way to isolate gpu memory usage in a sane way for
specific processes. But there's also zero limits on actual gpu usage
itself (timeslices or whatever) so I guess no one asked for this yet.
In its simplest form (like in TTM if correctly implemented by drivers)
this type of accounting stops non-privileged malicious GPU-users from
exhausting all system physical memory causing grief for other kernel
systems but not from causing grief for other GPU users. I think that's
the minimum level that's also intended, for example, for the struct
file accounting.
I think in i915 we're fairly close on that minimal standard - interactions
with shrinkers and oom logic work decently. It starts to fall apart though
when we've actually run out of memory - if the real memory hog is a gpu
process the oom killer won't notice all that memory since it's not
accounted against processes correctly.
I don't agree that gpu processes should be punished in general compared to
other subsystems in the kernel. If the user wants to use 90% of all memory
for gpu tasks then I want to make that possible, even if it means that
everything else thrashes horribly. And as long as the system recovers and
rebalances after that gpu memory hog is gone ofc. Iirc ttm currently has a
fairly arbitrary (tunable) setting to limit system memory consumption, but
I might be wrong on that.
Yes, it currently limits you to half of memory, but at least we would
like to make it tunable since there are a lot of use cases where the
user wants to use 90% of memory for GPU tasks at the expense of
everything else.
Ime a lot of fun stuff starts to happen when you go there. We have piles
of memory thrashing testcases and generally had lots of fun with our
shrinker, so I think until you've really beaten onto those paths in
ttm+radeon I'd keep the limit where it is.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
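
For readers following the thread, the heuristic described in the quoted patch (halve the maximum pool size on every shrink, raise it again once an allocation arrives at least one second after the last shrink) can be sketched roughly as below. This is only an illustration, not the code from the patch: struct pool_limit, pool_on_shrink(), pool_on_alloc(), the POOL_MIN_SIZE floor, and the choice to "bump" by doubling back toward the default are all assumptions made here, and time is simplified to whole seconds.

```c
/* Illustrative sketch only; none of these names come from TTM. */

#define POOL_MIN_SIZE   1UL  /* assumed floor so the cap never reaches 0 */
#define POOL_GROW_DELAY 1UL  /* seconds; the "arbitrary" one-second delay */

struct pool_limit {
	unsigned long max_size;     /* current cap on pooled pages */
	unsigned long default_max;  /* cap used when there is no pressure */
	unsigned long last_shrink;  /* time of the last shrink, in seconds */
};

/* Called when the shrinker asks the pool to give memory back. */
void pool_on_shrink(struct pool_limit *p, unsigned long now)
{
	p->max_size /= 2;
	if (p->max_size < POOL_MIN_SIZE)
		p->max_size = POOL_MIN_SIZE;
	p->last_shrink = now;
}

/* Called on allocation; raises the cap again once pressure has eased. */
void pool_on_alloc(struct pool_limit *p, unsigned long now)
{
	if (now - p->last_shrink < POOL_GROW_DELAY)
		return;  /* shrunk too recently, keep the reduced cap */

	/* Assumption: "bumped" means doubled, clamped to the default. */
	p->max_size *= 2;
	if (p->max_size > p->default_max)
		p->max_size = p->default_max;
}
```

The objection raised above applies directly to a scheme like this: the cap depends on when the shrinker last ran rather than on what the core mm asks for, so each cache that adopts it adds its own timing-dependent behaviour.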
