On 07/31/2018 07:04 PM, Dmitry Vyukov wrote:
Somewhat offtopic, but I can't understand how SLAB_TYPESAFE_BY_RCU slabs can be useful without ctors or at least memset(0). Objects in such slabs need to be type-stable, but I can't understand how it's possible to establish type stability without a ctor... Are these bugs?
Yeah, I'm puzzled by this too. However, I think it's hard but possible to make it work, at least in theory. There must be an initializer, which consists of two parts: a) initialize the object's fields, b) expose the object to the world (add it to a list or something like that).
Part (a) must somehow be OK to race with another CPU which might already be using the object. Part (b) must use e.g. barriers to make sure that racy users will see the previously initialized fields. Racy users must have a pairing barrier, of course.
But it sounds fishy, and very easy to fuck up. I won't be surprised if every single SLAB_TYPESAFE_BY_RCU user without a ->ctor is bogus. It certainly would be better to convert those to use ->ctor.
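In sketch form, such an initializer might look like this (struct foo, its fields and foo_publish() are hypothetical, not taken from any real cache):

        #include <linux/atomic.h>
        #include <linux/rculist.h>

        struct foo {
                unsigned long key;
                atomic_t refcnt;
                struct hlist_node node;
        };

        static void foo_publish(struct foo *f, unsigned long key,
                                struct hlist_head *head)
        {
                /* (a) initialize fields; a racy reader may still be
                 * looking at the previous incarnation of this object */
                f->key = key;

                /* (b) expose the object: the barrier orders (a) before
                 * publication; readers pair with this via
                 * atomic_inc_not_zero() plus their own re-checks */
                smp_wmb();
                atomic_set(&f->refcnt, 1);
                hlist_add_head_rcu(&f->node, head);
        }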
Such caches seem to be used by the networking subsystem in proto_register():
        prot->slab = kmem_cache_create_usercopy(prot->name, prot->obj_size, 0,
                                                SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT |
                                                prot->slab_flags,
                                                prot->useroffset, prot->usersize,
                                                NULL);
And certain protocols specify SLAB_TYPESAFE_BY_RCU in ->slab_flags, such as: llc_proto, smc_proto, smc_proto6, tcp_prot, tcpv6_prot, dccp_v6_prot, dccp_v4_prot.
Also nf_conntrack_cachep, kernfs_node_cache, jbd2_journal_head_cache and the i915_request cache.
[+CC maintainer of the relevant code.]
Guys, it seems that we have a lot of code using a SLAB_TYPESAFE_BY_RCU cache without a constructor. I think it's nearly impossible to use that combination without having bugs. Either you don't really need SLAB_TYPESAFE_BY_RCU, or you need to have a constructor in the kmem_cache.
Could you guys, please, check whether your code really needs SLAB_TYPESAFE_BY_RCU or a constructor?
E.g. the conntrack code looks extremely suspicious:

        /*
         * Do not use kmem_cache_zalloc(), as this cache uses
         * SLAB_TYPESAFE_BY_RCU.
         */
        ct = kmem_cache_alloc(nf_conntrack_cachep, gfp);
        if (ct == NULL)
                goto out;

        spin_lock_init(&ct->lock);

If nf_conntrack_cachep objects are really used in an RCU-typesafe manner, then the 'ct' returned by kmem_cache_alloc() might still be in use by another CPU. So we just reinitialize a spinlock that is being used by someone else?
Andrey Ryabinin aryabinin@virtuozzo.com wrote:
If nf_conntrack_cachep objects are really used in an RCU-typesafe manner, then the 'ct' returned by kmem_cache_alloc() might still be in use by another CPU. So we just reinitialize a spinlock that is being used by someone else?
That would be a bug; nf_conn objects are reference counted.
The spinlock can only be used after the object has had its refcount incremented.
Lookup operation on an nf_conn object:
1. compare keys
2. attempt to obtain a refcount (using the _not_zero version)
3. compare keys again after the refcount was obtained
If any of that fails, the nf_conn candidate is skipped.
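In sketch form, with the same hypothetical struct foo as above (foo_put() is an assumed helper that drops the reference):

        /* caller must hold rcu_read_lock() */
        static struct foo *foo_lookup(struct hlist_head *head, unsigned long key)
        {
                struct foo *f;

                hlist_for_each_entry_rcu(f, head, node) {
                        if (f->key != key)              /* 1. compare keys */
                                continue;
                        if (!atomic_inc_not_zero(&f->refcnt))
                                continue;               /* 2. being freed: skip */
                        if (f->key != key) {            /* 3. re-check after ref */
                                foo_put(f);             /* object was reused */
                                continue;
                        }
                        return f;                       /* caller now holds a ref */
                }
                return NULL;
        }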
On Tue, Jul 31, 2018 at 10:10 AM Florian Westphal fw@strlen.de wrote:
That would be a bug; nf_conn objects are reference counted. The spinlock can only be used after the object has had its refcount incremented. Lookup: compare keys, attempt to obtain a refcount (using the _not_zero version), compare keys again after the refcount was obtained; if any of that fails, the nf_conn candidate is skipped.
Yes, the key here is the refcount; it is the only thing we need to clear after kmem_cache_alloc().
By definition, if an object is being freed/reallocated, the refcount should already be 0, and clearing it again is a no-op.
On Tue, 31 Jul 2018, Andrey Ryabinin wrote:
If nf_conntrack_cachep objects are really used in an RCU-typesafe manner, then the 'ct' returned by kmem_cache_alloc() might still be in use by another CPU. So we just reinitialize a spinlock that is being used by someone else?
ct may still be read by another CPU in an RCU section, but the object was freed elsewhere, so no other processor may modify the object.
The lock must have been released before freeing the slab object, and thus the initialization of the spinlock would be unnecessary if it were done in a ctor.
If there is refcounting going on, then why use SLAB_TYPESAFE_BY_RCU?
On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter cl@linux.com wrote:
If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?
To allow fast reuse of objects, without going through call_rcu() and the resulting loss of cache efficiency.
I believe this is mentioned in Documentation/RCU/rculist_nulls.txt.
On Tue, Jul 31, 2018 at 7:41 PM, Eric Dumazet edumazet@google.com wrote:
On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter cl@linux.com wrote:
If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?
To allow fast reuse of objects, without going through call_rcu() and the resulting loss of cache efficiency.
I believe this is mentioned in Documentation/RCU/rculist_nulls.txt.
Is it OK to overwrite ct->status? It seems there are some reads and writes to it right after atomic_inc_not_zero().
On Tue, Jul 31, 2018 at 10:51 AM Dmitry Vyukov dvyukov@google.com wrote:
Is it OK to overwrite ct->status? It seems there are some reads and writes to it right after atomic_inc_not_zero().
If it is after a (successful) atomic_inc_not_zero(), the object is guaranteed to be alive (not freed or about to be freed).
About reading/writing a specific field, all traditional locking rules apply.
For TCP socket, we would generally grab the socket lock before reading/writing various fields.
ct->status seems to be manipulated with set_bit() and clear_bit() which are SMP safe.
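For illustration (the flag bits are real nf_conntrack names; this exact usage is hypothetical):

        /* atomic read-modify-write ops on the ct->status bit field */
        set_bit(IPS_CONFIRMED_BIT, &ct->status);
        clear_bit(IPS_DYING_BIT, &ct->status);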
On Tue, Jul 31, 2018 at 10:36 AM Christopher Lameter cl@linux.com wrote:
If there is refcounting going on then why use SLAB_TYPESAFE_BY_RCU?
.. because the object can be accessed (by RCU) after the refcount has gone down to zero, and the thing has been released.
That's the whole and only point of SLAB_TYPESAFE_BY_RCU.
That flag basically says:
"I may end up accessing this object *after* it has been free'd, because there may be RCU lookups in flight"
This has nothing to do with constructors. It's ok if the object gets reused as an object of the same type and does *not* get re-initialized, because we're perfectly fine seeing old stale data.
What it guarantees is that the slab isn't shared with any other kind of object, _and_ that the underlying pages are free'd after an RCU quiescent period (so the pages aren't shared with another kind of object either during an RCU walk).
And it doesn't necessarily have to have a constructor, because the thing that a RCU walk will care about is
(a) guaranteed to be an object that *has* been on some RCU list (so it's not a "new" object)
(b) the RCU walk needs to have logic to verify that it's still the *same* object and hasn't been re-used as something else.
So the re-use might initialize the fields lazily, not necessarily using a ctor.
And the point of using SLAB_TYPESAFE_BY_RCU is that using the more traditional RCU freeing - where you free each object one by one with an RCU delay - can be prohibitively slow and have a huge memory overhead (because of big chunks of memory that are queued for freeing).
In contrast, SLAB_TYPESAFE_BY_RCU memory gets free'd and re-used immediately, but because it gets reused as the same kind of object, the RCU walker can "know" what parts have meaning for re-use, in a way it couldn't if the re-use was random.
That said, it *is* subtle, and people should be careful.
Linus
On Tue, Jul 31, 2018 at 10:49 AM Linus Torvalds torvalds@linux-foundation.org wrote:
So the re-use might initialize the fields lazily, not necessarily using a ctor.
In particular, the pattern that nf_conntrack uses looks like it is safe.
If you have a well-defined refcount, and use "atomic_inc_not_zero()" to guard the speculative RCU access section, and use "atomic_dec_and_test()" in the freeing section, then you should be safe wrt new allocations.
If you have a completely new allocation that has "random stale content", you know that it cannot be on the RCU list, so there is no speculative access that can ever see that random content.
So the only case you need to worry about is a re-use allocation, and you know that the refcount will start out as zero even if you don't have a constructor.
So you can think of the refcount itself as always having a zero constructor, *BUT* you need to be careful with ordering.
In particular, whoever does the allocation needs to then set the refcount to a non-zero value *after* it has initialized all the other fields. And in particular, it needs to make sure that it uses the proper memory ordering to do so.
And in this case, we have
        static struct nf_conn *
        __nf_conntrack_alloc(struct net *net, ...)
        {
                ...
                atomic_set(&ct->ct_general.use, 0);
which is a no-op for the re-use case (whether racing or not, since any "inc_not_zero" users won't touch it), but initializes it to zero for the "completely new object" case.
And then, the thing that actually exposes it to the speculative walkers does:
        int nf_conntrack_hash_check_insert(struct nf_conn *ct)
        {
                ...
                smp_wmb(); /* The caller holds a reference to this object */
                atomic_set(&ct->ct_general.use, 2);
which means that it stays as zero until everything is actually set up, and then the optimistic walker can use the other fields (including spinlocks etc) to verify that it's actually the right thing. The smp_wmb() means that the previous initialization really will be visible before the object is visible.
Side note: on some architectures it might help to make that "smp_wmb() -> atomic_set()" sequence be an "smp_store_release()" instead. Doesn't matter on x86, but might matter on arm64.
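For illustration, the kernel's atomic_set_release() (a wrapper around smp_store_release()) could express that; a sketch:

        /* replaces the smp_wmb() + atomic_set() pair with a release store */
        atomic_set_release(&ct->ct_general.use, 2);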
NOTE! One thing to be very worried about is that re-initializing whatever RCU lists means that the RCU walker may now be walking on the wrong list, so while the walker may do the right thing for this particular entry, it may miss walking *other* entries. So then you can get spurious lookup failures, because the RCU walker never walked all the way to the end of the right list. That ends up being a much more subtle bug.
But the nf_conntrack case seems to get that right too, see the restart in ____nf_conntrack_find().
So I don't see anything wrong in nf_conntrack.
But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of the subtleties have nothing to do with having a constructor, they are about those "make sure memory ordering wrt refcount is right" and "restart speculative RCU walk" issues that actually happen regardless of having a constructor or not.
Linus
On Tue, Jul 31, 2018 at 8:51 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of the subtleties have nothing to do with having a constructor, they are about those "make sure memory ordering wrt refcount is right" and "restart speculative RCU walk" issues that actually happen regardless of having a constructor or not.
Thank you, this is very enlightening.
So the type-stable invariant is established by initialization of the object after the first kmem_cache_alloc(), and then we rely on the fact that repeated initialization does not break the invariant, which works because the state is very simple (including in debug builds, i.e. no ODEBUG nor magic values).
There is a bunch of other SLAB_TYPESAFE_BY_RCU uses without ctor:
https://elixir.bootlin.com/linux/latest/source/fs/jbd2/journal.c#L2395
https://elixir.bootlin.com/linux/latest/source/fs/kernfs/mount.c#L415
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_co...
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/i915_gem...
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/selftest...
https://elixir.bootlin.com/linux/latest/source/drivers/staging/lustre/lustre...
Also these in proto structs:
https://elixir.bootlin.com/linux/latest/source/net/dccp/ipv4.c#L959
https://elixir.bootlin.com/linux/latest/source/net/dccp/ipv6.c#L1048
https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L2461
https://elixir.bootlin.com/linux/latest/source/net/ipv6/tcp_ipv6.c#L1980
https://elixir.bootlin.com/linux/latest/source/net/llc/af_llc.c#L145
https://elixir.bootlin.com/linux/latest/source/net/smc/af_smc.c#L105
Their caches are later created in net/core/sock.c without a ctor.
I guess it would be useful to have such an extensive comment for each SLAB_TYPESAFE_BY_RCU use, explaining why it is needed and how all the tricky aspects are handled.
For example, the one in jbd2 is interesting because it memsets the whole object before freeing it into a SLAB_TYPESAFE_BY_RCU slab:
        memset(jh, JBD2_POISON_FREE, sizeof(*jh));
        kmem_cache_free(jbd2_journal_head_cache, jh);
I guess there are also tricky ways it can all work in the end (the type-stable state is only a byte, or we check for all possible combinations of being overwritten with JBD2_POISON_FREE). But at first sight it does look fishy.
On Wed, Aug 1, 2018 at 10:46 AM, Dmitry Vyukov dvyukov@google.com wrote:
So the type-stable invariant is established by initialization of the object after the first kmem_cache_alloc(), and then we rely on the fact that repeated initialization does not break the invariant, which works because the state is very simple (including in debug builds, i.e. no ODEBUG nor magic values).
Still can't grasp all the details. There is state that we read without taking a ct->ct_general.use ref first, namely ct->status and what's used by nf_ct_key_equal(). So let's say the entry we want to find is in the list, but ____nf_conntrack_find() finds a wrong entry earlier, because all the state it looks at is random garbage, so it returns the wrong entry to __nf_conntrack_find_get(). Now the (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)) check in __nf_conntrack_find_get() triggers, and it returns NULL to the caller (which means the entry is not present). While in reality the entry is present, we were just looking at the wrong one.
Also, I am not sure about the order of checks in (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)), because checking the state before taking the ref is only a best-effort hint, so it can actually be a dying entry by the time we take the ref.
So shouldn't it read something like the following?
        rcu_read_lock();
begin:
        h = ____nf_conntrack_find(net, zone, tuple, hash);
        if (h) {
                ct = nf_ct_tuplehash_to_ctrack(h);
                if (!atomic_inc_not_zero(&ct->ct_general.use))
                        goto begin;
                if (unlikely(nf_ct_is_dying(ct)) ||
                    unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
                        nf_ct_put(ct);
                        goto begin;
                }
        }
        rcu_read_unlock();
        return h;
Dmitry Vyukov dvyukov@google.com wrote:
So let's say the entry we want to find is in the list, but ____nf_conntrack_find() finds a wrong entry earlier, because all the state it looks at is random garbage, so it returns the wrong entry to __nf_conntrack_find_get().
If an entry can be found, it can't be random garbage. We never link entries into the global table until their state has been set up.
Now the (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)) check in __nf_conntrack_find_get() triggers, and it returns NULL to the caller (which means the entry is not present).
So the entry is going away or marked as dead, which for us is the same as 'not present'; we need to allocate a new entry.
While in reality the entry is present, but we were just looking at the wrong one.
We never add identical tuples to the global table.
If N cores receive identical packets at the same time with no prior state, all will allocate a new conntrack, but we notice this when we try to insert the nf_conn entries into the table.
Only one will succeed; the other CPUs have to cope with this (worst case: all raced packets are dropped along with their conntrack objects).
For lookup, we have following scenarios:
1. It doesn't exist -> new allocation needed
2. It exists, not dead, has nonzero refcount -> use it
3. It exists, but marked as dying -> new allocation needed
4. It exists but has 0 reference count -> new allocation needed
5. It exists, we get a reference, but the 2nd nf_ct_key_equal check fails. We saw a matching 'old incarnation' that just got re-used on another core. -> retry lookup
Also, I am not sure about the order of checks in (nf_ct_is_dying(ct) || !atomic_inc_not_zero(&ct->ct_general.use)), because checking the state before taking the ref is only a best-effort hint, so it can actually be a dying entry by the time we take the ref.
Yes, it can also become a dying entry after we took the reference.
So shouldn't it read something like the following?
        if (!atomic_inc_not_zero(&ct->ct_general.use))
                goto begin;
        if (unlikely(nf_ct_is_dying(ct)) ||
            unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
                nf_ct_put(ct);
It would be OK to make this change, but the dying bit can be set at any time, e.g. because userspace tells the kernel to flush the conntrack table. So the refcount is always > 0 when the DYING bit is set.
I don't see why it would be a problem.
The nf_conn struct will stay valid until all CPUs have dropped their references. The check in the lookup function only serves to hide the known-to-go-away entry.
On Wed, Aug 1, 2018 at 12:35 PM, Florian Westphal fw@strlen.de wrote:
Dmitry Vyukov dvyukov@google.com wrote:
So let's say the entry we want to find is in the list, but ____nf_conntrack_find() finds a wrong entry earlier, because all the state it looks at is random garbage, so it returns the wrong entry to __nf_conntrack_find_get().
If an entry can be found, it can't be random garbage. We never link entries into the global table until their state has been set up.
But... we don't hold a reference to the entry. So say it's in the table with valid state; now ____nf_conntrack_find() discovers it; now the entry is removed and reused a dozen times, with all the associated state reinitialization. And nf_ct_key_equal() looks at it concurrently and decides whether it's the entry we are looking for or not. I think unless we hold a ref to the entry, its state needs to be considered random garbage for correctness reasoning.
Dmitry Vyukov dvyukov@google.com wrote:
But... we don't hold a reference to the entry. So say it's in the table with valid state; now ____nf_conntrack_find() discovers it; now the entry is removed and reused a dozen times, with all the associated state reinitialization. And nf_ct_key_equal() looks at it concurrently and decides whether it's the entry we are looking for or not. I think unless we hold a ref to the entry, its state needs to be considered random garbage for correctness reasoning.
It checks whether it might be the entry we're looking for.
If this was completely random garbage, the scheme would not work, as then we could have an entry that isn't in the table, has a nonzero refcount, but has its confirmed bit set.
I don't see how that would be possible; any reallocation makes sure ct->status has the CONFIRMED bit clear before setting the refcount to a nonzero value.
I think the scenario you hint at is:
1. nf_ct_key_equal is true
2. the entry is freed (or was already freed)
3. we return NULL to the caller due to atomic_inc_not_zero() failure
But I fail to see how that's wrong.
Sure, we could restart the lookup, but how would that help?
We'd not find the 'candidate' entry again.
We might find an entry that has been inserted at this very instant, but newly allocated entries are only inserted into the global table once the skb that created the nf_conn object has made it through the network stack (postrouting for forwarded packets, or input for local delivery).
So the likelihood of such a restart finding another candidate is close to 0, and it won't prevent the 'insert race' from happening.
On Wed, Aug 1, 2018 at 1:40 PM, Florian Westphal fw@strlen.de wrote:
I think the scenario you hint at is:
1. nf_ct_key_equal is true
2. the entry is freed (or was already freed)
3. we return NULL to the caller due to atomic_inc_not_zero() failure
But I fail to see how that's wrong.
Sure, we could restart the lookup, but how would that help? We'd not find the 'candidate' entry again.
The scenario I have in mind is different, and it relates to the fact that ____nf_conntrack_find() will return the right entry if it's present, but it can also return an unrelated entry, because the entries it looks at change underneath it.
Let's take any 2 fields compared by nf_ct_key_equal for simplicity, for example, src.u3 and dst.u3. Let's say we are looking for an entry with src.u3=A and dst.u3=B; let's call it (A,B). Let's say there is an existing entry 1 (A,B) in the global list. But there is also entry 2 (A,C) earlier in the list. Now ____nf_conntrack_find starts checking entry 2 (A,C): it checks that src.u3==A, so far so good. Now another thread deletes, reuses and reinitializes entry 2 for (C,B). Now ____nf_conntrack_find checks that dst.u3==B and concludes that this is the right entry, because it observed (A,B). Entry 2 is returned to __nf_conntrack_find_get. Now another thread marks entry 2 as dying. Now __nf_conntrack_find_get sees that it's dying and returns NULL to the caller, _but_ the matching entry 1 (A,B) was in the list all that time and we should have discovered it; we didn't because we were derailed by the wrong entry 2.
If that scenario is possible, then a fix would be to make __nf_conntrack_find_get() return NULL only if it got NULL from ____nf_conntrack_find() (not if any of the checks failed).
Dmitry Vyukov dvyukov@google.com wrote:
If that scenario is possible, then a fix would be to make
Looks possible.
__nf_conntrack_find_get() return NULL only if it got NULL from ____nf_conntrack_find() (not if any of the checks failed).
I don't see why we need to restart on nf_ct_is_dying(), but other than that this seems ok.
On Wed, Aug 1, 2018 at 3:46 PM, Florian Westphal fw@strlen.de wrote:
I don't see why we need to restart on nf_ct_is_dying(), but other than that this seems ok.
Because it can be the wrong entry that is dying. When we check dying, we don't yet know whether we are looking at the right entry. If we successfully acquire a reference, then recheck nf_ct_key_equal(), and _then_ check dying, we don't need to restart on dying. But with the current check order, we need to restart on dying too.
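In sketch form, using the same variables as the snippet quoted earlier (an illustration of the proposed order, not actual kernel code):

        if (!atomic_inc_not_zero(&ct->ct_general.use))
                goto begin;                     /* freed: retry the lookup */
        if (unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
                nf_ct_put(ct);
                goto begin;                     /* reused for another tuple: retry */
        }
        if (unlikely(nf_ct_is_dying(ct))) {
                nf_ct_put(ct);
                h = NULL;                       /* right entry, genuinely dying */
        }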
On Wed 01-08-18 10:46:35, Dmitry Vyukov wrote:
For example, the one in jbd2 is interesting because it memsets the whole object before freeing it into a SLAB_TYPESAFE_BY_RCU slab:

        memset(jh, JBD2_POISON_FREE, sizeof(*jh));
        kmem_cache_free(jbd2_journal_head_cache, jh);
The RCU access is used from a single place:
fs/jbd2/transaction.c: jbd2_write_access_granted()
There are also quite a few comments explaining why what it does is safe. The overwrite with JBD2_POISON_FREE is much older than this RCU stuff (honestly, I didn't know about it until this moment) and has nothing to do with the safety of the RCU access.
Honza
On 07/31/2018 09:51 PM, Linus Torvalds wrote:
But yes, using SLAB_TYPESAFE_BY_RCU is very very subtle. But most of the subtleties have nothing to do with having a constructor, they are about those "make sure memory ordering wrt refcount is right" and "restart speculative RCU walk" issues that actually happen regardless of having a constructor or not.
I see, thanks. I just don't see any point or advantage in *not* using a constructor with SLAB_TYPESAFE_BY_RCU caches. There always must be a small part of the initialization code that could be placed in a constructor, and it's better to put that part into the constructor to avoid re-initializing what has already been initialized. If people use SLAB_TYPESAFE_BY_RCU instead of the traditional kfree_rcu() for cache efficiency, and because they want to reuse objects faster, then avoiding a few writes into cache-hot data doesn't look like a bad idea. E.g. the nf_conntrack case could at least move the spin_lock_init() and the atomic_set(&ct->ct_general.use, 0) into the constructor.
I can't think of any advantage in not having the constructor.
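In sketch form (the ctor is hypothetical; the real cache setup in nf_conntrack_init_start() passes more flags and alignment arguments):

        static void nf_conntrack_ctor(void *arg)
        {
                struct nf_conn *ct = arg;

                /* runs once per object when a slab page is allocated,
                 * not on every kmem_cache_alloc(), so reused objects
                 * keep their type-stable state */
                spin_lock_init(&ct->lock);
                atomic_set(&ct->ct_general.use, 0);
        }

        nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
                                                sizeof(struct nf_conn), 0,
                                                SLAB_TYPESAFE_BY_RCU,
                                                nf_conntrack_ctor);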
On Wed, Aug 1, 2018 at 12:23 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
On 08/01/2018 02:03 AM, Andrey Ryabinin wrote:
I can't think of any advantage in not having the constructor.
I can't see any advantage in adding another indirect call, in a RETPOLINE world.
Can you please elaborate on what the problem is here? If the slab ctor call goes through a retpoline, then using ctors more does not introduce any security problems, and ctors are not _that_ slow. If slab ctors are not protected, then we already have a problem that needs to be fixed. What am I missing?
On 08/01/2018 03:34 AM, Dmitry Vyukov wrote:
Can you please elaborate on what the problem is here? If the slab ctor call goes through a retpoline, then using ctors more does not introduce any security problems, and ctors are not _that_ slow. What am I missing?
They _are_ slow, when we have dozens of them in a code path.
I object "having to add" yet another indirect call, if this can be avoided [*]
If some people want to use ctor, fine, but do not request this.
[*] This can be tricky, but worth the pain.
On Wed, Aug 1, 2018 at 1:28 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
They _are_ slow, when we have dozens of them in a code path.
I object to having to add yet another indirect call, if this can be avoided [*].
But we are trading one indirect call for comparable overhead removed from a much more common path. The path that runs ctors also calls into the page allocator, which is much more expensive. So a ctor should be a net win on the performance front, no?
On Wed, 1 Aug 2018, Dmitry Vyukov wrote:
But we are trading one indirect call for comparable overhead removed from a much more common path. The path that runs ctors also calls into the page allocator, which is much more expensive. So a ctor should be a net win on the performance front, no?
A ctor would make it easier to review the flow and guarantee that the object always has certain fields set as required before any use by the subsystem.
ctors are run once, on allocation of the slab page, for all objects in it.
ctors are not called during allocation and freeing of objects from the slab page. So we could avoid the initialization of the spinlock on each object allocation, which should actually be faster.
On Wed, Aug 1, 2018 at 8:15 AM Christopher Lameter cl@linux.com wrote:
ctors are not called during allocation and freeing of objects from the slab page. So we could avoid the initialization of the spinlock on each object allocation, which should actually be faster.
This strategy might have been a win 30 years ago, when CPUs had no caches (or too-small ones anyway).
What is the probability that the 60 bytes around the spinlock are not touched after the object is freshly allocated?
-> None
Writing 60 bytes in one cache line instead of 64 has really the same cost. The cache line miss is the real killer.
Feel free to write the patches, test them, but I doubt you will have any gain.
Remember, btw, that TCP sockets can be either completely fresh (a socket() call, using memset() to clear the whole object) or clones (accept(), thus copying the parent socket).
The idea of having a ctor() would only be a win if all the fields that can be initialized in the ctor are contiguous and fill an integral number of cache lines.
On Wed, Aug 1, 2018 at 5:37 PM, Eric Dumazet edumazet@google.com wrote:
Writing 60 bytes in one cache line instead of 64 has really the same cost. The cache line miss is the real killer.
The idea of having a ctor() would only be a win if all the fields that can be initialized in the ctor are contiguous and fill an integral number of cache lines.
Code size can have some visible performance impact too.
But either way, what you say only means that ctors are not necessarily significantly faster. Your original point was that they are slower. If they are not slower, then what Andrey said seems to make sense: some gain on the code-comprehension front re the type-stability invariant, some gain on the performance front (even if not too big), and no downsides.
On Wed, 1 Aug 2018, Eric Dumazet wrote:
The idea of having a ctor() would only be a win if all the fields that can be initialized in the ctor are contiguous and fill an integral number of cache lines.
Ok. It's reducing code size and makes the object state more consistent. Isn't that enough?
On 08/01/2018 09:22 AM, Christopher Lameter wrote:
Ok. It's reducing code size and makes the object state more consistent. Isn't that enough?
Prove it ;)
I have yet to see actual numbers.
On Wed, Aug 1, 2018 at 6:25 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
Prove it ;)
I have yet to see actual numbers.
Proving with numbers is required for a claimed performance improvement that comes at the cost of code degradation/increase. For a win-win change there is really nothing to prove.
On Wed, Aug 1, 2018 at 9:47 AM Dmitry Vyukov dvyukov@google.com wrote:
Proving with numbers is required for a claimed performance improvement that comes at the cost of code degradation/increase. For a win-win change there is really nothing to prove.
You have to _prove_ it is a win-win.
It is not sufficient to claim it is a win-win.
Sorry, but I do have bugs to take care of.