Hi everyone,
hopefully the last iteration of those patches.
For now I've addressed the issue of immediately unmapping imported BOs from the amdgpu page tables by locking the page tables in place.
For HMM handling we are getting the ability to invalidate BOs without locking the VM anyway, so this last TODO will probably go away rather soon.
Please comment, Christian.
On the exporter side we add optional explicit pinning callbacks, which are called when the importer doesn't implement dynamic handling or move notification, or needs the DMA-buf locked in place for its use case.
On the importer side we add an optional move_notify callback. This callback is used by the exporter to inform the importers that their mappings should be destroyed as soon as possible.
This allows the exporter to provide the mappings without the need to pin the backing store.
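To make the calling convention concrete, below is a minimal importer-side sketch. Everything prefixed with foo_ is invented for illustration; only dma_buf_dynamic_attach(), struct dma_buf_attach_ops, move_notify and dma_buf_map_attachment() are the interfaces touched by this series.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Sketch only: a dynamic importer. All foo_* names are hypothetical. */
struct foo_object {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

static void foo_invalidate_mappings(struct foo_object *obj);

static void foo_move_notify(struct dma_buf_attachment *attach)
{
	struct foo_object *obj = attach->importer_priv;

	/* Called with attach->dmabuf->resv held. Existing mappings stay
	 * valid, but should be destroyed and re-created as soon as
	 * possible since the backing store may have moved.
	 */
	foo_invalidate_mappings(obj);
}

static const struct dma_buf_attach_ops foo_attach_ops = {
	.move_notify = foo_move_notify,
};

static int foo_import(struct foo_object *obj, struct dma_buf *dma_buf,
		      struct device *dev)
{
	obj->attach = dma_buf_dynamic_attach(dma_buf, dev,
					     &foo_attach_ops, obj);
	if (IS_ERR(obj->attach))
		return PTR_ERR(obj->attach);

	/* Because move_notify is provided the exporter doesn't need to
	 * pin; mapping must happen under the reservation lock.
	 */
	dma_resv_lock(dma_buf->resv, NULL);
	obj->sgt = dma_buf_map_attachment(obj->attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dma_buf->resv);

	return PTR_ERR_OR_ZERO(obj->sgt);
}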
v2: don't try to invalidate mappings when the callback is NULL,
    lock the reservation obj while using the attachments,
    add helper to set the callback
v3: move flag for invalidation support into the DMA-buf,
    use new attach_info structure to set the callback
v4: use importer_priv field instead of mangling exporter priv.
v5: drop invalidation_supported flag
v6: squash together with pin/unpin changes
v7: pin/unpin takes an attachment now
v8: nuke dma_buf_attachment_(map|unmap)_locked,
    everything is now handled backward compatible
v9: always cache when exporter/importer don't agree on dynamic handling
v10: minimal style cleanup
v11: drop automatic re-entry avoidance
v12: rename callback to move_notify
v13: add might_lock in appropriate places
v14: rebase on separated locking change
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/dma-buf/dma-buf.c                   | 106 ++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |   6 +-
 include/linux/dma-buf.h                     |  78 ++++++++++++--
 3 files changed, 170 insertions(+), 20 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index d377b4ca66bf..ce293cee76ed 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -529,6 +529,10 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) exp_info->ops->dynamic_mapping)) return ERR_PTR(-EINVAL);
+ if (WARN_ON(!exp_info->ops->dynamic_mapping && + (exp_info->ops->pin || exp_info->ops->unpin))) + return ERR_PTR(-EINVAL); + if (!try_module_get(exp_info->owner)) return ERR_PTR(-ENOENT);
@@ -653,7 +657,8 @@ EXPORT_SYMBOL_GPL(dma_buf_put); * calls attach() of dma_buf_ops to allow device-specific attach functionality * @dmabuf: [in] buffer to attach device to. * @dev: [in] device to be attached. - * @dynamic_mapping: [in] calling convention for map/unmap + * @importer_ops [in] importer operations for the attachment + * @importer_priv [in] importer private pointer for the attachment * * Returns struct dma_buf_attachment pointer for this attachment. Attachments * must be cleaned up by calling dma_buf_detach(). @@ -669,7 +674,8 @@ EXPORT_SYMBOL_GPL(dma_buf_put); */ struct dma_buf_attachment * dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, - bool dynamic_mapping) + const struct dma_buf_attach_ops *importer_ops, + void *importer_priv) { struct dma_buf_attachment *attach; int ret; @@ -683,7 +689,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
attach->dev = dev; attach->dmabuf = dmabuf; - attach->dynamic_mapping = dynamic_mapping; + attach->importer_ops = importer_ops; + attach->importer_priv = importer_priv;
if (dmabuf->ops->attach) { ret = dmabuf->ops->attach(dmabuf, attach); @@ -702,15 +709,19 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, dma_buf_is_dynamic(dmabuf)) { struct sg_table *sgt;
- if (dma_buf_is_dynamic(attach->dmabuf)) + if (dma_buf_is_dynamic(attach->dmabuf)) { dma_resv_lock(attach->dmabuf->resv, NULL); + ret = dma_buf_pin(attach); + if (ret) + goto err_unlock; + }
sgt = dmabuf->ops->map_dma_buf(attach, DMA_BIDIRECTIONAL); if (!sgt) sgt = ERR_PTR(-ENOMEM); if (IS_ERR(sgt)) { ret = PTR_ERR(sgt); - goto err_unlock; + goto err_unpin; } if (dma_buf_is_dynamic(attach->dmabuf)) dma_resv_unlock(attach->dmabuf->resv); @@ -724,6 +735,10 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, kfree(attach); return ERR_PTR(ret);
+err_unpin: + if (dma_buf_is_dynamic(attach->dmabuf)) + dma_buf_unpin(attach); + err_unlock: if (dma_buf_is_dynamic(attach->dmabuf)) dma_resv_unlock(attach->dmabuf->resv); @@ -744,7 +759,7 @@ EXPORT_SYMBOL_GPL(dma_buf_dynamic_attach); struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) { - return dma_buf_dynamic_attach(dmabuf, dev, false); + return dma_buf_dynamic_attach(dmabuf, dev, NULL, NULL); } EXPORT_SYMBOL_GPL(dma_buf_attach);
@@ -767,8 +782,10 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach)
dmabuf->ops->unmap_dma_buf(attach, attach->sgt, attach->dir);
- if (dma_buf_is_dynamic(attach->dmabuf)) + if (dma_buf_is_dynamic(attach->dmabuf)) { + dma_buf_unpin(attach); dma_resv_unlock(attach->dmabuf->resv); + } }
dma_resv_lock(dmabuf->resv, NULL); @@ -781,6 +798,44 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) } EXPORT_SYMBOL_GPL(dma_buf_detach);
+/** + * dma_buf_pin - Lock down the DMA-buf + * + * @attach: [in] attachment which should be pinned + * + * Returns: + * 0 on success, negative error code on failure. + */ +int dma_buf_pin(struct dma_buf_attachment *attach) +{ + struct dma_buf *dmabuf = attach->dmabuf; + int ret = 0; + + dma_resv_assert_held(dmabuf->resv); + + if (dmabuf->ops->pin) + ret = dmabuf->ops->pin(attach); + + return ret; +} +EXPORT_SYMBOL_GPL(dma_buf_pin); + +/** + * dma_buf_unpin - Remove lock from DMA-buf + * + * @attach: [in] attachment which should be unpinned + */ +void dma_buf_unpin(struct dma_buf_attachment *attach) +{ + struct dma_buf *dmabuf = attach->dmabuf; + + dma_resv_assert_held(dmabuf->resv); + + if (dmabuf->ops->unpin) + dmabuf->ops->unpin(attach); +} +EXPORT_SYMBOL_GPL(dma_buf_unpin); + /** * dma_buf_map_attachment - Returns the scatterlist table of the attachment; * mapped into _device_ address space. Is a wrapper for map_dma_buf() of the @@ -800,6 +855,7 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, enum dma_data_direction direction) { struct sg_table *sg_table; + int r;
might_sleep();
@@ -821,13 +877,23 @@ struct sg_table *dma_buf_map_attachment(struct dma_buf_attachment *attach, return attach->sgt; }
- if (dma_buf_is_dynamic(attach->dmabuf)) + if (dma_buf_is_dynamic(attach->dmabuf)) { dma_resv_assert_held(attach->dmabuf->resv); + if (!attach->importer_ops->move_notify) { + r = dma_buf_pin(attach); + if (r) + return ERR_PTR(r); + } + }
sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction); if (!sg_table) sg_table = ERR_PTR(-ENOMEM);
+ if (IS_ERR(sg_table) && dma_buf_is_dynamic(attach->dmabuf) && + !attach->importer_ops->move_notify) + dma_buf_unpin(attach); + if (!IS_ERR(sg_table) && attach->dmabuf->ops->cache_sgt_mapping) { attach->sgt = sg_table; attach->dir = direction; @@ -866,9 +932,33 @@ void dma_buf_unmap_attachment(struct dma_buf_attachment *attach, dma_resv_assert_held(attach->dmabuf->resv);
attach->dmabuf->ops->unmap_dma_buf(attach, sg_table, direction); + + if (dma_buf_is_dynamic(attach->dmabuf) && + !attach->importer_ops->move_notify) + dma_buf_unpin(attach); } EXPORT_SYMBOL_GPL(dma_buf_unmap_attachment);
+/** + * dma_buf_move_notify - notify attachments that DMA-buf is moving + * + * @dmabuf: [in] buffer which is moving + * + * Informs all attachmenst that they need to destroy and recreated all their + * mappings. + */ +void dma_buf_move_notify(struct dma_buf *dmabuf) +{ + struct dma_buf_attachment *attach; + + dma_resv_assert_held(dmabuf->resv); + + list_for_each_entry(attach, &dmabuf->attachments, node) + if (attach->importer_ops && attach->importer_ops->move_notify) + attach->importer_ops->move_notify(attach); +} +EXPORT_SYMBOL_GPL(dma_buf_move_notify); + /** * DOC: cpu access * diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index e2eec7b66334..b2ca78b6abce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -414,6 +414,9 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = { +}; + /** * amdgpu_gem_prime_import - &drm_driver.gem_prime_import implementation * @dev: DRM device @@ -446,7 +449,8 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, if (IS_ERR(obj)) return obj;
- attach = dma_buf_dynamic_attach(dma_buf, dev->dev, true); + attach = dma_buf_dynamic_attach(dma_buf, dev->dev, + &amdgpu_dma_buf_attach_ops, NULL); if (IS_ERR(attach)) { drm_gem_object_put(obj); return ERR_CAST(attach); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index af73f835c51c..7456bb937635 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -93,14 +93,40 @@ struct dma_buf_ops { */ void (*detach)(struct dma_buf *, struct dma_buf_attachment *);
+ /** + * @pin: + * + * This is called by dma_buf_pin and lets the exporter know that the + * DMA-buf can't be moved any more. + * + * This is called with the dmabuf->resv object locked. + * + * This callback is optional. + * + * Returns: + * + * 0 on success, negative error code on failure. + */ + int (*pin)(struct dma_buf_attachment *attach); + + /** + * @unpin: + * + * This is called by dma_buf_unpin and lets the exporter know that the + * DMA-buf can be moved again. + * + * This is called with the dmabuf->resv object locked. + * + * This callback is optional. + */ + void (*unpin)(struct dma_buf_attachment *attach); + /** * @map_dma_buf: * * This is called by dma_buf_map_attachment() and is used to map a * shared &dma_buf into device address space, and it is mandatory. It - * can only be called if @attach has been called successfully. This - * essentially pins the DMA buffer into place, and it cannot be moved - * any more + * can only be called if @attach has been called successfully. * * This call may sleep, e.g. when the backing storage first needs to be * allocated, or moved to a location suitable for all currently attached @@ -141,9 +167,6 @@ struct dma_buf_ops { * * This is called by dma_buf_unmap_attachment() and should unmap and * release the &sg_table allocated in @map_dma_buf, and it is mandatory. - * It should also unpin the backing storage if this is the last mapping - * of the DMA buffer, it the exporter supports backing storage - * migration. */ void (*unmap_dma_buf)(struct dma_buf_attachment *, struct sg_table *, @@ -336,6 +359,34 @@ struct dma_buf { } cb_excl, cb_shared; };
+/** + * struct dma_buf_attach_ops - importer operations for an attachment + * @move_notify: [optional] notification that the DMA-buf is moving + * + * Attachment operations implemented by the importer. + */ +struct dma_buf_attach_ops { + /** + * @move_notify + * + * If this callback is provided the framework can avoid pinning the + * backing store while mappings exists. + * + * This callback is called with the lock of the reservation object + * associated with the dma_buf held and the mapping function must be + * called with this lock held as well. This makes sure that no mapping + * is created concurrently with an ongoing move operation. + * + * Mappings stay valid and are not directly affected by this callback. + * But the DMA-buf can now be in a different physical location, so all + * mappings should be destroyed and re-created as soon as possible. + * + * New mappings can be created after this callback returns, and will + * point to the new location of the DMA-buf. + */ + void (*move_notify)(struct dma_buf_attachment *attach); +}; + /** * struct dma_buf_attachment - holds device-buffer attachment data * @dmabuf: buffer for this attachment. @@ -344,8 +395,9 @@ struct dma_buf { * @sgt: cached mapping. * @dir: direction of cached mapping. * @priv: exporter specific attachment data. - * @dynamic_mapping: true if dma_buf_map/unmap_attachment() is called with the - * dma_resv lock held. + * @importer_ops: importer operations for this attachment, if provided + * dma_buf_map/unmap_attachment() must be called with the dma_resv lock held. + * @importer_priv: importer specific attachment data. * * This structure holds the attachment information between the dma_buf buffer * and its user device(s). The list contains one attachment struct per device @@ -362,7 +414,8 @@ struct dma_buf_attachment { struct list_head node; struct sg_table *sgt; enum dma_data_direction dir; - bool dynamic_mapping; + const struct dma_buf_attach_ops *importer_ops; + void *importer_priv; void *priv; };
@@ -438,16 +491,19 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf) static inline bool dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach) { - return attach->dynamic_mapping; + return !!attach->importer_ops; }
struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev); struct dma_buf_attachment * dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, - bool dynamic_mapping); + const struct dma_buf_attach_ops *importer_ops, + void *importer_priv); void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach); +int dma_buf_pin(struct dma_buf_attachment *attach); +void dma_buf_unpin(struct dma_buf_attachment *attach);
struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
Please ignore this one, I've sent out the wrong version without Daniel's latest comment nitpicks fixed.
The interesting one in this series is the last patch.
Regards, Christian.
Pipeline the removal of the BO's backing store when no placement is given during validation.
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 229205e499db..a56cb2a2b4ad 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1225,6 +1225,18 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, uint32_t new_flags;
dma_resv_assert_held(bo->base.resv); + + /* + * Remove the backing store if no placement is given. + */ + if (!placement->num_placement && !placement->num_busy_placement) { + ret = ttm_bo_pipeline_gutting(bo); + if (ret) + return ret; + + return ttm_tt_create(bo, false); + } + /* * Check whether we need to move buffer. */
Avoid ping/ponging the buffers when we stop pinning DMA-buf exports by using the allowed domains for exported buffers.
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 80ba6dfc54e2..d938d458e22d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -28,6 +28,7 @@ #include <linux/file.h> #include <linux/pagemap.h> #include <linux/sync_file.h> +#include <linux/dma-buf.h>
#include <drm/amdgpu_drm.h> #include <drm/drm_syncobj.h> @@ -417,7 +418,9 @@ static int amdgpu_cs_bo_validate(struct amdgpu_cs_parser *p, /* Don't move this buffer if we have depleted our allowance * to move it. Don't move anything if the threshold is zero. */ - if (p->bytes_moved < p->bytes_moved_threshold) { + if (p->bytes_moved < p->bytes_moved_threshold && + (!bo->tbo.base.dma_buf || + list_empty(&bo->tbo.base.dma_buf->attachments))) { if (!amdgpu_gmc_vram_full_visible(&adev->gmc) && (bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED)) { /* And don't move a CPU_ACCESS_REQUIRED BO to limited
This implements the exporter side of unpinned DMA-buf handling.
v2: fix minor coding style issues
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 53 ++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  5 ++
 2 files changed, 51 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index b2ca78b6abce..770baba621b3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -222,6 +222,37 @@ static void amdgpu_dma_buf_detach(struct dma_buf *dmabuf, bo->prime_shared_count--; }
+/** + * amdgpu_dma_buf_pin - &dma_buf_ops.pin implementation + * + * @attach: attachment to pin down + * + * Pin the BO which is backing the DMA-buf so that it can't move any more. + */ +static int amdgpu_dma_buf_pin(struct dma_buf_attachment *attach) +{ + struct drm_gem_object *obj = attach->dmabuf->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + + /* pin buffer into GTT */ + return amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT); +} + +/** + * amdgpu_dma_buf_unpin - &dma_buf_ops.unpin implementation + * + * @attach: attachment to unpin + * + * Unpin a previously pinned BO to make it movable again. + */ +static void amdgpu_dma_buf_unpin(struct dma_buf_attachment *attach) +{ + struct drm_gem_object *obj = attach->dmabuf->priv; + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + + amdgpu_bo_unpin(bo); +} + /** * amdgpu_dma_buf_map - &dma_buf_ops.map_dma_buf implementation * @attach: DMA-buf attachment @@ -244,9 +275,19 @@ static struct sg_table *amdgpu_dma_buf_map(struct dma_buf_attachment *attach, struct sg_table *sgt; long r;
- r = amdgpu_bo_pin(bo, AMDGPU_GEM_DOMAIN_GTT); - if (r) - return ERR_PTR(r); + if (!bo->pin_count) { + /* move buffer into GTT */ + struct ttm_operation_ctx ctx = { false, false }; + + amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT); + r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); + if (r) + return ERR_PTR(r); + + } else if (!(amdgpu_mem_type_to_domain(bo->tbo.mem.mem_type) & + AMDGPU_GEM_DOMAIN_GTT)) { + return ERR_PTR(-EBUSY); + }
sgt = drm_prime_pages_to_sg(bo->tbo.ttm->pages, bo->tbo.num_pages); if (IS_ERR(sgt)) @@ -277,13 +318,9 @@ static void amdgpu_dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt, enum dma_data_direction dir) { - struct drm_gem_object *obj = attach->dmabuf->priv; - struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); - dma_unmap_sg(attach->dev, sgt->sgl, sgt->nents, dir); sg_free_table(sgt); kfree(sgt); - amdgpu_bo_unpin(bo); }
/** @@ -330,6 +367,8 @@ const struct dma_buf_ops amdgpu_dmabuf_ops = { .dynamic_mapping = true, .attach = amdgpu_dma_buf_attach, .detach = amdgpu_dma_buf_detach, + .pin = amdgpu_dma_buf_pin, + .unpin = amdgpu_dma_buf_unpin, .map_dma_buf = amdgpu_dma_buf_map, .unmap_dma_buf = amdgpu_dma_buf_unmap, .release = drm_gem_dmabuf_release, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 6f60a581e3ba..8ae260822908 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -31,6 +31,7 @@ */ #include <linux/list.h> #include <linux/slab.h> +#include <linux/dma-buf.h>
#include <drm/amdgpu_drm.h> #include <drm/drm_cache.h> @@ -1274,6 +1275,10 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
amdgpu_bo_kunmap(abo);
+ if (abo->tbo.base.dma_buf && !abo->tbo.base.import_attach && + bo->mem.mem_type != TTM_PL_SYSTEM) + dma_buf_move_notify(abo->tbo.base.dma_buf); + /* remember the eviction */ if (evict) atomic64_inc(&adev->num_evictions);
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/** + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation + * + * @attach: the DMA-buf attachment + * + * Invalidate the DMA-buf attachment, making sure that the we re-create the + * mapping before the next use. + */ +static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{ + struct drm_gem_object *obj = attach->importer_priv; + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); + struct ttm_operation_ctx ctx = { false, false }; + struct ttm_placement placement = {}; + struct amdgpu_vm_bo_base *bo_base; + int r; + + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) + return; + + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); + if (r) { + DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r); + return; + } + + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { + struct amdgpu_vm *vm = bo_base->vm; + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; + + if (ticket) { + /* When we get an error here it means that somebody + * else is holding the VM lock and updating page tables + * So we can just continue here. + */ + r = dma_resv_lock(resv, ticket); + if (r) + continue; + + } else { + /* TODO: This is more problematic and we actually need + * to allow page tables updates without holding the + * lock. + */ + if (!dma_resv_trylock(resv)) + continue; + } + + r = amdgpu_vm_clear_freed(adev, vm, NULL); + if (!r) + r = amdgpu_vm_handle_moved(adev, vm); + + if (r && r != -EBUSY) + DRM_ERROR("Failed to invalidate VM page tables (%d))\n", + r); + + dma_resv_unlock(resv); + } +} + static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = { + .move_notify = amdgpu_dma_buf_move_notify };
/** @@ -489,7 +553,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, return obj;
attach = dma_buf_dynamic_attach(dma_buf, dev->dev, - &amdgpu_dma_buf_attach_ops, NULL); + &amdgpu_dma_buf_attach_ops, obj); if (IS_ERR(attach)) { drm_gem_object_put(obj); return ERR_CAST(attach); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 8ae260822908..8c480c898b0d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -926,6 +926,9 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain, return 0; }
+ if (bo->tbo.base.import_attach) + dma_buf_pin(bo->tbo.base.import_attach); + bo->flags |= AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS; /* force to pin into visible video ram */ if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS)) @@ -1009,6 +1012,9 @@ int amdgpu_bo_unpin(struct amdgpu_bo *bo)
amdgpu_bo_subtract_pin_size(bo);
+ if (bo->tbo.base.import_attach) + dma_buf_unpin(bo->tbo.base.import_attach); + for (i = 0; i < bo->placement.num_placement; i++) { bo->placements[i].lpfn = 0; bo->placements[i].flags &= ~TTM_PL_FLAG_NO_EVICT;
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that the we re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{
- struct drm_gem_object *obj = attach->importer_priv;
- struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
- struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
- struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- struct ttm_operation_ctx ctx = { false, false };
- struct ttm_placement placement = {};
- struct amdgpu_vm_bo_base *bo_base;
- int r;
- if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
return;
- r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
- if (r) {
DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
return;
- }
- for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
struct amdgpu_vm *vm = bo_base->vm;
struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attached when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK handling).
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).
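A sketch of what the WARN_ON part could look like in dma_buf_move_notify() (the dummy contention lock under CONFIG_WW_MUTEX_SLOWPATH_DEBUG is left out, and none of this is in the patch):

/* Sketch: enforce that importers call move_notify with a ww_acquire
 * ticket attached to the reservation lock, so exporters may take
 * further locks and rely on full EDEADLK handling.
 */
void dma_buf_move_notify(struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;

	dma_resv_assert_held(dmabuf->resv);
	WARN_ON(!dma_resv_locking_ctx(dmabuf->resv));

	list_for_each_entry(attach, &dmabuf->attachments, node)
		if (attach->importer_ops && attach->importer_ops->move_notify)
			attach->importer_ops->move_notify(attach);
}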
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it helps avoid needless backoffs. I've played around a bit with that, but it's not even at PoC level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket {
	struct ww_acquire_ctx base;

	/* can be set by anyone (including other drivers) that got hold of
	 * this ticket and had to acquire some new lock. This lock might
	 * protect anything, including driver-internal stuff, and isn't
	 * required to be a dma_buf or even just a dma_resv. */
	struct ww_mutex *contended_lock;

	/* callback which the driver (which might be a dma-buf exporter
	 * and not matching the driver that started this locking ticket)
	 * sets together with @contended_lock, for the main driver to drop
	 * when it calls dma_resv_unlock on the contended_lock. */
	void (*drop_ref)(struct ww_mutex *contended_lock);
};
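As a rough illustration of the backoff flow around such a ticket (nothing here exists; all foo_* helpers are made up, and ticket->base is assumed to be an initialized ww_acquire_ctx):

/* Sketch: EDEADLK backoff with the hypothetical dma_resv_ticket above.
 * The contended lock may belong to a completely different driver.
 */
static int foo_lock_and_validate_all(struct dma_resv_ticket *ticket)
{
	int ret;

retry:
	ret = foo_lock_and_validate(ticket); /* may set ticket->contended_lock */
	if (ret == -EDEADLK && ticket->contended_lock) {
		struct ww_mutex *contended = ticket->contended_lock;

		/* Drop everything we hold, then sleep on the contended
		 * lock; a real implementation would probably keep it
		 * locked across the retry instead of releasing it here.
		 */
		foo_unlock_all(ticket);
		ww_mutex_lock_slow(contended, &ticket->base);
		ww_mutex_unlock(contended);

		/* Drop the temporary reference the other driver handed us. */
		ticket->drop_ref(contended);
		ticket->contended_lock = NULL;
		goto retry;
	}

	return ret;
}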
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Plan B would be to throw up our hands and declare that "move_notify is best effort only and can fail for any reason". Exactly like ttm eviction currently does, even with all your hacks to do at least some dma_resv_lock (but not the full slowpath).
Given how much "fun" you have with all the low memory handling and ttm fundamentally being best-effort only (despite that dma_resv would allow us to do this right, with some work) I'm not sure that's a good idea to extend to a cross-driver interface. Personally I'd lean towards fixing this first fully (in ttm/amdgpu), and then using that to implement move_notify correctly.
Or just add an int return value here and mandate that importers must handle eviction failures. Exactly like ttm_mem_evict_first can currently still fail for various reasons.
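Purely to make that alternative concrete (this is not what the series does), the importer-facing hook would then become fallible, something like:

/* Hypothetical "plan B" variant: move_notify may fail and importers
 * have to cope, e.g. by keeping the existing (pinned) mapping around.
 */
struct dma_buf_attach_ops {
	int (*move_notify)(struct dma_buf_attachment *attach);
};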
(Sorry this isn't exactly the mail you hoped for)
Cheers, Daniel
On 17.02.20 at 18:55, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that the we re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{
- struct drm_gem_object *obj = attach->importer_priv;
- struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
- struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
- struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- struct ttm_operation_ctx ctx = { false, false };
- struct ttm_placement placement = {};
- struct amdgpu_vm_bo_base *bo_base;
- int r;
- if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
return;
- r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
- if (r) {
DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
return;
- }
- for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
struct amdgpu_vm *vm = bo_base->vm;
struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
That is pretty much exactly my thinking as well.
And is also the sole reason why I started looking into the ww_mutex cursor handling a while back (e.g. the initial version with the horrible macro hack).
But this is really really hard to get right. So my thinking for now is to push this series upstream to at least unblock my ongoing P2P work.
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock;
/* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock); };
My initial thinking was to make all of this part of the core ww_mutex implementation, but then I quickly found that this won't work.
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
The next idea was to have it based on dma_resv objects, but as you also figured out you then need to drop the reference to the contended lock somehow...
So my current working plan was to use GEM objects to avoid the callback...
Plan B would be to throw our hands into and declare that "move_notify is best effort only and can fail for any reason". Exactly like ttm eviction currently does, even with all your hacks to do at least some dma_resv_lock (but not the full slowpath).
I would seriously NAK that. We have tried that with TTM and the whole idea is just braindead.
You can use trylock based eviction for things like best effort shrinker callbacks. But that's pretty much it.
Given how much "fun" you have with all the low memory handling and ttm fundamentally being best-effort only (despite that dma_resv would allow us to do this right, with some work) I'm not sure that's a good idea to extend to a cross-driver interface. Personally I'd lean towards fixing this first fully (in ttm/amdgpu), and then using that to implement move_notify correctly.
Well have you seen this: https://fosdem.org/2020/schedule/event/ttm/ :)
Or just add an int return value here and mandate that importers must handle eviction failures. Exactly like ttm_mem_evict_first can currently still fail for various reasons.
(Sorry this isn't exactly the mail you hoped for)
Well, it's actually the mail I expected. I've been thinking about exactly those problems for over a year now as well.
For the rather specific amdgpu case I could work around that by utilizing the HMM work to invalidate page tables on the fly, but that doesn't really help with memory management in general.
So YES, I totally agree that we need some sort of GEM execution context or something like this to lock buffers on the fly as we try to make room for others.
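To sketch the direction (nothing like this exists yet; all names are invented): such an execution context would keep one ww_acquire ticket plus the list of everything locked so far, and back off cleanly whenever pulling in a new object hits contention:

#include <linux/dma-resv.h>
#include <linux/list.h>

/* Hypothetical driver object; only ->resv and the list node matter here. */
struct foo_bo {
	struct dma_resv *resv;
	struct list_head exec_node;
};

/* Hypothetical execution context tracking everything locked so far. */
struct foo_exec {
	struct ww_acquire_ctx ticket;
	struct list_head locked;
};

static void foo_exec_init(struct foo_exec *exec)
{
	ww_acquire_init(&exec->ticket, &reservation_ww_class);
	INIT_LIST_HEAD(&exec->locked);
}

static void foo_exec_unlock_all(struct foo_exec *exec)
{
	struct foo_bo *bo, *tmp;

	list_for_each_entry_safe(bo, tmp, &exec->locked, exec_node) {
		dma_resv_unlock(bo->resv);
		list_del_init(&bo->exec_node);
	}
}

/* Lock one more object on the fly; on contention drop everything, take
 * the contended object in the slow path and tell the caller to restart.
 */
static int foo_exec_lock(struct foo_exec *exec, struct foo_bo *bo)
{
	int ret = dma_resv_lock(bo->resv, &exec->ticket);

	if (ret == -EDEADLK) {
		foo_exec_unlock_all(exec);
		dma_resv_lock_slow(bo->resv, &exec->ticket);
		ret = -EAGAIN;	/* bo is now locked, caller restarts its loop */
	}
	if (!ret || ret == -EAGAIN)
		list_add_tail(&bo->exec_node, &exec->locked);
	return ret;
}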
Regards, Christian.
On Mon, Feb 17, 2020 at 7:58 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 17.02.20 at 18:55, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that the we re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{
- struct drm_gem_object *obj = attach->importer_priv;
- struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
- struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
- struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- struct ttm_operation_ctx ctx = { false, false };
- struct ttm_placement placement = {};
- struct amdgpu_vm_bo_base *bo_base;
- int r;
- if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
return;
- r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
- if (r) {
DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
return;
- }
- for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
struct amdgpu_vm *vm = bo_base->vm;
struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
That is pretty much exactly my thinking as well.
And is also the sole reason why I started looking into the ww_mutex cursor handling a while back (e.g. the initial version with the horrible macro hack).
But this is really really hard to get right. So my thinking for now is to push this series upstream to at least unblock my ongoing P2P work.
Hm, but at least the move_notify stuff and the locking nightmare around it feel rushed if we just push that. Otoh it's indeed getting painful, and we'll probably have another few rounds of headaches to sort this all out. What about a
config EXPERIMENTAL_DYNAMIC_DMA_BUF
	depends on BROKEN
Wrapped around the new ->move_notify hook, plus all relevant code? That way you can land at least something, in-tree refactoring might become easier with at least some example of what it needs to achieve. But we're also not tricking anyone into believing that this is production ready.
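On the code side the gate could be as small as treating the importer as non-dynamic whenever the option is off; a sketch (the config symbol is whatever we end up naming it, the helper name is made up):

/* Sketch: only honour move_notify when the experimental option is set,
 * otherwise fall back to the pinned path as if no importer ops existed.
 */
static bool
dma_buf_attachment_wants_dynamic(const struct dma_buf_attachment *attach)
{
	return IS_ENABLED(CONFIG_EXPERIMENTAL_DYNAMIC_DMA_BUF) &&
	       attach->importer_ops && attach->importer_ops->move_notify;
}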
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock; /* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock);
};
My initial thinking was to make all of this part of the core ww_mutex implementation, but then I quickly found that this won't work.
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
The next idea was to have it based on dma_resv objects, but as you also figured out you then need to drop the reference to the contended lock somehow...
So my current working plan was to use GEM object to avoid the callback...
I've heard noise that someone is looking into adding dynamic dma-buf support to stuff like rdma drivers. Because interconnects and big machines. Plus feels a bit awkward to mandate a gem library if you want to use dynamic dma-buf support in your driver. Hence why I think something around dma_resv (but with enough flexibility that it doesn't insist that the contending lock must be a dma_resv itself).
Plan B would be to throw our hands into and declare that "move_notify is best effort only and can fail for any reason". Exactly like ttm eviction currently does, even with all your hacks to do at least some dma_resv_lock (but not the full slowpath).
I would seriously NAK that. We have tried that with TTM and the whole idea is just braindead.
You can use trylock based eviction for things like best effort shrinker callbacks. But that's pretty much it.
Yeah trylock works pretty well for balancing caches, right up to the point where you actually have to shrink stuff. Then suddenly all the locks are contended because everyone is running low on memory :-/ We have glorious amounts of experience with our best effort system memory shrinker in i915 ...
Given how much "fun" you have with all the low memory handling and ttm fundamentally being best-effort only (despite that dma_resv would allow us to do this right, with some work) I'm not sure that's a good idea to extend to a cross-driver interface. Personally I'd lean towards fixing this first fully (in ttm/amdgpu), and then using that to implement move_notify correctly.
Well have you seen this: https://fosdem.org/2020/schedule/event/ttm/ :)
Oh cool, I was waiting for the upload. Will watch asap. btw slides somewhere?
Or just add an int return value here and mandate that importers must handle eviction failures. Exactly like ttm_mem_evict_first can currently still fail for various reasons.
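For reference, that alternative would look roughly like this (a sketch only, not what the series adds, where move_notify returns void; the struct name and error semantics here are assumptions):

#include <linux/dma-buf.h>

/* hypothetical variant of the importer ops, best-effort eviction */
struct dma_buf_attach_ops_best_effort {
	/*
	 * Return 0 once all mappings are torn down, or a negative error
	 * (e.g. -EBUSY) if the importer could not comply and the exporter
	 * has to treat the move/eviction as failed.
	 */
	int (*move_notify)(struct dma_buf_attachment *attach);
};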
(Sorry this isn't exactly the mail you hoped for)
Well it's actually the mail I expected. I've been thinking about exactly those problems for over a year now as well.
For the rather specific amdgpu case I could work around that by utilizing the HMM work to invalidate page tables on the fly, but that doesn't really help with memory management in general.
Yeah, so move_notify is maybe solvable with better hw and hmm, but there are other scenarios where I think the cross-driver ww_mutex locking will be needed, for fundamental reasons. Scenario:
- a bunch of gpus in pcie slots, all in the same machine
- because pcie is not a nice interconnect (iirc you guys call yours xgmi or something like that)
- working sets that are bigger than vram of a single gpu
- lots of buffer sharing ofc
1. So driver has an imported dma-buf, currently not mapped anywhere because it got thrown out (or first use).
2. Importer calls dma_buf_map_attachment.
3. Exporter realizes there's a nice xgmi link and p2p would be much better if that object is in vram.
4. Exporter does ttm_bo_validate or equivalent to get the bo into vram, including eviction and lots of locking.
5. In turn this might bite back to the importer through some move_notify of objects still mapped, but at the end of the lru.
So ->move_notify might not be the worst, eventually I think we'll need the full locking dance across drivers (or at least across drm_device instances, there might be internal upcasting going on so you get your buffers placed in the right vram and all that directly).
So YES, I totally agree that we need some sort of GEM execution context or something like this to lock buffers on the fly as we try to make room for others.
So what's the plan? Merge current series (with the bikesheds addressed) under this CONFIG_EXPERIMENTAL_DYN_DMABUF and then see where we land from there? Trying to get all the pieces lined up out of tree feels like it's going to be too much :-/ -Daniel
Regards, Christian.
Cheers, Daniel
/* When we get an error here it means that somebody
* else is holding the VM lock and updating page tables
* So we can just continue here.
*/
r = dma_resv_lock(resv, ticket);
if (r)
continue;
} else {
/* TODO: This is more problematic and we actually need
* to allow page tables updates without holding the
* lock.
*/
if (!dma_resv_trylock(resv))
continue;
}
r = amdgpu_vm_clear_freed(adev, vm, NULL);
if (!r)
r = amdgpu_vm_handle_moved(adev, vm);
if (r && r != -EBUSY)
DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
r);
dma_resv_unlock(resv);
	}
}
static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = {
	.move_notify = amdgpu_dma_buf_move_notify
};
/**
@@ -489,7 +553,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, return obj;
 	attach = dma_buf_dynamic_attach(dma_buf, dev->dev,
-					&amdgpu_dma_buf_attach_ops, NULL);
+					&amdgpu_dma_buf_attach_ops, obj);
 	if (IS_ERR(attach)) {
 		drm_gem_object_put(obj);
 		return ERR_CAST(attach);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 8ae260822908..8c480c898b0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -926,6 +926,9 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
 		return 0;
 	}
 
+	if (bo->tbo.base.import_attach)
+		dma_buf_pin(bo->tbo.base.import_attach);
+
 	bo->flags |= AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS;
 	/* force to pin into visible video ram */
 	if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS))
@@ -1009,6 +1012,9 @@ int amdgpu_bo_unpin(struct amdgpu_bo *bo)
 
 	amdgpu_bo_subtract_pin_size(bo);
 
+	if (bo->tbo.base.import_attach)
+		dma_buf_unpin(bo->tbo.base.import_attach);
+
 	for (i = 0; i < bo->placement.num_placement; i++) {
 		bo->placements[i].lpfn = 0;
 		bo->placements[i].flags &= ~TTM_PL_FLAG_NO_EVICT;
-- 2.17.1
Am 17.02.20 um 20:38 schrieb Daniel Vetter:
On Mon, Feb 17, 2020 at 7:58 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 17.02.20 um 18:55 schrieb Daniel Vetter:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: [SNIP]
And is also the sole reason why I started looking into the ww_mutex cursor handling a while back (e.g. the initial version with the horrible macro hack).
But this is really really hard to get right. So my thinking for now is to push this series upstream to at least unblock my ongoing P2P work.
Hm, but at least the move_notify stuff and the locking nightmare around that feels rushed if we just push that. Otoh it's indeed getting painful, and we'll probably have another few rounds of headaches to sort this all out. What about a
config EXPERIMENTAL_DYNAMIC_DMA_BUF depends on BROKEN
Wrapped around the new ->move_notify hook, plus all relevant code?
Oh, that is a really good idea and trivial to do.
My initial thinking was to make all of this part of the core ww_mutex implementation, but then I quickly found that this won't work.
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
The next idea was to have it based on dma_resv objects, but as you also figured out you then need to drop the reference to the contended lock somehow...
So my current working plan was to use GEM object to avoid the callback...
I've heard noise that someone is looking into adding dynamic dma-buf support to stuff like rdma drivers. Because interconnects and big machines. Plus feels a bit awkward to mandate a gem library if you want to use dynamic dma-buf support in your driver. Hence why I think something around dma_resv (but with enough flexibility that it doesn't insist that the contending lock must be a dma_resv itself).
Ok, good to know. So to hell with the idea of using a GEM object.
But this also means that we can't do this with a single drop_ref() callback in the context because the context might contain different objects of all kinds...
[SNIP]
Oh cool, I was waiting for the upload. Will watch asap. btw slides somewhere?
Attached.
[SNIP] For the rather specific amdgpu case I could work around that by utilizing the HMM work to invalidate page tables on the fly, but that doesn't really help with memory management in general.
Yeah, so move_notify is maybe solvable with better hw and hmm,
You don't even need better hardware and HMM.
All you need to do is clever locking because in this case you will never export page tables to other devices.
but there's other scenarios where I think the cross-driver ww_mutex locking will be needed, for fundamental reasons. Scenario:
- a bunch of gpus in pcie slots, all in the same machine
- because pcie is not a nice interconnect (iirc you guys call yours
xgmi or something like that)
- working sets that are bigger than vram of a single gpu
- lots of buffer sharing ofc
Yeah, completely agree. The issue with the page tables is actually a rather specific use case.
1. So driver has an imported dma-buf, currently not mapped anywhere because it got thrown out (or first use).
2. Importer calls dma_buf_map_attachment.
3. Exporter realizes there's a nice xgmi link and p2p would be much better if that object is in vram.
4. Exporter does ttm_bo_validate or equivalent to get the bo into vram, including eviction and lots of locking.
5. In turn this might bite back to the importer through some move_notify of objects still mapped, but at the end of the lru.
So ->move_notify might not be the worst, eventually I think we'll need the full locking dance across drivers (or at least across drm_device instances, there might be internal upcasting going on so you get your buffers placed in the right vram and all that directly).
YES, EXACTLY! That's the reason why I'm working on this stuff and not try to get P2P/XGMI/etc.. upstream directly without it.
I mean using P2P without all this is certainly possible, but sooner or later your memory management will just fall apart.
So YES, I totally agree that we need some sort of GEM execution context or something like this to lock buffers on the fly as we try to make room for others.
So what's the plan? Merge current series (with the bikesheds address) under this CONFIG_EXPERIMENTAL_DYN_DMABUF and then see where we land from there? Trying to get all the pieces lined up out of tree feels like it's going to be too much :-/
At least I hoped for something like that.
Developing this out of tree and especially since I have this only as a background task turned out to be delaying things over and over again.
This way I can get it upstream (even when still under experimental flag) and start to convince our internal team/customers that we should work on investing time into this.
Thanks, Christian.
-Daniel
Regards, Christian.
Cheers, Daniel
/* When we get an error here it means that somebody
* else is holding the VM lock and updating page tables
* So we can just continue here.
*/
r = dma_resv_lock(resv, ticket);
if (r)
continue;
} else {
/* TODO: This is more problematic and we actually need
* to allow page tables updates without holding the
* lock.
*/
if (!dma_resv_trylock(resv))
continue;
}
r = amdgpu_vm_clear_freed(adev, vm, NULL);
if (!r)
r = amdgpu_vm_handle_moved(adev, vm);
if (r && r != -EBUSY)
DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
r);
dma_resv_unlock(resv);
- }
+}
static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = {
.move_notify = amdgpu_dma_buf_move_notify };
/**
@@ -489,7 +553,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev, return obj;
attach = dma_buf_dynamic_attach(dma_buf, dev->dev,
&amdgpu_dma_buf_attach_ops, NULL);
&amdgpu_dma_buf_attach_ops, obj); if (IS_ERR(attach)) { drm_gem_object_put(obj); return ERR_CAST(attach);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 8ae260822908..8c480c898b0d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -926,6 +926,9 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain, return 0; }
- if (bo->tbo.base.import_attach)
dma_buf_pin(bo->tbo.base.import_attach);
bo->flags |= AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS; /* force to pin into visible video ram */ if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS))
@@ -1009,6 +1012,9 @@ int amdgpu_bo_unpin(struct amdgpu_bo *bo)
amdgpu_bo_subtract_pin_size(bo);
- if (bo->tbo.base.import_attach)
dma_buf_unpin(bo->tbo.base.import_attach);
for (i = 0; i < bo->placement.num_placement; i++) { bo->placements[i].lpfn = 0; bo->placements[i].flags &= ~TTM_PL_FLAG_NO_EVICT;
-- 2.17.1
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf)
 	return ERR_PTR(ret);
 }
 
+/**
+ * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+	struct drm_gem_object *obj = attach->importer_priv;
+	struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	struct ttm_operation_ctx ctx = { false, false };
+	struct ttm_placement placement = {};
+	struct amdgpu_vm_bo_base *bo_base;
+	int r;
+
+	if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+		return;
+
+	r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+	if (r) {
+		DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
+		return;
+	}
+
+	for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+		struct amdgpu_vm *vm = bo_base->vm;
+		struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+		if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attached when they call this, and that they can cope with additional locks being taken (i.e. full -EDEADLK handling).
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock;
/* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock); };
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Just a short comment on this:
Neither the currently used wait-die nor the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided the mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock, taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
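A rough sketch of that pattern, assuming wound-wait semantics and that every callee drops its own locks before returning -EDEADLK (lock_buffers()/unlock_buffers() and struct my_buf are hypothetical driver helpers; only the ww_acquire_*() calls and reservation_ww_class exist today):

#include <linux/dma-resv.h>
#include <linux/ww_mutex.h>

struct my_buf;
int lock_buffers(struct my_buf **bufs, unsigned int count,
		 struct ww_acquire_ctx *ctx);
void unlock_buffers(struct my_buf **bufs, unsigned int count);

static void lock_all_wound_wait(struct my_buf **bufs, unsigned int count)
{
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &reservation_ww_class);
retry:
	if (lock_buffers(bufs, count, &ctx) == -EDEADLK) {
		/*
		 * No ww_mutex_lock_slow() on a specific contended lock:
		 * lock_buffers() already dropped everything it took, so
		 * just restart the whole transaction; with wound-wait
		 * (and no lock stealing) we end up sleeping on the
		 * contended lock the next time around anyway.
		 */
		goto retry;
	}
	ww_acquire_done(&ctx);

	/* ... do the actual work ... */

	unlock_buffers(bufs, count);
	ww_acquire_fini(&ctx);
}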
/Thomas
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) thomas_os@shipmail.org wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that the we re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{
- struct drm_gem_object *obj = attach->importer_priv;
- struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
- struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
- struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- struct ttm_operation_ctx ctx = { false, false };
- struct ttm_placement placement = {};
- struct amdgpu_vm_bo_base *bo_base;
- int r;
- if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
return;
- r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
- if (r) {
DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
return;
- }
- for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
struct amdgpu_vm *vm = bo_base->vm;
struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock; /* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock);
};
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Just a short comment on this:
Neither the currently used wait-die or the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided the mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock, taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff trickery across driver boundaries is going to be real painful.
What I'm kinda worried about is the debug/validation checks we're losing with this. The required backoff has this nice property that ww_mutex debug code can check that we've fully unwound everything when we should, that we've blocked on the right lock, and that we're restarting everything without keeling over. Without that I think we could end up with situations where a driver in the middle feels like handling the -EDEADLK, which might go well most of the time (the deadlock will probably be mostly within a given driver, not across). Right up to the point where someone creates a deadlock across drivers, and the lack of full rollback will be felt.
So not sure whether we can still keep all these debug/validation checks, or whether this is a step too far towards clever tricks.
But definitely a neat idea ... -Daniel
On 2/18/20 10:01 PM, Daniel Vetter wrote:
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) thomas_os@shipmail.org wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that the we re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{
- struct drm_gem_object *obj = attach->importer_priv;
- struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
- struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
- struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- struct ttm_operation_ctx ctx = { false, false };
- struct ttm_placement placement = {};
- struct amdgpu_vm_bo_base *bo_base;
- int r;
- if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
return;
- r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
- if (r) {
DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
return;
- }
- for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
struct amdgpu_vm *vm = bo_base->vm;
struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock; /* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock);
};
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Just a short comment on this:
Neither the currently used wait-die or the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided the mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock, taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff tricker across driver boundaries is going to be real painful.
What I'm kinda worried about is the debug/validation checks we're losing with this. The required backoff has this nice property that ww_mutex debug code can check that we've fully unwound everything when we should, that we've blocked on the right lock, and that we're restarting everything without keeling over. Without that I think we could end up with situations where a driver in the middle feels like handling the EDEADLCK, which might go well most of the times (the deadlock will probably be mostly within a given driver, not across). Right up to the point where someone creates a deadlock across drivers, and the lack of full rollback will be felt.
So not sure whether we can still keep all these debug/validation checks, or whether this is a step too far towards clever tricks.
I think we could definitely find a way to keep debugging to make sure everything is unwound before attempting to restart the locking transaction. But the debug check that we're restarting on the contended lock only really makes sense for wait-die (and we could easily keep it for wait-die). The lock returning -EDEADLK for wound-wait may actually not be the contending lock but an arbitrary lock that the wounded transaction attempts to take after it is wounded.
So in the end IMO this is a tradeoff between added (possibly severe) locking complexity into dma-buf and not being able to switch back to wait-die efficiently if we need / want to do that.
/Thomas
But definitely a neat idea ... -Daniel
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
On 2/18/20 10:01 PM, Daniel Vetter wrote:
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) thomas_os@shipmail.org wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that the we
re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{ + struct drm_gem_object *obj = attach->importer_priv; + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); + struct ttm_operation_ctx ctx = { false, false }; + struct ttm_placement placement = {}; + struct amdgpu_vm_bo_base *bo_base; + int r;
+ if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) + return;
+ r = ttm_bo_validate(&bo->tbo, &placement, &ctx); + if (r) { + DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r); + return; + }
+ for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { + struct amdgpu_vm *vm = bo_base->vm; + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+ if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the end, it
helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these sublocks (like
the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock;
/* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock); };
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Just a short comment on this:
Neither the currently used wait-die or the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided the mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock, taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff tricker across driver boundaries is going to be real painful.
What I'm kinda worried about is the debug/validation checks we're losing with this. The required backoff has this nice property that ww_mutex debug code can check that we've fully unwound everything when we should, that we've blocked on the right lock, and that we're restarting everything without keeling over. Without that I think we could end up with situations where a driver in the middle feels like handling the EDEADLCK, which might go well most of the times (the deadlock will probably be mostly within a given driver, not across). Right up to the point where someone creates a deadlock across drivers, and the lack of full rollback will be felt.
So not sure whether we can still keep all these debug/validation checks, or whether this is a step too far towards clever tricks.
I think we could definitely find a way to keep debugging to make sure everything is unwound before attempting to restart the locking transaction. But the debug check that we're restarting on the contended lock only really makes sense for wait-die, (and we could easily keep it for wait-die). The lock returning -EDEADLK for wound-wait may actually not be the contending lock but an arbitrary lock that the wounded transaction attempts to take after it is wounded.
So in the end IMO this is a tradeoff between added (possibly severe) locking complexity into dma-buf and not being able to switch back to wait-die efficiently if we need / want to do that.
/Thomas
And as a consequence an interface *could* be:
*) We introduce functions
void ww_acquire_relax(struct ww_acquire_ctx *ctx); int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx);
that can be used instead of ww_mutex_lock_slow() in the absence of a contending lock to avoid spinning on -EDEADLK. While trying to take the contending lock is probably the best choice there are various second best approaches that can be explored, for example waiting on the contending acquire to finish or in the wound-wait case, perhaps do nothing. These functions will also help us keep the debugging.
*) A function returning -EDEADLK to a caller *must* have already released its own locks.
*) move_notify() explicitly takes a struct ww_acquire_ctx * to make sure there is no ambiguity. (I think it would be valuable if we could do the same for ttm_bo_validate()).
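Putting those three points together, a rough sketch of how a driver-side locking loop could then look (illustrative only: ww_acquire_relax() is the proposed, not yet existing helper, and struct my_obj / my_validate() / my_unlock_all() are stand-ins for ttm_bo_validate()-style paths that take the ctx explicitly and release their own locks before returning -EDEADLK):

#include <linux/dma-resv.h>
#include <linux/ww_mutex.h>

struct my_obj;
int my_validate(struct my_obj *obj, struct ww_acquire_ctx *ctx);
void my_unlock_all(struct my_obj *obj);

static int my_lock_and_validate(struct my_obj *obj)
{
	struct ww_acquire_ctx ctx;
	int ret;

	ww_acquire_init(&ctx, &reservation_ww_class);
retry:
	ret = my_validate(obj, &ctx);
	if (ret == -EDEADLK) {
		/* second point above: my_validate() already dropped its locks */
		ww_acquire_relax(&ctx);	/* proposed helper: sleep instead of
					 * spinning even though the contended
					 * lock isn't known to us */
		goto retry;
	}
	if (ret == 0) {
		ww_acquire_done(&ctx);
		/* ... submit work, then ... */
		my_unlock_all(obj);
	}
	ww_acquire_fini(&ctx);
	return ret;
}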
/Thomas
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
On 2/18/20 10:01 PM, Daniel Vetter wrote:
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) thomas_os@shipmail.org wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
Implement the importer side of unpinned DMA-buf handling.
v2: update page tables immediately
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 ++++++++++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ 2 files changed, 71 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index 770baba621b3..48de7624d49c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) return ERR_PTR(ret); }
+/**
- amdgpu_dma_buf_move_notify - &attach.move_notify implementation
- @attach: the DMA-buf attachment
- Invalidate the DMA-buf attachment, making sure that
the we re-create the
- mapping before the next use.
- */
+static void +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) +{ + struct drm_gem_object *obj = attach->importer_priv; + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); + struct ttm_operation_ctx ctx = { false, false }; + struct ttm_placement placement = {}; + struct amdgpu_vm_bo_base *bo_base; + int r;
+ if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) + return;
+ r = ttm_bo_validate(&bo->tbo, &placement, &ctx); + if (r) { + DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r); + return; + }
+ for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { + struct amdgpu_vm *vm = bo_base->vm; + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+ if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the
end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these
sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock;
/* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock); };
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Just a short comment on this:
Neither the currently used wait-die or the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided the mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock, taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff tricker across driver boundaries is going to be real painful.
What I'm kinda worried about is the debug/validation checks we're losing with this. The required backoff has this nice property that ww_mutex debug code can check that we've fully unwound everything when we should, that we've blocked on the right lock, and that we're restarting everything without keeling over. Without that I think we could end up with situations where a driver in the middle feels like handling the EDEADLCK, which might go well most of the times (the deadlock will probably be mostly within a given driver, not across). Right up to the point where someone creates a deadlock across drivers, and the lack of full rollback will be felt.
So not sure whether we can still keep all these debug/validation checks, or whether this is a step too far towards clever tricks.
I think we could definitely find a way to keep debugging to make sure everything is unwound before attempting to restart the locking transaction. But the debug check that we're restarting on the contended lock only really makes sense for wait-die, (and we could easily keep it for wait-die). The lock returning -EDEADLK for wound-wait may actually not be the contending lock but an arbitrary lock that the wounded transaction attempts to take after it is wounded.
So in the end IMO this is a tradeoff between added (possibly severe) locking complexity into dma-buf and not being able to switch back to wait-die efficiently if we need / want to do that.
/Thomas
And as a consequence an interface *could* be:
*) We introduce functions
void ww_acquire_relax(struct ww_acquire_ctx *ctx); int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx);
that can be used instead of ww_mutex_lock_slow() in the absence of a contending lock to avoid spinning on -EDEADLK. While trying to take the contending lock is probably the best choice there are various second best approaches that can be explored, for example waiting on the contending acquire to finish or in the wound-wait case, perhaps do nothing. These functions will also help us keep the debugging.
Hm ... I guess this could work. Trouble is, it only gets rid of the slowpath locking book-keeping headaches, we still have quite a few others.
*) A function returning -EDEADLK to a caller *must* have already released its own locks.
So this ties to another question, as in should these callbacks have to drop the locks they acquire (much simpler code) or not (less thrashing: if we drop locks we might end up in a situation where threads thrash around instead of realizing quicker that they're actually deadlocking and one of them should stop and back off).
But keeping locks locked means massive amounts of book-keeping in the dma_resv layer, so it all goes downhill from there.
*) move_notify() explicitly takes a struct ww_acquire_ctx * to make sure there is no ambiguity. (I think it would be valuable if we could do the same for ttm_bo_validate()).
Yeah I think more explicit locking ctx would be really good no matter what. Implicitly fishing the acquire_ctx out of the lock for the object you're called on is kinda nasty. -Daniel
On 2/20/20 7:04 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
On 2/18/20 10:01 PM, Daniel Vetter wrote:
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) thomas_os@shipmail.org wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote:
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: > Implement the importer side of unpinned DMA-buf handling. > > v2: update page tables immediately > > Signed-off-by: Christian König christian.koenig@amd.com > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 > ++++++++++++++++++++- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ > 2 files changed, 71 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > index 770baba621b3..48de7624d49c 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct > drm_device *dev, struct dma_buf *dma_buf) > return ERR_PTR(ret); > } > > +/** > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation > + * > + * @attach: the DMA-buf attachment > + * > + * Invalidate the DMA-buf attachment, making sure that > the we re-create the > + * mapping before the next use. > + */ > +static void > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) > +{ > + struct drm_gem_object *obj = attach->importer_priv; > + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); > + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); > + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > + struct ttm_operation_ctx ctx = { false, false }; > + struct ttm_placement placement = {}; > + struct amdgpu_vm_bo_base *bo_base; > + int r; > + > + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) > + return; > + > + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); > + if (r) { > + DRM_ERROR("Failed to invalidate DMA-buf > import (%d))\n", r); > + return; > + } > + > + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { > + struct amdgpu_vm *vm = bo_base->vm; > + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; > + > + if (ticket) { Yeah so this is kinda why I've been a total pain about the exact semantics of the move_notify hook. I think we should flat-out require that importers _always_ have a ticket attach when they call this, and that they can cope with additional locks being taken (i.e. full EDEADLCK) handling.
Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to the dma_resv object, which we then can take #ifdef CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket).
Now the real disaster is how we handle deadlocks. Two issues:
- Ideally we'd keep any lock we've taken locked until the
end, it helps needless backoffs. I've played around a bit with that but not even poc level, just an idea:
https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27...
Idea is essentially to track a list of objects we had to lock as part of the ttm_bo_validate of the main object.
- Second one is if we get a EDEADLCK on one of these
sublocks (like the one here). We need to pass that up the entire callchain, including a temporary reference (we have to drop locks to do the ww_mutex_lock_slow call), and need a custom callback to drop that temporary reference (since that's all driver specific, might even be internal ww_mutex and not anything remotely looking like a normal dma_buf). This probably needs the exec util helpers from ttm, but at the dma_resv level, so that we can do something like this:
struct dma_resv_ticket { struct ww_acquire_ctx base;
/* can be set by anyone (including other drivers) that got hold of * this ticket and had to acquire some new lock. This lock might * protect anything, including driver-internal stuff, and isn't * required to be a dma_buf or even just a dma_resv. */ struct ww_mutex *contended_lock;
/* callback which the driver (which might be a dma-buf exporter * and not matching the driver that started this locking ticket) * sets together with @contended_lock, for the main driver to drop * when it calls dma_resv_unlock on the contended_lock. */ void (drop_ref*)(struct ww_mutex *contended_lock); };
This is all supremely nasty (also ttm_bo_validate would need to be improved to handle these sublocks and random new objects that could force a ww_mutex_lock_slow).
Just a short comment on this:
Neither the currently used wait-die or the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided the mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock, taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff tricker across driver boundaries is going to be real painful.
What I'm kinda worried about is the debug/validation checks we're losing with this. The required backoff has this nice property that ww_mutex debug code can check that we've fully unwound everything when we should, that we've blocked on the right lock, and that we're restarting everything without keeling over. Without that I think we could end up with situations where a driver in the middle feels like handling the EDEADLCK, which might go well most of the times (the deadlock will probably be mostly within a given driver, not across). Right up to the point where someone creates a deadlock across drivers, and the lack of full rollback will be felt.
So not sure whether we can still keep all these debug/validation checks, or whether this is a step too far towards clever tricks.
I think we could definitely find a way to keep debugging to make sure everything is unwound before attempting to restart the locking transaction. But the debug check that we're restarting on the contended lock only really makes sense for wait-die, (and we could easily keep it for wait-die). The lock returning -EDEADLK for wound-wait may actually not be the contending lock but an arbitrary lock that the wounded transaction attempts to take after it is wounded.
So in the end IMO this is a tradeoff between added (possibly severe) locking complexity into dma-buf and not being able to switch back to wait-die efficiently if we need / want to do that.
/Thomas
And as a consequence an interface *could* be:
*) We introduce functions
void ww_acquire_relax(struct ww_acquire_ctx *ctx); int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx);
that can be used instead of ww_mutex_lock_slow() in the absence of a contending lock to avoid spinning on -EDEADLK. While trying to take the contending lock is probably the best choice there are various second best approaches that can be explored, for example waiting on the contending acquire to finish or in the wound-wait case, perhaps do nothing. These functions will also help us keep the debugging.
Hm ... I guess this could work. Trouble is, it only gets rid of the slowpath locking book-keeping headaches, we still have quite a few others.
*) A function returning -EDEADLK to a caller *must* have already released its own locks.
So this ties to another question, as in should these callbacks have to drops the locks thei acquire (much simpler code) or not (less thrashing, if we drop locks we might end up in a situation where threads thrash around instead of realizing quicker that they're actually deadlocking and one of them should stop and back off).
Hmm.. Could you describe such a thrashing case with an example?
/Thomas
On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 7:04 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
On 2/18/20 10:01 PM, Daniel Vetter wrote:
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware) thomas_os@shipmail.org wrote:
On 2/17/20 6:55 PM, Daniel Vetter wrote: > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote: > > Implement the importer side of unpinned DMA-buf handling. > > > > v2: update page tables immediately > > > > Signed-off-by: Christian König christian.koenig@amd.com > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 > > ++++++++++++++++++++- > > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 ++ > > 2 files changed, 71 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > index 770baba621b3..48de7624d49c 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c > > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct > > drm_device *dev, struct dma_buf *dma_buf) > > return ERR_PTR(ret); > > } > > > > +/** > > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation > > + * > > + * @attach: the DMA-buf attachment > > + * > > + * Invalidate the DMA-buf attachment, making sure that > > the we re-create the > > + * mapping before the next use. > > + */ > > +static void > > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach) > > +{ > > + struct drm_gem_object *obj = attach->importer_priv; > > + struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv); > > + struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); > > + struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev); > > + struct ttm_operation_ctx ctx = { false, false }; > > + struct ttm_placement placement = {}; > > + struct amdgpu_vm_bo_base *bo_base; > > + int r; > > + > > + if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM) > > + return; > > + > > + r = ttm_bo_validate(&bo->tbo, &placement, &ctx); > > + if (r) { > > + DRM_ERROR("Failed to invalidate DMA-buf > > import (%d))\n", r); > > + return; > > + } > > + > > + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) { > > + struct amdgpu_vm *vm = bo_base->vm; > > + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv; > > + > > + if (ticket) { > Yeah so this is kinda why I've been a total pain about the > exact semantics > of the move_notify hook. I think we should flat-out require > that importers > _always_ have a ticket attach when they call this, and that > they can cope > with additional locks being taken (i.e. full EDEADLCK) handling. > > Simplest way to force that contract is to add a dummy 2nd > ww_mutex lock to > the dma_resv object, which we then can take #ifdef > CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus mabye a WARN_ON(!ticket). > > Now the real disaster is how we handle deadlocks. Two issues: > > - Ideally we'd keep any lock we've taken locked until the > end, it helps > needless backoffs. I've played around a bit with that > but not even poc > level, just an idea: > > https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27... > > > Idea is essentially to track a list of objects we had to > lock as part of > the ttm_bo_validate of the main object. > > - Second one is if we get a EDEADLCK on one of these > sublocks (like the > one here). We need to pass that up the entire callchain, > including a > temporary reference (we have to drop locks to do the > ww_mutex_lock_slow > call), and need a custom callback to drop that temporary reference > (since that's all driver specific, might even be > internal ww_mutex and > not anything remotely looking like a normal dma_buf). 
> This probably > needs the exec util helpers from ttm, but at the > dma_resv level, so that > we can do something like this: > > struct dma_resv_ticket { > struct ww_acquire_ctx base; > > /* can be set by anyone (including other drivers) > that got hold of > * this ticket and had to acquire some new lock. This > lock might > * protect anything, including driver-internal stuff, and isn't > * required to be a dma_buf or even just a dma_resv. */ > struct ww_mutex *contended_lock; > > /* callback which the driver (which might be a dma-buf exporter > * and not matching the driver that started this > locking ticket) > * sets together with @contended_lock, for the main > driver to drop > * when it calls dma_resv_unlock on the contended_lock. */ > void (drop_ref*)(struct ww_mutex *contended_lock); > }; > > This is all supremely nasty (also ttm_bo_validate would need to be > improved to handle these sublocks and random new objects > that could force > a ww_mutex_lock_slow). > Just a short comment on this:
Neither the currently used wait-die nor the wound-wait algorithm *strictly* requires a slow lock on the contended lock. For wait-die it's just very convenient since it makes us sleep instead of spinning with -EDEADLK on the contended lock. For wound-wait IIRC one could just immediately restart the whole locking transaction after an -EDEADLK, and the transaction would automatically end up waiting on the contended lock, provided mutex lock stealing is not allowed. There is however a possibility that the transaction will be wounded again on another lock taken before the contended lock, but I think there are ways to improve the wound-wait algorithm to reduce that probability.
So in short, choosing the wound-wait algorithm instead of wait-die and perhaps modifying the ww mutex code somewhat would probably help passing an -EDEADLK up the call chain without requiring passing the contended lock, as long as each locker releases its own locks when receiving an -EDEADLK.
Hm this is kinda tempting, since rolling out the full backoff trickery across driver boundaries is going to be really painful.
What I'm kinda worried about is the debug/validation checks we're losing with this. The required backoff has this nice property that ww_mutex debug code can check that we've fully unwound everything when we should, that we've blocked on the right lock, and that we're restarting everything without keeling over. Without that I think we could end up with situations where a driver in the middle feels like handling the EDEADLK, which might go well most of the time (the deadlock will probably be mostly within a given driver, not across). Right up to the point where someone creates a deadlock across drivers, and the lack of full rollback will be felt.
So not sure whether we can still keep all these debug/validation checks, or whether this is a step too far towards clever tricks.
I think we could definitely find a way to keep debugging to make sure everything is unwound before attempting to restart the locking transaction. But the debug check that we're restarting on the contended lock only really makes sense for wait-die (and we could easily keep it for wait-die). The lock returning -EDEADLK for wound-wait may actually not be the contending lock but an arbitrary lock that the wounded transaction attempts to take after it is wounded.
So in the end IMO this is a tradeoff between added (possibly severe) locking complexity into dma-buf and not being able to switch back to wait-die efficiently if we need / want to do that.
/Thomas
And as a consequence an interface *could* be:
*) We introduce functions
void ww_acquire_relax(struct ww_acquire_ctx *ctx);
int ww_acquire_relax_interruptible(struct ww_acquire_ctx *ctx);
that can be used instead of ww_mutex_lock_slow() in the absence of a contending lock to avoid spinning on -EDEADLK. While trying to take the contending lock is probably the best choice, there are various second-best approaches that can be explored, for example waiting on the contending acquire to finish or, in the wound-wait case, perhaps doing nothing. These functions will also help us keep the debugging.
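For illustration, a minimal sketch of how a locking loop could use such a relax function instead of ww_mutex_lock_slow(). Note that ww_acquire_relax() is only the interface proposed above and does not exist in the kernel; lock_all() and sketch_ww_class are made-up names, and the rest is the stock ww_mutex API:

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(sketch_ww_class);

/* Proposed above, not an existing kernel function. */
void ww_acquire_relax(struct ww_acquire_ctx *ctx);

/* Take all locks in @locks; on success they stay held and the caller
 * unlocks them and calls ww_acquire_fini() when done. */
static int lock_all(struct ww_mutex **locks, unsigned int count,
		    struct ww_acquire_ctx *ctx)
{
	unsigned int i;
	int r;

	ww_acquire_init(ctx, &sketch_ww_class);
retry:
	for (i = 0; i < count; i++) {
		r = ww_mutex_lock(locks[i], ctx);
		if (r == -EDEADLK) {
			/* Back off completely ... */
			while (i--)
				ww_mutex_unlock(locks[i]);
			/*
			 * ... and instead of ww_mutex_lock_slow() on the
			 * contended lock, which we would otherwise have to
			 * pass up the call chain, just wait until a retry
			 * of the whole transaction is likely to succeed.
			 */
			ww_acquire_relax(ctx);
			goto retry;
		}
	}
	ww_acquire_done(ctx);
	return 0;
}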
Hm ... I guess this could work. Trouble is, it only gets rid of the slowpath locking book-keeping headaches, we still have quite a few others.
*) A function returning -EDEADLK to a caller *must* have already released its own locks.
So this ties to another question, as in should these callbacks have to drop the locks they acquire (much simpler code) or not (less thrashing; if we drop locks we might end up in a situation where threads thrash around instead of realizing quicker that they're actually deadlocking and one of them should stop and back off).
Hmm.. Could you describe such a thrashing case with an example?
Ignoring cross device fun and all that, just a simplified example of why holding onto locks you've acquired for eviction is useful, at least in a slow path.
- one thread trying to do an execbuf with a huge bo
vs.
- an entire pile of threads that try to do execbuf with just a few small bos
First thread is in the eviction loop, selects a bo, wins against all the other threads since it's been doing this forever already, gets the bo moved out, unlocks.
Since it's competing against lots of other threads with small bo, it'll have to do that a lot of times. Often enough to create a contiguous hole. If you have a smarter allocator that tries to create that hole more actively, just assume that the single huge bo is a substantial part of total vram.
The other threads will be quicker in cramming new stuff in, even if they occasionally lose the ww dance against the single thread. So the big thread livelocks.
If otoh the big thread held onto all the locks, eventually it would have the entire vram locked, and every other thread is guaranteed to lose against it in the ww dance and queue up behind it. And it could finally put its huge bo into vram and execute.
Vary the example for multi-gpu and more realism, but that's roughly it.
Aside, a lot of the stuff Christian has been doing in ttm is to improve the chances that the competing threads will hit one of the locked objects of the big thread, and at least back off a bit. That's at least my understanding of what's been happening. -Daniel
On 2/20/20 9:08 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 7:04 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
On 2/18/20 10:01 PM, Daniel Vetter wrote:
[SNIP]
Ignoring cross device fun and all that, just a simplified example of why holding onto locks you've acquired for eviction is useful, at least in a slow path.
- one thread trying to do an execbuf with a huge bo
vs.
- an entire pile of thread that try to do execbuf with just a few small bo
First thread is in the eviction loop, selects a bo, wins against all the other thread since it's been doing this forever already, gets the bo moved out, unlocks.
Since it's competing against lots of other threads with small bo, it'll have to do that a lot of times. Often enough to create a contiguous hole. If you have a smarter allocator that tries to create that hole more actively, just assume that the single huge bo is a substantial part of total vram.
The other threads will be quicker in cramming new stuff in, even if they occasionally lose the ww dance against the single thread. So the big thread livelocks.
If otoh the big thread would keep onto all the locks, eventually it have the entire vram locked, and every other thread is guaranteed to lose against it in the ww dance and queue up behind. And it could finally but its huge bo into vram and execute.
Hmm, yes this indeed explains why it's beneficial in some cases to keep a number of locks held across certain operations, but I still fail to see why we would like *all* locks held across the entire transaction? In the above case I'd envision us ending up with something like:
int validate(ctx, bo)
{
	for_each_suitable_bo_to_evict(ebo) {
		r = lock(ctx, ebo);
		if (r == -EDEADLK)
			goto out_unlock;

		r = move_notify(ctx, ebo); /* locks and unlocks GPU VM bo. */
		if (r == -EDEADLK)
			goto out_unlock;

		evict();
	}

	place_bo(bo); /* Repeat until success. */
	return 0;

out_unlock:
	for_each_locked_bo(ebo)
		unlock(ctx, ebo);
	return r;
}

int command_submission()
{
	acquire_init(ctx);

restart:
	for_each_bo_in_cs(bo) {
		r = lock(ctx, bo);
		if (r == -EDEADLK)
			goto out_unreserve;
	}

	for_each_bo_in_cs(bo) {
		r = validate(ctx, bo);
		if (r == -EDEADLK)
			goto out_unreserve;
	}

	cs();

	for_each_bo_in_cs(bo)
		unlock(ctx, bo);

	acquire_fini(ctx);
	return 0;

out_unreserve:
	for_each_locked_bo()
		unlock(ctx, bo);

	acquire_relax();
	goto restart;
}
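For comparison, the lock(ctx, bo) loop above with today's dma_resv API and the full wait-die backoff spelled out would look roughly like the following. This is essentially the pattern drm_gem_lock_reservations() already implements; the function name and parameters here are simplified, and the validation/eviction step is omitted:

#include <linux/dma-resv.h>
#include <drm/drm_gem.h>

static int cs_lock_objects(struct drm_gem_object **objs, unsigned int count,
			   struct ww_acquire_ctx *ctx)
{
	int contended = -1;
	unsigned int i;
	int r;

	ww_acquire_init(ctx, &reservation_ww_class);
retry:
	if (contended != -1) {
		/* The contended object from the previous round has to be
		 * remembered and locked first with the slowpath variant. */
		r = dma_resv_lock_slow_interruptible(objs[contended]->resv,
						     ctx);
		if (r) {
			ww_acquire_fini(ctx);
			return r;
		}
	}

	for (i = 0; i < count; i++) {
		if (i == contended)
			continue;

		r = dma_resv_lock_interruptible(objs[i]->resv, ctx);
		if (r) {
			unsigned int j;

			/* Unwind everything locked so far ... */
			for (j = 0; j < i; j++)
				dma_resv_unlock(objs[j]->resv);
			if (contended > (int)i)
				dma_resv_unlock(objs[contended]->resv);

			if (r == -EDEADLK) {
				/* ... remember the loser and retry. */
				contended = i;
				goto retry;
			}

			ww_acquire_fini(ctx);
			return r;
		}
	}

	ww_acquire_done(ctx);
	return 0;	/* caller unlocks all and calls ww_acquire_fini() */
}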
Vary example for multi-gpu and more realism, but that's roughly it.
Aside, a lot of the stuff Christian has been doing in ttm is to improve the chances that the competing threads will hit one of the locked objects of the big thread, and at least back off a bit. That's at least my understanding of what's been happening. -Daniel
OK, understood. For vmwgfx the crude, simplistic idea to avoid that situation has been to have an rwsem around command submission: when the thread with the big bo has run through a full LRU worth of eviction without succeeding, it would get annoyed and take the rwsem in write mode, blocking competing threads. But that would of course never work in a dma-buf setting, and IIRC the implementation is not complete either...
/Thomas
On Thu, Feb 20, 2020 at 11:51:07PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 9:08 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 7:04 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
[SNIP]
Hmm, yes this indeed explains why it's beneficial in some cases to keep a number of locks held across certain operations, but I still fail to see why we would like *all* locks held across the entire transaction? In the above case I'd envision us ending up with something like:
int validate(ctx, bo) {
for_each_suitable_bo_to_evict(ebo) { r = lock(ctx, ebo); if (r == EDEADLK) goto out_unlock;
r = move_notify(ctx, ebo);// locks and unlocks GPU VM bo.
Yeah I think for move_notify the "keep the locks" thing is probably not what we want. That's more for when you have to evict stuff and similar things like that (which hopefully no driver needs to do in their ->move_notify). But for placing buffers we kinda want to keep things, and that's also a cross-driver thing (eventually at least I think).
if (r == EDEADLK) goto out_unlock; evict(); }
place_bo(bo); //Repeat until success.
out_unlock: for_each_locked_bo(ebo) unlock(ctx, ebo);
So that this unlock loop would need to be moved up to higher levels perhaps. This here would solve the example of a single big bo, but if you have multiple then you still end up with a lot of thrashing until the younger thread realizes that it needs to back off.
}
void command_submission() { acquire_init(ctx);
restart: for_each_bo_in_cs(bo) { r = lock(ctx, bo); if (r == -EDEADLK) goto out_unreserve; }
for_each_bo_in_cs(bo) { r = validate(ctx, bo); if (r == -EDEADLK) goto out_unreserve; };
cs();
for_each_bo_in_cs(bo) unlock(ctx, bo);
acquire_fini(ctx); return 0;
out_unreserve: for_each_locked_bo() unlock(ctx, bo);
acquire_relax(); goto restart; }
Vary example for multi-gpu and more realism, but that's roughly it.
Aside, a lot of the stuff Christian has been doing in ttm is to improve the chances that the competing threads will hit one of the locked objects of the big thread, and at least back off a bit. That's at least my understanding of what's been happening. -Daniel
OK unserstood. For vmwgfx the crude simplistic idea to avoid that situation has been to have an rwsem around command submission: When the thread with the big bo has run a full LRU worth of eviction without succeeding it would get annoyed and take the rwsem in write mode, blocking competing threads. But that would of course never work in a dma-buf setting, and IIRC the implementation is not complete either....
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing. -Daniel
On 2/21/20 6:12 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 11:51:07PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 9:08 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
On 2/20/20 7:04 PM, Daniel Vetter wrote:
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
[SNIP]
So that this unlock loop would need to be moved up to higher levels perhaps. This here would solve the example of a single big bo, but if you have multiple then you still end up with a lot of thrashing until the younger thread realizes that it needs to back off.
Ah, so we hold on to resources we don't really need in order to block threads that are likely to be thrashing?
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing.
OK.
/Thomas
-Daniel
Am 21.02.20 um 18:12 schrieb Daniel Vetter:
[SNIP] Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing.
Yes exactly. A really big problem in TTM is currently that we drop the lock after evicting BOs because they tend to move in again directly after that.
From practice I can also confirm that there is exactly zero benefit from dropping locks early and reacquiring them, for example for the VM page tables. That just makes it more likely that somebody needs to roll back, and this is what we need to avoid in the first place.
Contention on BO locks during command submission is perfectly fine as long as this is as lightweight as possible while we don't have thrashing. When we have thrashing, multi-submission performance is best achieved by just favoring a single process to finish its business and blocking everybody else.
Because of this I would actually vote for forbidding the release of individual ww_mutex() locks within a context.
Regards, Christian.
-Daniel
On 2/23/20 4:45 PM, Christian König wrote:
Am 21.02.20 um 18:12 schrieb Daniel Vetter:
[SNIP] Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing.
Yes exactly. A really big problem in TTM is currently that we drop the lock after evicting BOs because they tend to move in again directly after that.
From practice I can also confirm that there is exactly zero benefit from dropping locks early and reacquire them for example for the VM page tables. That's just makes it more likely that somebody needs to roll back and this is what we need to avoid in the first place.
If you have a benchmarking setup available it would be very interesting for future reference to see how changing from WD to WW mutexes affects the rollback frequency. WW is known to cause rollbacks much less frequently, but there is more work associated with each rollback.
Contention on BO locks during command submission is perfectly fine as long as this is as lightweight as possible while we don't have trashing. When we have trashing multi submission performance is best archived to just favor a single process to finish its business and block everybody else.
Hmm. Sounds like we need a per-manager ww_rwsem protecting manager allocation, taken in write mode when there's thrashing and in read mode otherwise. That would limit the amount of "unnecessary" locks we'd have to keep and reduce unwanted side-effects (see below):
Because of this I would actually vote for forbidding to release individual ww_mutex() locks in a context.
Yes, I see the problem.
But my first reaction is that this might have undesirable side-effects. Let's say somebody wanted to swap the evicted BOs out? Or cpu writes to them cause faults, which might also block the mmap_sem, which in turn blocks hugepaged?
Still it's a fairly simple solution to a problem that seems otherwise hard to solve efficiently.
Thanks, Thomas
Regards, Christian.
-Daniel
Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
On 2/23/20 4:45 PM, Christian König wrote:
Am 21.02.20 um 18:12 schrieb Daniel Vetter:
[SNIP] Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing.
Yes exactly. A really big problem in TTM is currently that we drop the lock after evicting BOs because they tend to move in again directly after that.
From practice I can also confirm that there is exactly zero benefit from dropping locks early and reacquire them for example for the VM page tables. That's just makes it more likely that somebody needs to roll back and this is what we need to avoid in the first place.
If you have a benchmarking setup available it would be very interesting for future reference to see how changing from WD to WW mutexes affects the roll back frequency. WW is known to cause rollbacks much less frequently but there is more work associated with each rollback.
Not off hand. To be honest I still have a hard time getting a grip on the difference between WD and WW from the algorithm point of view, so I can't judge that difference at all.
Contention on BO locks during command submission is perfectly fine as long as this is as lightweight as possible while we don't have trashing. When we have trashing multi submission performance is best archived to just favor a single process to finish its business and block everybody else.
Hmm. Sounds like we need a per-manager ww_rwsem protecting manager allocation, taken in write-mode then there's thrashing. In read-mode otherwise. That would limit the amount of "unnecessary" locks we'd have to keep and reduce unwanted side-effects, (see below):
Well, per-manager (you mean per domain here, don't you?) doesn't sound all that useful because we rarely use only one domain, but I've actually been questioning for quite a while whether the per-BO lock scheme was the right approach.
See from the performance aspect the closest to ideal solution I can think of would be a ww_rwsem per user of a resource.
In other words we don't lock BOs, but instead a list of all their users and when you want to evict a BO you need to walk that list and inform all users that the BO will be moving.
During command submission you then have the fast path, which rather just grabs the read side of the user lock and checks if all BOs are still in the expected place.
If some BOs were evicted you back off and start the slow path, e.g. maybe even copy additional data from userspace then grab the write side of the lock etc.. etc...
That approach is similar to what we use in amdgpu with the per-VM BOs, but goes a step further. Problem is that we are so used to per BO locks in the kernel that this is probably not doable any more.
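A very rough sketch of that fast-path/slow-path split, using a plain rw_semaphore per user since a ww_rwsem doesn't exist today. All structures and helper names here are made up purely for illustration:

#include <linux/types.h>
#include <linux/rwsem.h>

struct sketch_vm {
	struct rw_semaphore lock;	/* per-user ("VM") lock */
	/* ... per-VM BO list, page tables, ... */
};

/* Made-up helpers, only to show the shape of the two paths. */
bool sketch_all_bos_resident(struct sketch_vm *vm);
int sketch_revalidate_bos(struct sketch_vm *vm);
int sketch_submit_job(struct sketch_vm *vm);

/* Fast path: nothing was evicted, only the read side is taken.  An evictor
 * would presumably take the write side of every affected user's lock to
 * inform them that a BO is moving. */
static int sketch_submit_fast(struct sketch_vm *vm)
{
	int r = -EAGAIN;

	down_read(&vm->lock);
	if (sketch_all_bos_resident(vm))
		r = sketch_submit_job(vm);
	up_read(&vm->lock);

	return r;	/* -EAGAIN: some BO moved, take the slow path */
}

/* Slow path: take the write side, move everything back in, then submit. */
static int sketch_submit_slow(struct sketch_vm *vm)
{
	int r;

	down_write(&vm->lock);
	r = sketch_revalidate_bos(vm);
	if (!r)
		r = sketch_submit_job(vm);
	up_write(&vm->lock);

	return r;
}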
Because of this I would actually vote for forbidding to release individual ww_mutex() locks in a context.
Yes, I see the problem.
But my first reaction is that this might have undersirable side-effects. Let's say somebody wanted to swap the evicted BOs out?
Please explain further, I off hand don't see the problem here.
In general I actually wanted to re-work TTM in a way that BOs in the SYSTEM/SWAPABLE domain are always backed by a shmem file instead of the struct page array we currently have.
Or cpu-writes to them causing faults, that might also block the mm_sem, which in turn blocks hugepaged?
Mhm, I also only have a higher-level view of how hugepaged works, so why does it grab the mmap_sem on the write side?
Thanks, Christian.
Still it's a fairly simple solution to a problem that seems otherwise hard to solve efficiently.
Thanks, Thomas
Regards, Christian.
-Daniel
On 2/24/20 7:46 PM, Christian König wrote:
Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
On 2/23/20 4:45 PM, Christian König wrote:
Am 21.02.20 um 18:12 schrieb Daniel Vetter:
[SNIP] Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing.
Yes exactly. A really big problem in TTM is currently that we drop the lock after evicting BOs because they tend to move in again directly after that.
From practice I can also confirm that there is exactly zero benefit from dropping locks early and reacquire them for example for the VM page tables. That's just makes it more likely that somebody needs to roll back and this is what we need to avoid in the first place.
If you have a benchmarking setup available it would be very interesting for future reference to see how changing from WD to WW mutexes affects the roll back frequency. WW is known to cause rollbacks much less frequently but there is more work associated with each rollback.
Not of hand. To be honest I still have a hard time to get a grip on the difference between WD and WW from the algorithm point of view. So I can't judge that difference at all.
OK. I don't think a detailed understanding of the algorithms is strictly necessary, though. If we had a good testcase we'd just have to change DEFINE_WD_CLASS in dma-resv.c to DEFINE_WW_CLASS and benchmark the relevant figures.
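If I read dma-resv.c right, the experiment would be a one-line change along these lines (a sketch, assuming the class definition still looks like this):

#include <linux/ww_mutex.h>

/* drivers/dma-buf/dma-resv.c currently defines the class with
 *	DEFINE_WD_CLASS(reservation_ww_class);
 * and the wound-wait experiment would switch that to: */
DEFINE_WW_CLASS(reservation_ww_class);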
Contention on BO locks during command submission is perfectly fine as long as this is as lightweight as possible while we don't have trashing. When we have trashing multi submission performance is best archived to just favor a single process to finish its business and block everybody else.
Hmm. Sounds like we need a per-manager ww_rwsem protecting manager allocation, taken in write-mode then there's thrashing. In read-mode otherwise. That would limit the amount of "unnecessary" locks we'd have to keep and reduce unwanted side-effects, (see below):
Well per-manager (you mean per domain here don't you?)
yes.
doesn't sound like that useful because we rarely use only one domain,
Well, the difference compared to keeping all locks would boil down to: "Instead of keeping the locks of all bos we evict from thrashing domains, keep the locks of all thrashing domains in write mode". This means that for domains that are not thrashing, we'd just keep read locks.
but I'm actually questioning for quite a while if the per BO lock scheme was the right approach.
See from the performance aspect the closest to ideal solution I can think of would be a ww_rwsem per user of a resource.
In other words we don't lock BOs, but instead a list of all their users and when you want to evict a BO you need to walk that list and inform all users that the BO will be moving.
During command submission you then have the fast path which rather just grabs the read side of the user lock and check if all BOs are still in the expected place.
If some BOs were evicted you back off and start the slow path, e.g. maybe even copy additional data from userspace then grab the write side of the lock etc.. etc...
That approach is similar to what we use in amdgpu with the per-VM BOs, but goes a step further. Problem is that we are so used to per BO locks in the kernel that this is probably not doable any more.
I think we need some sort of per-bo lock to protect bo metadata. But yes, relying solely on them to resolve other resource (domain) contention may not be (or rather, probably isn't) the right choice.
Because of this I would actually vote for forbidding to release individual ww_mutex() locks in a context.
Yes, I see the problem.
But my first reaction is that this might have undersirable side-effects. Let's say somebody wanted to swap the evicted BOs out?
Please explain further, I off hand don't see the problem here.
Let's say thread A evicts a lot of thread B's bos and keeps the locks of those bos for a prolonged time. Then thread C needs memory and wants to swap out thread B's bos. It can't, or at least not for a certain delay, because thread A unnecessarily holds the locks.
In general I actually wanted to re-work TTM in a way that BOs in the SYSTEM/SWAPABLE domain are always backed by a shmem file instead of the struct page array we currently have.
That would probably work well if there are no SYSTEM+write-combined users anymore. Typically in the old AGP days, you wouldn't change caching mode when evicting from write-combine AGP to SYSTEM because of the dead slow wbinvd() operation.
Or cpu-writes to them causing faults, that might also block the mm_sem, which in turn blocks hugepaged?
Mhm, I also only have a higher level view how hugepaged works so why does it grabs the mm_sem on the write side?
If I understand it correctly, it's needed when collapsing PMD directories to huge PMD pages. But this was merely an example. For this particular case the RETRY mechanism in the TTM fault handler we've discussed before will try reasonably hard to release the mmap_sem when sleeping on a bo lock.
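For reference, the retry dance mentioned here looks roughly like this in the TTM fault path (simplified from what ttm_bo_vm_reserve() does, modulo details): if the bo lock can't be taken right away, drop the mmap_sem, sleep on the bo lock outside of it, and let the mm core retry the fault.

#include <linux/mm.h>
#include <linux/dma-resv.h>
#include <drm/ttm/ttm_bo_api.h>

static vm_fault_t sketch_bo_vm_reserve(struct ttm_buffer_object *bo,
				       struct vm_fault *vmf)
{
	if (likely(dma_resv_trylock(bo->base.resv)))
		return 0;	/* got the bo lock without blocking */

	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
		if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
			ttm_bo_get(bo);
			up_read(&vmf->vma->vm_mm->mmap_sem);
			if (!dma_resv_lock_interruptible(bo->base.resv, NULL))
				dma_resv_unlock(bo->base.resv);
			ttm_bo_put(bo);
		}
		/* mm core re-takes mmap_sem and retries the fault */
		return VM_FAULT_RETRY;
	}

	/* No retry allowed: sleep on the bo lock with mmap_sem held. */
	if (dma_resv_lock_interruptible(bo->base.resv, NULL))
		return VM_FAULT_NOPAGE;

	return 0;
}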
Thanks, Thomas
Thanks, Christian.
Still it's a fairly simple solution to a problem that seems otherwise hard to solve efficiently.
Thanks, Thomas
Regards, Christian.
-Daniel
On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
On 2/23/20 4:45 PM, Christian König wrote:
Am 21.02.20 um 18:12 schrieb Daniel Vetter:
[SNIP] Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating into essentially a global lock. But only when there's actual contention and thrashing.
Yes exactly. A really big problem in TTM is currently that we drop the lock after evicting BOs because they tend to move in again directly after that.
From practice I can also confirm that there is exactly zero benefit from dropping locks early and reacquire them for example for the VM page tables. That's just makes it more likely that somebody needs to roll back and this is what we need to avoid in the first place.
If you have a benchmarking setup available it would be very interesting for future reference to see how changing from WD to WW mutexes affects the roll back frequency. WW is known to cause rollbacks much less frequently but there is more work associated with each rollback.
Not of hand. To be honest I still have a hard time to get a grip on the difference between WD and WW from the algorithm point of view. So I can't judge that difference at all.
Contention on BO locks during command submission is perfectly fine as long as this is as lightweight as possible while we don't have trashing. When we have trashing multi submission performance is best archived to just favor a single process to finish its business and block everybody else.
Hmm. Sounds like we need a per-manager ww_rwsem protecting manager allocation, taken in write-mode then there's thrashing. In read-mode otherwise. That would limit the amount of "unnecessary" locks we'd have to keep and reduce unwanted side-effects, (see below):
Well per-manager (you mean per domain here don't you?) doesn't sound like that useful because we rarely use only one domain, but I'm actually questioning for quite a while if the per BO lock scheme was the right approach.
See from the performance aspect the closest to ideal solution I can think of would be a ww_rwsem per user of a resource.
In other words we don't lock BOs, but instead a list of all their users, and when you want to evict a BO you walk that list and inform all users that the BO will be moving.
During command submission you then have the fast path, which just grabs the read side of the user lock and checks whether all BOs are still in the expected place.
If some BOs were evicted you back off and start the slow path, e.g. maybe even copy additional data from userspace, then grab the write side of the lock, etc.
That approach is similar to what we use in amdgpu with the per-VM BOs, but goes a step further. The problem is that we are so used to per-BO locks in the kernel that this is probably not doable any more.
Yeah I think it'd be nice to have the same approach for shared bo too. I guess what we could do is something like this (spinning your ww_rwmutex idea a bit further):
dma_buf_read_lock(buf, vm)
{
	if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH)) {
		check that vm is indeed listed in buf and splat if not
	}

	/* for a buf that's not shared in multiple vm we'd have
	 * buf->resv == vm->resv here */
	return ww_mutex_lock(vm->resv);
}

dma_buf_write_lock(buf)
{
	for_each_vm_in_buf(buf, vm) {
		ww_mutex_lock(vm->resv);
	}
}
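Just to make the intended use concrete, a submission fast path on top of these helpers might look like the sketch below; everything in it is hypothetical (the structures, bo_still_resident(), submit_slow_path(), queue_to_hw(), and the dma_buf_read_unlock() counterpart are all assumed, not existing API):

#include <linux/list.h>
#include <linux/dma-buf.h>

/* All of these structures and helpers are hypothetical. */
struct exec_bo {
	struct list_head head;
	struct bo *bo;
};

struct exec_ctx {
	struct dma_buf *first_buf;	/* any buffer mapped into this vm */
	struct vm *vm;
	struct list_head bos;
};

bool bo_still_resident(struct bo *bo);
int submit_slow_path(struct exec_ctx *exec);	/* takes the write lock, revalidates */
void queue_to_hw(struct exec_ctx *exec);

static int submit_fast_path(struct exec_ctx *exec)
{
	struct exec_bo *ebo;
	int ret;

	/* one read lock covers every BO mapped into this vm */
	ret = dma_buf_read_lock(exec->first_buf, exec->vm);
	if (ret)
		return ret;

	list_for_each_entry(ebo, &exec->bos, head) {
		if (!bo_still_resident(ebo->bo)) {
			/* somebody evicted one of our BOs, back off */
			dma_buf_read_unlock(exec->first_buf, exec->vm);
			return submit_slow_path(exec);
		}
	}

	queue_to_hw(exec);
	dma_buf_read_unlock(exec->first_buf, exec->vm);
	return 0;
}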
Ideally we'd track all these vms with something slightly less shoddy than a linked list :-) A resizeable array is probably pretty good; I think we only ever need to go from buf -> vm list, not the other way round, at least in dma_resv/dma_buf code. Driver code ofc needs to keep a list of all bos bound to a vm somewhere, but that's probably a much bigger data structure for tracking vma offsets, mappings and other things on top.
Ofc to even just get there we'd need something like the sublock list to keep track of all the additional locks we'd need for the writer lock. And we'd need the release callback for backoff, so that we could also go through the slowpath on a vm object that we're not holding a full reference on. That also means vms need to be refcounted.
And the list of vms on a buffer needs to be protected with some lock and the usual kref_get_unless_zero trickery.
But with all this I think we can make the dma_buf_write_lock lock 100% like the old per-buffer lock for everyone. And execbuf could switch over to dma_buf_read_lock for shared buffers. Bonus points when the gpu context just keeps track of a list of shared vm used by buffers in that context ...
That way we could make vm fastpath locking a la amdgpu opt-in, while keeping everyone else on the per-object locking juices.
Thoughts?
Cheers, Daniel
PS: Of course the write lock for these buffers is going to be terrible, so every time you need to update fences for implicit sync on shared buffer (well write fence at least) it's going to suck. We probably also want a read_to_write_upgrade function, which also can be done easily with ww_mutex magic.
On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter daniel@ffwll.ch wrote:
[SNIP]
One thing I just realized, which is nasty: The full (write) lock needs ww_acquire_ctx with this, because it needs to take a bunch of locks. Rolling that out everywhere is going to be nasty.
I guess though we could do a fallback and have a locally created ww_acquire_ctx if there's none passed in, with backoff entirely implemented within dma_resv_lock. -Daniel
Am 26.02.20 um 17:32 schrieb Daniel Vetter:
[SNIP]
At least to me that sounds like a plan.
One thing I just realized, which is nasty: The full (write) lock needs ww_acquire_ctx with this, because it needs to take a bunch of locks. Rolling that out everywhere is going to be nasty.
Why? Taking a single write lock shouldn't be different from taking a single ww_mutex, or am I missing something?
I guess though we could do a fallback and have a locally created ww_acquire_ctx if there's none passed in, with backoff entirely implemented within dma_resv_lock.
How should that work? As far as I understand it, the ww_acquire_ctx must stay alive until after the last of the locks it was used with is unlocked. Or do I see this incorrectly?
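For reference, the lifetime rule being referred to looks roughly like this with the plain ww_mutex API (illustrative only; the -EDEADLK backoff dance is omitted and the ww_mutexes are assumed to be initialized elsewhere):

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(example_ww_class);

struct obj {
	struct ww_mutex lock;	/* initialized with ww_mutex_init() elsewhere */
};

static void example(struct obj *a, struct obj *b)
{
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &example_ww_class);

	if (ww_mutex_lock(&a->lock, &ctx))	/* -EDEADLK backoff omitted */
		goto out_fini;
	if (ww_mutex_lock(&b->lock, &ctx))
		goto out_unlock_a;

	ww_acquire_done(&ctx);	/* every lock for this ctx is now held */

	/* ... use the objects ... */

	ww_mutex_unlock(&b->lock);
out_unlock_a:
	ww_mutex_unlock(&a->lock);
out_fini:
	/* only legal once every lock taken with &ctx has been dropped again */
	ww_acquire_fini(&ctx);
}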
PS: Of course the write lock for these buffers is going to be terrible, so every time you need to update fences for implicit sync on shared buffer (well write fence at least) it's going to suck. We probably also want a read_to_write_upgrade function, which also can be done easily with ww_mutex magic.
I'm thinking that we probably solely want a read_to_write upgrade function.
Regards, Christian.
On Thu, Feb 27, 2020 at 10:20 AM Christian König christian.koenig@amd.com wrote:
[SNIP]
One thing I just realized, which is nasty: The full (write) lock needs ww_acquire_ctx with this, because it needs to take a bunch of locks. Rolling that out everywhere is going to be nasty.
Why? Taking a single write lock shouldn't be different from taking a single ww_mutex, or am I missing something?
There's no single write lock in my proposal. The write lock for a buffer is "take all the vm locks this buffer is mapped into", which means we need to take multiple ww_mutex locks, which means backoff and everything.
Otherwise the read lock of just taking the per-VM dma_resv lock won't work. And doing a true rw lock on each buffer feels a bit pointless, since then execbuf is back to O(num_buffers), and that's the suck we're trying to avoid.
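A rough sketch of what that multi-lock write path with backoff could look like (for_each_vm_in_buf(), struct vm and unlock_held_vms() are hypothetical; the -EDEADLK dance itself is the standard ww_mutex one):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* hypothetical vm object and helpers, for illustration only */
struct vm {
	struct dma_resv *resv;
};
void unlock_held_vms(struct dma_buf *buf);	/* drops every vm lock held so far */

static int dma_buf_write_lock(struct dma_buf *buf, struct ww_acquire_ctx *ctx)
{
	struct vm *vm, *contended = NULL;
	int ret;

retry:
	for_each_vm_in_buf(buf, vm) {
		if (vm == contended) {
			/* already locked via dma_resv_lock_slow() below */
			contended = NULL;
			continue;
		}

		ret = dma_resv_lock(vm->resv, ctx);
		if (ret == -EDEADLK) {
			/* back off: drop every lock taken so far on this
			 * buffer, including a previously contended one */
			unlock_held_vms(buf);
			contended = vm;
			dma_resv_lock_slow(contended->resv, ctx);
			goto retry;
		} else if (ret) {
			unlock_held_vms(buf);
			return ret;
		}
	}
	return 0;
}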
I guess though we could do a fallback and have a locally created ww_acquire_ctx if there's none passed in, with backoff entirely implemented within dma_resv_lock.
How should that work? As far as I understand it, the ww_acquire_ctx must stay alive until after the last of the locks it was used with is unlocked. Or do I see this incorrectly?
Ugh right.
I guess we could put the ww_acquire_ctx into the dma_resv so it survives until the unlock. But that's a rather gross hack. -Daniel