On Wed, Jul 31, 2019 at 10:55:02AM +0200, Daniel Vetter wrote:
On Thu, Jun 27, 2019 at 09:28:11AM +0200, Christian König wrote:
Hi Daniel,
those failures look random to me and not related to my patch set. Correct?
First one I looked at has the reservation_obj all over:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cml-u/igt@gem_ex...
So a 5-second guess is ... probably real?
Note that with the entire lmem stuff going on right now there are massive discussions about how we're doing resv_obj vs. obj->mm.lock the wrong way round in i915, so I'm not surprised at all that you managed to trip this.
The way I see it right now is that obj->mm.lock needs to be limited to dealing with the i915 shrinker interactions only, and only for i915 native objects. And for dma-bufs we need to make sure it's not anywhere in the callchain. Unfortunately that's a massive refactor I guess ...
Thought about this some more. Aside from just breaking i915 or waiting until it's refactored (both not awesome), I think the only option is to get back to the original caching, and figure out whether we really need to take the direction into account for that, or whether unconditionally upgrading to bidirectional would be OK. I think there are only really two cases where this matters:
- display drivers using the cma/dma_alloc helpers. Everything is allocated fully coherent, CPU side WC, no flushing.
- everyone else (on platforms where there's actually some flushing going on) is rendering GPUs, and those always map bidirectional and want the mapping cached for as long as possible.
With that we could go back to creating the cached mapping at attach time and avoid inflicting the reservation object lock on places that would keel over.
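Roughly what I have in mind, as a quick sketch (hand-written here, not from the series; my_attach(), my_map_dma_buf() and my_get_sgt() are made-up names):

    /*
     * Sketch only: cache the sg_table at attach time, mapped
     * DMA_BIDIRECTIONAL, so map_dma_buf never has to flush or take the
     * reservation lock. my_get_sgt() is a hypothetical exporter helper.
     */
    static int my_attach(struct dma_buf *dmabuf,
                         struct dma_buf_attachment *attach)
    {
            struct sg_table *sgt = my_get_sgt(dmabuf->priv);

            if (IS_ERR(sgt))
                    return PTR_ERR(sgt);

            /* Map once, for both directions, for the attachment lifetime. */
            sgt->nents = dma_map_sg(attach->dev, sgt->sgl, sgt->orig_nents,
                                    DMA_BIDIRECTIONAL);
            if (!sgt->nents)
                    return -ENOMEM;

            attach->priv = sgt;
            return 0;
    }

    static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                           enum dma_data_direction dir)
    {
            /* The direction is ignored: the cached mapping is bidirectional. */
            return attach->priv;
    }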
Thoughts? -Daniel
Christian.
Am 26.06.19 um 15:32 schrieb Patchwork:
== Series Details ==
Series: series starting with [1/6] dma-buf: add dynamic DMA-buf handling v13
URL: https://patchwork.freedesktop.org/series/62777/
State: failure
== Summary ==
CI Bug Log - changes from CI_DRM_6358 -> Patchwork_13438
Summary
**FAILURE**
Serious unknown changes coming with Patchwork_13438 absolutely need to be verified manually. If you think the reported changes have nothing to do with the changes introduced in Patchwork_13438, please notify your bug team to allow them to document this new failure mode, which will reduce false positives in CI.
External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/
Possible new issues
Here are the unknown changes that may have been introduced in Patchwork_13438:
### IGT changes ###
#### Possible regressions ####
- igt@i915_selftest@live_contexts:
- fi-kbl-7567u: [PASS][1] -> [INCOMPLETE][2]
[1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-7567u/igt@i915_selftest@live_contexts.html [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-7567u/igt@i915_selftest@live_contexts.html
- igt@i915_selftest@live_hugepages:
- fi-skl-gvtdvm: [PASS][3] -> [INCOMPLETE][4]
[3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-gvtdvm/igt@i915_selftest@live_hugepages.html [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-gvtdvm/igt@i915_selftest@live_hugepages.html
Known issues
Here are the changes found in Patchwork_13438 that come from known issues:
### IGT changes ###
#### Issues hit ####
- igt@gem_basic@create-close:
- fi-icl-u3: [PASS][5] -> [DMESG-WARN][6] ([fdo#107724])
[5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-icl-u3/igt@gem_basic@create-close.html [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-icl-u3/igt@gem_basic@create-close.html
- igt@gem_ctx_switch@basic-default:
- fi-icl-guc: [PASS][7] -> [INCOMPLETE][8] ([fdo#107713] / [fdo#108569])
[7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-icl-guc/igt@gem_ctx_switch@basic-default.html [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-icl-guc/igt@gem_ctx_switch@basic-default.html
- igt@gem_exec_create@basic:
- fi-icl-u2: [PASS][9] -> [INCOMPLETE][10] ([fdo#107713])
[9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-icl-u2/igt@gem_exec_create@basic.html [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-icl-u2/igt@gem_exec_create@basic.html
- igt@gem_exec_fence@basic-busy-default:
- fi-blb-e6850: [PASS][11] -> [DMESG-WARN][12] ([fdo#110913])
[11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-blb-e6850/igt@gem_exec_fence@basic-busy-default.html [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-blb-e6850/igt@gem_exec_fence@basic-busy-default.html
- fi-cfl-guc: [PASS][13] -> [DMESG-WARN][14] ([fdo#110913])
[13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-cfl-guc/igt@gem_exec_fence@basic-busy-default.html [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cfl-guc/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-guc: [PASS][15] -> [DMESG-WARN][16] ([fdo#110913])
[15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-guc/igt@gem_exec_fence@basic-busy-default.html [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-guc/igt@gem_exec_fence@basic-busy-default.html
- fi-apl-guc: [PASS][17] -> [DMESG-WARN][18] ([fdo#110913])
[17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-apl-guc/igt@gem_exec_fence@basic-busy-default.html [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-apl-guc/igt@gem_exec_fence@basic-busy-default.html
- fi-hsw-4770r: [PASS][19] -> [DMESG-WARN][20] ([fdo#110913])
[19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-hsw-4770r/igt@gem_exec_fence@basic-busy-default.html [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-hsw-4770r/igt@gem_exec_fence@basic-busy-default.html
- fi-glk-dsi: [PASS][21] -> [DMESG-WARN][22] ([fdo#110913])
[21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-glk-dsi/igt@gem_exec_fence@basic-busy-default.html [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-glk-dsi/igt@gem_exec_fence@basic-busy-default.html
- fi-pnv-d510: [PASS][23] -> [DMESG-WARN][24] ([fdo#110913])
[23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-pnv-d510/igt@gem_exec_fence@basic-busy-default.html [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-pnv-d510/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-6600u: [PASS][25] -> [DMESG-WARN][26] ([fdo#110913])
[25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-6600u/igt@gem_exec_fence@basic-busy-default.html [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-6600u/igt@gem_exec_fence@basic-busy-default.html
- fi-hsw-peppy: [PASS][27] -> [DMESG-WARN][28] ([fdo#110913])
[27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-hsw-peppy/igt@gem_exec_fence@basic-busy-default.html [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-hsw-peppy/igt@gem_exec_fence@basic-busy-default.html
- fi-kbl-x1275: [PASS][29] -> [DMESG-WARN][30] ([fdo#110913])
[29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-x1275/igt@gem_exec_fence@basic-busy-default.html [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-x1275/igt@gem_exec_fence@basic-busy-default.html
- fi-cfl-8700k: [PASS][31] -> [DMESG-WARN][32] ([fdo#110913])
[31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-cfl-8700k/igt@gem_exec_fence@basic-busy-default.html [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cfl-8700k/igt@gem_exec_fence@basic-busy-default.html
- fi-icl-dsi: [PASS][33] -> [DMESG-WARN][34] ([fdo#110913])
[33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-icl-dsi/igt@gem_exec_fence@basic-busy-default.html [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-icl-dsi/igt@gem_exec_fence@basic-busy-default.html
- fi-bdw-5557u: [PASS][35] -> [DMESG-WARN][36] ([fdo#110913])
[35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-bdw-5557u/igt@gem_exec_fence@basic-busy-default.html [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-bdw-5557u/igt@gem_exec_fence@basic-busy-default.html
- fi-kbl-7500u: [PASS][37] -> [DMESG-WARN][38] ([fdo#110913])
[37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-7500u/igt@gem_exec_fence@basic-busy-default.html [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-7500u/igt@gem_exec_fence@basic-busy-default.html
- fi-kbl-8809g: [PASS][39] -> [DMESG-WARN][40] ([fdo#110913])
[39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-8809g/igt@gem_exec_fence@basic-busy-default.html [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-8809g/igt@gem_exec_fence@basic-busy-default.html
- fi-bxt-dsi: [PASS][41] -> [DMESG-WARN][42] ([fdo#109385] / [fdo#110913])
[41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-bxt-dsi/igt@gem_exec_fence@basic-busy-default.html [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-bxt-dsi/igt@gem_exec_fence@basic-busy-default.html
- fi-whl-u: [PASS][43] -> [DMESG-WARN][44] ([fdo#110913])
[43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-whl-u/igt@gem_exec_fence@basic-busy-default.html [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-whl-u/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-gvtdvm: [PASS][45] -> [DMESG-WARN][46] ([fdo#110913])
[45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-gvtdvm/igt@gem_exec_fence@basic-busy-default.html [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-gvtdvm/igt@gem_exec_fence@basic-busy-default.html
- fi-cml-u: [PASS][47] -> [DMESG-WARN][48] ([fdo#110913])
[47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-cml-u/igt@gem_exec_fence@basic-busy-default.html [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cml-u/igt@gem_exec_fence@basic-busy-default.html
- fi-bsw-n3050: [PASS][49] -> [DMESG-WARN][50] ([fdo#109385] / [fdo#110913])
[49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-bsw-n3050/igt@gem_exec_fence@basic-busy-default.html [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-bsw-n3050/igt@gem_exec_fence@basic-busy-default.html
- fi-ilk-650: [PASS][51] -> [DMESG-WARN][52] ([fdo#110913])
[51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-ilk-650/igt@gem_exec_fence@basic-busy-default.html [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-ilk-650/igt@gem_exec_fence@basic-busy-default.html
- fi-hsw-4770: [PASS][53] -> [DMESG-WARN][54] ([fdo#110913])
[53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-hsw-4770/igt@gem_exec_fence@basic-busy-default.html [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-hsw-4770/igt@gem_exec_fence@basic-busy-default.html
- fi-bdw-gvtdvm: [PASS][55] -> [DMESG-WARN][56] ([fdo#110913])
[55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-bdw-gvtdvm/igt@gem_exec_fence@basic-busy-default.html [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-bdw-gvtdvm/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-iommu: [PASS][57] -> [DMESG-WARN][58] ([fdo#110913])
[57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-iommu/igt@gem_exec_fence@basic-busy-default.html [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-iommu/igt@gem_exec_fence@basic-busy-default.html
- fi-kbl-7567u: [PASS][59] -> [DMESG-WARN][60] ([fdo#110913])
[59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-7567u/igt@gem_exec_fence@basic-busy-default.html [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-7567u/igt@gem_exec_fence@basic-busy-default.html
- fi-ivb-3770: [PASS][61] -> [DMESG-WARN][62] ([fdo#110913])
[61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-ivb-3770/igt@gem_exec_fence@basic-busy-default.html [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-ivb-3770/igt@gem_exec_fence@basic-busy-default.html
- fi-snb-2520m: [PASS][63] -> [DMESG-WARN][64] ([fdo#110913])
[63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-snb-2520m/igt@gem_exec_fence@basic-busy-default.html [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-snb-2520m/igt@gem_exec_fence@basic-busy-default.html
- fi-kbl-guc: [PASS][65] -> [DMESG-WARN][66] ([fdo#110913])
[65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-guc/igt@gem_exec_fence@basic-busy-default.html [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-guc/igt@gem_exec_fence@basic-busy-default.html
- fi-bsw-kefka: [PASS][67] -> [DMESG-WARN][68] ([fdo#109385] / [fdo#110913])
[67]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-bsw-kefka/igt@gem_exec_fence@basic-busy-default.html [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-bsw-kefka/igt@gem_exec_fence@basic-busy-default.html
- fi-kbl-r: [PASS][69] -> [DMESG-WARN][70] ([fdo#110913])
[69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-kbl-r/igt@gem_exec_fence@basic-busy-default.html [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-kbl-r/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-6770hq: [PASS][71] -> [DMESG-WARN][72] ([fdo#110913])
[71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-6770hq/igt@gem_exec_fence@basic-busy-default.html [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-6770hq/igt@gem_exec_fence@basic-busy-default.html
- fi-byt-n2820: [PASS][73] -> [DMESG-WARN][74] ([fdo#110913])
[73]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-byt-n2820/igt@gem_exec_fence@basic-busy-default.html [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-byt-n2820/igt@gem_exec_fence@basic-busy-default.html
- fi-cfl-8109u: [PASS][75] -> [DMESG-WARN][76] ([fdo#110913])
[75]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-cfl-8109u/igt@gem_exec_fence@basic-busy-default.html [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-cfl-8109u/igt@gem_exec_fence@basic-busy-default.html
- fi-icl-u3: [PASS][77] -> [DMESG-WARN][78] ([fdo#110913])
[77]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-icl-u3/igt@gem_exec_fence@basic-busy-default.html [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-icl-u3/igt@gem_exec_fence@basic-busy-default.html
- fi-byt-j1900: [PASS][79] -> [DMESG-WARN][80] ([fdo#110913])
[79]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-byt-j1900/igt@gem_exec_fence@basic-busy-default.html [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-byt-j1900/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-6260u: [PASS][81] -> [DMESG-WARN][82] ([fdo#110913])
[81]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-6260u/igt@gem_exec_fence@basic-busy-default.html [82]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-6260u/igt@gem_exec_fence@basic-busy-default.html
- fi-bxt-j4205: [PASS][83] -> [DMESG-WARN][84] ([fdo#109385] / [fdo#110913])
[83]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-bxt-j4205/igt@gem_exec_fence@basic-busy-default.html [84]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-bxt-j4205/igt@gem_exec_fence@basic-busy-default.html
- fi-snb-2600: [PASS][85] -> [DMESG-WARN][86] ([fdo#110913])
[85]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-snb-2600/igt@gem_exec_fence@basic-busy-default.html [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-snb-2600/igt@gem_exec_fence@basic-busy-default.html
- fi-elk-e7500: [PASS][87] -> [DMESG-WARN][88] ([fdo#110913])
[87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-elk-e7500/igt@gem_exec_fence@basic-busy-default.html [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-elk-e7500/igt@gem_exec_fence@basic-busy-default.html
- fi-skl-6700k2: [PASS][89] -> [DMESG-WARN][90] ([fdo#110913])
[89]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-skl-6700k2/igt@gem_exec_fence@basic-busy-default.html [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-skl-6700k2/igt@gem_exec_fence@basic-busy-default.html
#### Possible fixes ####
- igt@gem_exec_fence@basic-wait-default:
- fi-icl-u3: [DMESG-WARN][91] ([fdo#107724]) -> [PASS][92] +2 similar issues
[91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6358/fi-icl-u3/igt@gem_exec_fence@basic-wait-default.html [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/fi-icl-u3/igt@gem_exec_fence@basic-wait-default.html
{name}: This element is suppressed. This means it is ignored when computing the status of the difference (SUCCESS, WARNING, or FAILURE).
Participating hosts (54 -> 46)
Additional (1): fi-cml-u2
Missing (9): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 fi-icl-y fi-byt-clapper fi-bdw-samus
Build changes
- Linux: CI_DRM_6358 -> Patchwork_13438
CI_DRM_6358: 3454454b146cd9f684feb07ab9dff61dc875a022 @ git://anongit.freedesktop.org/gfx-ci/linux
IGT_5068: 15ad664534413628f06c0f172aac11598bfdb895 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
Patchwork_13438: fbf9925e3e0220296c42260729dff5cc9d5d4c84 @ git://anongit.freedesktop.org/gfx-ci/linux
== Linux commits ==
fbf9925e3e02 drm/amdgpu: add independent DMA-buf import v7
13b9bef056c0 drm/amdgpu: add independent DMA-buf export v6
b0d7a97cb362 drm/amdgpu: use allowed_domains for exported DMA-bufs
4624e4f46a05 drm/ttm: use the parent resv for ghost objects v2
8f8445fa6219 drm/ttm: remove the backing store if no placement is given
6d09a7de3969 dma-buf: add dynamic DMA-buf handling v13
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_13438/
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Am 07.08.19 um 23:19 schrieb Daniel Vetter:
[SNIP]
Actually we had a not-so-nice internal mail thread with our hardware guys, and it looks like we have tons of hardware bugs/exceptions where PCIe BARs are sometimes only readable or only writable. So it turns out that always caching with bidirectional won't work for us either.
In addition to that, I'm not sure how i915 actually triggered the issue, because with the current code that shouldn't be possible.
But independent of that, I came to the conclusion that we first need to get to a common view of what the fences in the reservation object mean, or otherwise the whole thing here isn't going to work smoothly either.
So I'm working on that for now, and when that's finished I will come back to this problem again.
Regards, Christian.
On Thu, Aug 8, 2019 at 9:09 AM Koenig, Christian Christian.Koenig@amd.com wrote:
[SNIP]
But independent of that, I came to the conclusion that we first need to get to a common view of what the fences in the reservation object mean, or otherwise the whole thing here isn't going to work smoothly either.
So I'm working on that for now, and when that's finished I will come back to this problem again.
Yeah, makes sense. I think we also need to clarify the existing rules a bit around reservation_object, dma_fence signaling, and how that nests with everything else (like memory allocation/fs_reclaim critical sections, or mmap_sem).
Ignoring the drivers which just pin everything into system memory (mostly just SoCs), I think we have a bunch of groups, and they're all somewhat incompatible with one another. Examples:
- old ttm drivers (anything except amdgpu) nest the mmap_sem within the reservation_object. That allows you to do copy_*_user while holding reservations, simplifying command submission since you don't need fallback paths when you take a fault. But it means you have this awkward trylock in the mmap path with no forward progress guarantee at all.
amdgpu fixed that (but left ttm alone); i915 also works like that, with mmap_sem being the outer lock.
- the other is reservation_object vs. memory allocations. Currently all drivers assume you can allocate memory while holding a reservation, but the i915 GEM folks seem to have some plans to change that for i915. Which isn't going to work, I think, so we need to clarify that before things get more inconsistent.
The above two can at least be ensured by adding some lockdep annotations and dependency priming; see i915_gem_shrinker_taints_mutex for what I have in mind for reservation_obj.
The real pain/scary thing is dma_fence. All the shrinkers/mmu_notifiers/hmm_mirrors we have assume that you can wait for a fence from allocation contexts/direct reclaim. Which means nothing you do between publishing a fence somewhere (dma-buf, syncobj, syncpt fd) and signalling that fence is allowed to allocate memory or pull in any dependencies which might need memory allocations. I think we're mostly ok with this, but there's some i915 patches that break this.
Much worse is that lockdep can't help us check this: dma_fence is essentially struct completion on steroids, and the cross-release lockdep support for struct completion looks like it's never going to get merged. So no debugging aids to make sure we get this right, all we have is review and testing and machines deadlocking in really complicated ways if we get it wrong.
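To illustrate the rule with a hypothetical driver snippet (my_fence_ops, my_lock, my_context and my_hw_submit() are made up; the caller is assumed to hold bo->resv):

    /* Publish the fence; from here a shrinker in direct reclaim may wait on it. */
    dma_fence_init(&job->fence, &my_fence_ops, &my_lock, my_context, seqno);
    reservation_object_add_excl_fence(bo->resv, &job->fence);

    /*
     * Critical section: no GFP_KERNEL allocations and nothing that might
     * depend on one, or direct reclaim can deadlock against this fence.
     */
    my_hw_submit(job);

    /* Signalled: allocations are fine again. */
    dma_fence_signal(&job->fence);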
Cheers, Daniel
Am 08.08.19 um 09:29 schrieb Daniel Vetter:
[SNIP]
- old ttm drivers (anything except amdgpu) nest the mmap_sem within the reservation_object. That allows you to do copy_*_user while holding reservations, simplifying command submission since you don't need fallback paths when you take a fault. But it means you have this awkward trylock in the mmap path with no forward progress guarantee at all.
amdgpu fixed that (but left ttm alone); i915 also works like that, with mmap_sem being the outer lock.
By the way, that is incorrect. Neither amdgpu nor radeon uses copy_to/from_user while holding the reservation lock.
The last time I checked the only driver still doing that was nouveau.
Maybe time to add a might_lock() so that we will be informed about misuse by lockdep?
Christian.
On Thu, Aug 8, 2019 at 1:05 PM Koenig, Christian Christian.Koenig@amd.com wrote:
[SNIP]
By the way, that is incorrect. Neither amdgpu nor radeon uses copy_to/from_user while holding the reservation lock.
Cool, this is great. When I recently re-read a pile of code in there, I only looked at the amdgpu CS path again in detail. And yeah, on rechecking, at least nouveau still does it backwards: first it calls nouveau_gem_pushbuf_validate (which does all the ww_mutex stuff), then nouveau_gem_pushbuf_reloc_apply (which calls copy_from_user and doesn't seem to have any fallback/retry slowpath that does the copy without holding the locks).
I think vmwgfx is also (no longer?) inverted; it seems to drop the resv_obj locks in vmw_validation_bo_fence(), and only after that does it call vmw_execbuf_copy_fence_user(). So not a resv_obj vs. mmap_sem inversion, but the copy_*_user might trigger the shrinker. At least amdgpu and i915 rely on being able to wait on published fences in there, so we might be looking at some other inversion here. Unfortunately this isn't one lockdep can catch :-/
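Schematically, the inverted ordering looks like this (illustration only, not actual nouveau code):

    /*
     * nouveau_gem_pushbuf_validate(...);    // takes the resv_obj ww_mutexes
     * nouveau_gem_pushbuf_reloc_apply(...); // copy_from_user() can fault
     *                                       //  -> mmap_sem / direct reclaim
     *                                       //  -> shrinker may wait on fences
     *                                       //     behind the locks still held
     *
     * The safe ordering copies the userspace data before taking the
     * reservations, or drops them and retries when the copy faults.
     */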
The last time I checked the only driver still doing that was nouveau.
Maybe time to add a might_lock() so that we will be informed about misuse by lockdep?
Not even a might_lock, just priming the lockdep dependency tracking by essentially doing a quick sequence of:
mutex_lock(mmap_sem); reservation_object_lock(); fs_reclaim_acquire(GFP_KERNEL);
And then unwinding. Once you've done that, lockdep will know that a) resv_obj nests within mmap_sem and b) you can allocate memory while holding a resv_obj.
And like with any other lockdep splats it will loudly complain if anyone breaks the rules. Since we know that nouveau (and probably also vmwgfx I guess) breaks the rules we'll probably need to hide this behind CONFIG_DEBUG_WW_MUTEX_SLOWPATH or something similar.
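Spelled out, the priming could look roughly like this (untested sketch, loosely modeled on i915_gem_shrinker_taints_mutex):

    /* Run once at init; teaches lockdep the intended nesting. */
    static void prime_resv_obj_lockdep(void)
    {
            struct reservation_object resv;

            reservation_object_init(&resv);

            down_read(&current->mm->mmap_sem);  /* a) mmap_sem is outermost */
            reservation_object_lock(&resv, NULL);

            fs_reclaim_acquire(GFP_KERNEL);     /* b) allocations allowed here */
            fs_reclaim_release(GFP_KERNEL);

            reservation_object_unlock(&resv);
            up_read(&current->mm->mmap_sem);

            reservation_object_fini(&resv);
    }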
I'm also working on some lockdep annotation improvements for mmu_notifier. We might also want to make sure we handle that part consistently across drivers, but I'm much less clear on what the real rules are there (aside from the fact that i915 userptr looks really strange compared to the amdgpu one). -Daniel
On Thu, Aug 8, 2019 at 9:09 AM Koenig, Christian Christian.Koenig@amd.com wrote:
[SNIP]
In addition to that, I'm not sure how i915 actually triggered the issue, because with the current code that shouldn't be possible.
Forgot to explain this: i915 has its own lock for managing buffer state, originally struct_mutex, now also i915_gem_obj->mm.lock. When importing we take both of these before calling into the exporter; when exporting we need them when getting called from the importer. If the importer uses the reservation_object lock, you get a classic ABBA deadlock.
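Schematically, as lock classes (illustration only):

    /*
     * CPU 1: i915 importing              CPU 2: dynamic importer, i915 exporting
     *
     * mutex_lock(&i915->struct_mutex);   reservation_object_lock(resv, ctx);
     * dma_buf_map_attachment();          dma_buf_map_attachment();
     *   -> takes the exporter's            -> calls into i915, which takes
     *      reservation lock                   struct_mutex
     *
     * Each side holds one lock class while waiting for the other: ABBA.
     */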
I thought the plan was to push struct_mutex down and obj->mm.lock up until they meet in the middle and we can start using the resv_obj ww_mutex for everything. But looking at some of the latest in-flight patches (I cc'ed you on them) that seems to not really be the plan, which is bad :-/ -Daniel