This patch introduces a workaround for a case where a uevent is issued by the kernel because of DP link training failing on a connector unrelated to the current test. Since the test depends on receiving a hotplug uevent, it previously passed even though it should not have.
False positives also occur due to the plug/unplug events being delayed and issued at resume time. This is mitigated by catching and flushing hotplugs every time a change is made on connectors, but it is not enough to ensure that all hotplug events were caught and not delayed.
The problem here is that it is not possible to find out the exact reason why a uevent is issued by the kernel. A possible way to fix this would be to introduce more fields (the connector name and some reason why the event is triggered would probably be sufficient).
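To illustrate what the extra fields would allow, here is a minimal sketch of how a test could filter uevents if the payload carried them. The keys CONNECTOR_NAME and REASON, the value "link-training", and both helper names are hypothetical; they are not part of any existing kernel interface. Only HOTPLUG=1 is taken from the discussion. The payload is assumed to be a sequence of NUL-terminated KEY=value strings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up KEY in a payload made of NUL-terminated "KEY=value" strings.
 * Returns a pointer to the value, or NULL if KEY is absent.
 * A trailing entry without a terminating NUL is ignored.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t key_len = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *entry = buf + off;
		const char *end = memchr(entry, '\0', len - off);
		size_t entry_len;

		if (!end)
			break;
		entry_len = (size_t)(end - entry);

		if (entry_len > key_len && entry[key_len] == '=' &&
		    strncmp(entry, key, key_len) == 0)
			return entry + key_len + 1;

		off += entry_len + 1;
	}

	return NULL;
}

/*
 * Hypothetical filter: true only for a hotplug uevent that names the
 * given connector and was not caused by a link training failure.
 * CONNECTOR_NAME and REASON are assumed fields, not an existing ABI.
 */
static bool uevent_is_plug_event_for(const char *buf, size_t len,
				     const char *connector)
{
	const char *hotplug = uevent_get(buf, len, "HOTPLUG");
	const char *name = uevent_get(buf, len, "CONNECTOR_NAME");
	const char *reason = uevent_get(buf, len, "REASON");

	return hotplug && strcmp(hotplug, "1") == 0 &&
	       name && strcmp(name, connector) == 0 &&
	       reason && strcmp(reason, "link-training") != 0;
}
```

With fields like these, the EDID change test could ignore a uevent raised for an unrelated connector instead of counting it as the expected hotplug.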
It may occur that a hotplug uevent is detected at resume, even though it does not indicate that an actual hotplug happened. This is the case when link training fails on any other connector.
There is currently no way to distinguish what connector caused a hotplug uevent, nor what the reason for that uevent really is. This makes it impossible to find out whether the test actually passed or not.
To circumvent this problem, the link status of each connector is collected before suspend and again after resume. The two sets are compared, and the test is skipped if any link was good before suspend and bad after resume.
This only concerns the EDID change test, where we cannot check the connector state (that is not supposed to have changed). For actual hotplug tests, the tests should be safe since they check each connector's state after receiving the uevent.
The situation described here happens with DP-VGA bridges that fail link training after resume, as they need some more time to respond on their AUX channel.
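The comparison described above boils down to the following stand-alone sketch. The helper name link_status_regressed is illustrative only; the patch itself performs the check inline with igt_skip_on.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if at least one connector had a good link before suspend
 * and a bad link after resume. Each array holds one "link failed" flag
 * per connector.
 *
 * Note the blind spot: a link that was already bad before suspend and is
 * still bad afterwards is not reported, since nothing changed.
 */
static bool link_status_regressed(const bool *failed_before,
				  const bool *failed_after, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (!failed_before[i] && failed_after[i])
			return true;

	return false;
}
```

A caller would skip the test when this returns true, because the hotplug uevent seen at resume may then come from the link training failure rather than from the EDID change.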
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@linux.intel.com>
---
 tests/chamelium.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/tests/chamelium.c b/tests/chamelium.c
index e26f0557..8af33aaa 100644
--- a/tests/chamelium.c
+++ b/tests/chamelium.c
@@ -87,6 +87,31 @@ get_precalculated_crc(struct chamelium_port *port, int w, int h)
 }

 static void
+get_connectors_link_status_failed(data_t *data, bool *link_status_failed)
+{
+	drmModeConnector *connector;
+	uint64_t link_status;
+	drmModePropertyPtr prop;
+	int p;
+
+	for (p = 0; p < data->port_count; p++) {
+		connector = chamelium_port_get_connector(data->chamelium,
+							 data->ports[p], false);
+
+		igt_assert(kmstest_get_property(data->drm_fd,
+						connector->connector_id,
+						DRM_MODE_OBJECT_CONNECTOR,
+						"link-status", NULL,
+						&link_status, &prop));
+
+		link_status_failed[p] = link_status == DRM_MODE_LINK_STATUS_BAD;
+
+		drmModeFreeProperty(prop);
+		drmModeFreeConnector(connector);
+	}
+}
+
+static void
 require_connector_present(data_t *data, unsigned int type)
 {
 	int i;
@@ -310,6 +335,8 @@ test_suspend_resume_edid_change(data_t *data, struct chamelium_port *port,
 				int alt_edid_id)
 {
 	struct udev_monitor *mon = igt_watch_hotplug();
+	bool link_status_failed[2][data->port_count];
+	int p;

 	reset_state(data, port);

@@ -326,8 +353,16 @@ test_suspend_resume_edid_change(data_t *data, struct chamelium_port *port,
 	 */
 	chamelium_port_set_edid(data->chamelium, port, alt_edid_id);

+	get_connectors_link_status_failed(data, link_status_failed[0]);
+
 	igt_system_suspend_autoresume(state, test);
+
 	igt_assert(igt_hotplug_detected(mon, HOTPLUG_TIMEOUT));
+
+	get_connectors_link_status_failed(data, link_status_failed[1]);
+
+	for (p = 0; p < data->port_count; p++)
+		igt_skip_on(!link_status_failed[0][p] && link_status_failed[1][p]);
 }

 static igt_output_t *
Quoting Paul Kocialkowski (2017-07-18 16:16:26)
It may occur that a hotplug uevent is detected at resume, even though it does not indicate that an actual hotplug happened. This is the case when link training fails on any other connector.
There is currently no way to distinguish what connector caused a hotplug uevent, nor what the reason for that uevent really is. This makes it impossible to find out whether the test actually passed or not.
And you may get more than one and then this skips even though the test passed. Looks like the patch is overcompensating. What you can do is repeat the test a few times, and then look at all the different errors you get. If the connector remains (no MST disappearance) once it goes bad, it should remain bad and so not generate any new uevent. Or you only repeat the test whilst link_status[old] != link_status[new].
-Chris
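The "repeat whilst link_status[old] != link_status[new]" idea can be sketched as follows. This is only an illustration under the premise stated above, namely that a link which is already bad does not raise further uevents. The struct and function names are hypothetical, and each record is assumed to hold the per-connector "link failed" flags sampled around one suspend/resume cycle.

```c
#include <stdbool.h>
#include <stddef.h>

/* One suspend/resume cycle: link flags before and after, plus the result. */
struct run_record {
	const bool *failed_before;
	const bool *failed_after;
	bool hotplug_seen;
};

/* True if any connector's link status differs across the cycle. */
static bool link_status_changed(const struct run_record *run,
				size_t port_count)
{
	for (size_t p = 0; p < port_count; p++)
		if (run->failed_before[p] != run->failed_after[p])
			return true;

	return false;
}

/*
 * Returns the index of the first cycle in which no link status changed,
 * so that its hotplug result can be trusted under the stated premise,
 * or -1 if every recorded cycle saw a change and another run is needed.
 */
static int first_conclusive_run(const struct run_record *runs, int n_runs,
				size_t port_count)
{
	for (int i = 0; i < n_runs; i++)
		if (!link_status_changed(&runs[i], port_count))
			return i;

	return -1;
}
```

The cost of this approach is one extra suspend/resume cycle for every run in which a link status changed.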
On Tue, 2017-07-18 at 22:21 +0100, Chris Wilson wrote:
And you may get more than one and then this skips even though the test passed. Looks like the patch is overcompensating. What you can do is repeat the test a few times, and then look at all the different errors you get. If the connector remains (no MST disappearance) once it goes bad, it should remain bad and so not generate any new uevent. Or you only repeat the test whilst link_status[old] != link_status[new].
I am not sure it is really desirable to repeat the test until we are fairly certain it succeeds. This involves suspend/resume, that is already long enough as it is.
Also, a uevent will be generated every time link training fails, regardless of whether it was already failing before (I just tested that to make sure). In my case, it's due to a DP-VGA bridge that will consistently fail link training in the first seconds after resume.
So this is actually even worse than I thought, because there is no way to find out that this is why a uevent was generated if the link status was already bad before.
So I don't see how we can manage with the information currently at our disposal.
My main point here is that we need more information about what's going on than simply "HOTPLUG=1". These patches demonstrate that working around the lack of information is a pain for testing purposes and can only lead to semi-working hackish workarounds.
Do you agree that this is what the problem really is?
On Wed, 2017-07-19 at 11:31 +0300, Paul Kocialkowski wrote:
Do you agree that this is what the problem really is?
Yes, I agree we need more debugging information for when hotplugs fail. That said, the fact that i915 unconditionally sends hotplugs on resume (this appears to be a hack that was added to avoid missing hotplug events between suspend and resume) is really what is causing this specific problem.
We really need the debugging support that Martin and I suggested for the kernel, and also more DRM helpers to actually do EDID checks and that sort of thing, so that we don't have to deal with dirty hacks like this.
This adds calls to igt_hotplug_detected and igt_flush_hotplugs to catch and flush hotplugs from connector unplug (due to chamelium reset) and plug. These need to be intercepted so that they are not delayed and issued after resume, providing a false positive for the test result.
In addition, the final hotplug uevent flush is brought closer to the suspend call, to decrease the likelihood of false positives.
However, false positives still do happen, because it is not possible to make sure that the uevent caused by each connector's state change was caught instead of being delayed and issued at resume time.
Signed-off-by: Paul Kocialkowski <paul.kocialkowski@linux.intel.com>
---
 tests/chamelium.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tests/chamelium.c b/tests/chamelium.c
index 8af33aaa..0528ffb3 100644
--- a/tests/chamelium.c
+++ b/tests/chamelium.c
@@ -340,12 +340,16 @@ test_suspend_resume_edid_change(data_t *data, struct chamelium_port *port,

 	reset_state(data, port);

+	/* Catch the event and flush all remaining ones. */
+	igt_assert(igt_hotplug_detected(mon, HOTPLUG_TIMEOUT));
+	igt_flush_hotplugs(mon);
+
 	/* First plug in the port */
 	chamelium_port_set_edid(data->chamelium, port, edid_id);
 	chamelium_plug(data->chamelium, port);
-	wait_for_connector(data, port, DRM_MODE_CONNECTED);
+	igt_assert(igt_hotplug_detected(mon, HOTPLUG_TIMEOUT));

-	igt_flush_hotplugs(mon);
+	wait_for_connector(data, port, DRM_MODE_CONNECTED);

 	/*
 	 * Change the edid before we suspend. On resume, the machine should
@@ -355,6 +359,8 @@ test_suspend_resume_edid_change(data_t *data, struct chamelium_port *port,

 	get_connectors_link_status_failed(data, link_status_failed[0]);

+	igt_flush_hotplugs(mon);
+
 	igt_system_suspend_autoresume(state, test);

 	igt_assert(igt_hotplug_detected(mon, HOTPLUG_TIMEOUT));
For the whole series
Reviewed-by: Lyude <lyude@redhat.com>
will push in just a sec
Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
dri-devel@lists.freedesktop.org