Hi Philipp,
On 13.09.2018 01:36, Philipp Zabel wrote:
Hi Stefan,
thank you for the report. I think what happens is the following:
Thanks for looking into it!
On Wed, 2018-09-12 at 15:26 -0700, Stefan Agner wrote:
Hi,
While working on Apalis iMX6 with four displays, I encountered the following crash:
[...]
[ 3.781163] imx-drm display-subsystem: bound 120000.hdmi (ops dw_hdmi_imx_ops) [ 3.788441] imx-drm display-subsystem: binding ldb (ops imx_ldb_ops) [ 3.794902] imx_ldb_bind, imx_ldb ec0da010
component_bind calls devres_open_group right before component->ops->bind
The devmem_kzalloc'ed imx_ldb_channel that contains the encoder is allocated in this devres group.
[ 3.799818] drm_encoder_init, encoder ec0da388 [ 3.804363] drm_encoder_init, ret 0
drm_encoder_init adds the encoder in imx_ldb to the mode_config.encoder_list.
[ 3.807908] imx_ldb_register, encoder ec0da388 [ 3.812432] imx_ldb_register, ret 0 [ 3.815951] imx_ldb_bind encoder LVDS-46, ec0da388 ->func c0e6371c [ 3.822227] imx_ldb_bind, done [ 3.825331] imx-drm display-subsystem: bound ldb (ops imx_ldb_ops) [ 3.831614] imx-drm display-subsystem: binding parallel-display-controller1 (ops imx_pd_ops) [ 3.840285] drm_encoder_init, encoder ec8a2b78 [ 3.844931] imx-drm display-subsystem: Linked as a consumer to panel [ 3.851472] imx-drm display-subsystem: bound parallel-display-controller1 (ops imx_pd_ops) [ 3.859785] imx-drm display-subsystem: binding parallel-display-controller0 (ops imx_pd_ops) [ 3.868561] imx-drm display-subsystem: failed to bind parallel-display-controller0 (ops imx_pd_ops): -517 [ 3.878225] imx-drm display-subsystem: Dropping the link to panel [ 3.884679] imx_ldb_unbind [ 3.887421] imx_ldb_unbind, encoder LVDS-46, ec0da388 ->func c0e6371c [ 3.893950] imx_ldb_unbind, encoder (null), ec0da838 ->func 0 [ 3.899723] imx_ldb_unbind, done
component_unbind calls devres_release_group after component->ops-
unbind, freeing imx_ldb and with it the encoder that is still in the
mode_config.encoder_list
Oh I see, I did not realize that component_unbind releases devres...
[ 3.908345] imx_drm_bind, component_bind_all, ret -517 [ 3.913638] drm_mode_config_cleanup, encoder TMDS-44, ec861894 ->funcs c0e63a18 [ 3.921260] drm_mode_config_cleanup, calling encoder->funcs->destroy! [ 3.927897] drm_encoder_cleanup, encoder ec861894 [ 3.932717] drm_mode_config_cleanup, encoder (null), ec0da388 ->funcs 0
Use after free.
That all makes sense now...
[ 3.939356] drm_mode_config_cleanup, calling encoder->funcs->destroy! [ 3.946050] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[...]
The Apalis iMX6 device tree I work with uses a dumb VGA bridge and a LVDS display directly specified in the ldb node. In the case above I did not compile in the dumb VGA bridge driver, that is the reason why binding parallel-display-controller0 fails.
I suppose this means that component drivers must not allocate the encoder in the component devres group during bind. Not only their own failure, but any other component's failure can cause component_unbind to be called in the cleanup path of component_bind_all, freeing the memory before drm_mode_config_cleanup is called.
Indeed, moving allocation into driver bind fixes the crash!
I wonder, how does devres knows that imx_ldb got allocated in probe and does not free on unbind? I guess that is the group mechanism which makes sure of that?
ldb would not the only driver affected, also drivers/gpu/drm/imx/parallel-display.c allocates encoders in component bind, there are probably more.
Actually, when thinking more about it, if we would have a way to call framework cleanup functions (drm_mode_config_cleanup() in this case) between the bind and unbind loops in component_bind_all(), we would not have that problem.
[adding Russel, since he introduced the component infrastructure]
The LVDS encoder (0xec1a0388 in that case) is somehow already cleaned up when calling drm_mode_config_cleanup(), which leads to a null pointer deference when trying to call destroy.
I was unable to figure out why the DRM encoder in the second drm_mode_config_cleanup() call is already zeroed out. The component framework calls imx_ldb_unbind() first, but that should not free up the encoder afaict. Any idea?
I was also wondering in the deferred probing case, functions like imx_ldb_bind() call devm_kzalloc on every try. Since the underlying device is not removed, doesn't this leads to multiple active allocations?
See above. To me it looks like we have to allocate imx_ldb in probe and deal with possibly partially preinitialzied structure when bind is called a second time.
Yeah with the fact that component framework does free devres at unbind time, my concern stated here is null...
-- Stefan