Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add() - drm_get_connector_name() vs. - anything that generates log messages involving DRM connectors - drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Mar 24 14:23:04 haswell01 kernel: [ 417.570043] ------------[ cut here ]------------ Mar 24 14:23:04 haswell01 kernel: [ 417.570045] WARNING: CPU: 0 PID: 12700 at /build/buildd/linux-3.13.0/lib/kobject.c:223 kobject_add_internal+0x224/0x330() Mar 24 14:23:04 haswell01 kernel: [ 417.570046] kobject_add_internal failed for card0-VGA-1 with -EEXIST, don't try to register things with the same name in the same directory. Mar 24 14:23:04 haswell01 kernel: [ 417.570047] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_realtek snd_hda_codec_hdmi bnep rfcomm bluetooth x86_pkg_temp_thermal intel_p owerclamp coretemp kvm_intel snd_hda_codec kvm snd_hwdep snd_pcm hid_generic snd_page_alloc crct10dif_pclmul snd_seq_midi crc32_pclmul ghash_clmulni_intel snd_seq_midi_event usbhid snd_rawmidi hid aesni_in tel aes_x86_64 lrw gf128mul ppdev glue_helper ablk_helper cryptd snd_seq snd_seq_device snd_timer snd mei_me psmouse lpc_ich soundcore mei mac_hid parport_pc serio_raw nls_iso8859_1 lp parport e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 24 14:23:04 haswell01 kernel: [ 417.570068] CPU: 0 PID: 12700 Comm: modprobe Tainted: G W 3.13.0-19-generic #39-Ubuntu Mar 24 14:23:04 haswell01 kernel: [ 417.570069] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 24 14:23:04 haswell01 kernel: [ 417.570069] 0000000000000009 ffff8804051295f8 ffffffff81711075 ffff880405129640 Mar 24 14:23:04 haswell01 kernel: [ 417.570071] ffff880405129630 ffffffff810662cd ffff88040776a410 00000000ffffffef Mar 24 14:23:04 haswell01 kernel: [ 417.570074] 0000000000000000 ffff8804048dcc10 ffff880407769000 ffff880405129690 Mar 24 14:23:04 haswell01 kernel: [ 417.570076] Call Trace: Mar 24 14:23:04 haswell01 kernel: [ 417.570078] [<ffffffff81711075>] dump_stack+0x45/0x56 Mar 24 14:23:04 haswell01 kernel: [ 417.570080] [<ffffffff810662cd>] warn_slowpath_common+0x7d/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570081] [<ffffffff8106633c>] warn_slowpath_fmt+0x4c/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570082] [<ffffffff81230083>] ? sysfs_create_dir_ns+0x73/0xc0 Mar 24 14:23:04 haswell01 kernel: [ 417.570084] [<ffffffff8135b9a4>] kobject_add_internal+0x224/0x330 Mar 24 14:23:04 haswell01 kernel: [ 417.570086] [<ffffffff8135bed5>] kobject_add+0x65/0xb0 Mar 24 14:23:04 haswell01 kernel: [ 417.570088] [<ffffffff814874f5>] device_add+0x125/0x640 Mar 24 14:23:04 haswell01 kernel: [ 417.570090] [<ffffffff81487c20>] device_create_groups_vargs+0xe0/0x110 Mar 24 14:23:04 haswell01 kernel: [ 417.570092] [<ffffffff81487cb1>] device_create+0x41/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570097] [<ffffffffa0137fc9>] drm_sysfs_connector_add+0x69/0x230 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570110] [<ffffffffa0549ca1>] intel_hdmi_init_connector+0x111/0x260 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570119] [<ffffffffa0541670>] intel_ddi_init+0x270/0x2a0 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570130] [<ffffffffa0533176>] intel_setup_outputs+0x4c6/0x750 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570139] [<ffffffffa05370f7>] intel_modeset_init+0x607/0x8f0 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570147] [<ffffffffa04f90a4>] i915_driver_load+0xbb4/0xe70 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570153] [<ffffffffa0134cd2>] drm_dev_register+0xa2/0x1e0 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570158] [<ffffffffa0136bc2>] drm_get_pci_dev+0x92/0x140 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570166] [<ffffffffa04f567c>] i915_pci_probe+0x3c/0x90 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570168] [<ffffffff8139e0e5>] local_pci_probe+0x45/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570170] [<ffffffff8139f385>] ? pci_match_device+0xc5/0xd0 Mar 24 14:23:04 haswell01 kernel: [ 417.570172] [<ffffffff8139f4a9>] pci_device_probe+0xd9/0x130 Mar 24 14:23:04 haswell01 kernel: [ 417.570174] [<ffffffff8148a7c5>] driver_probe_device+0x125/0x3b0 Mar 24 14:23:04 haswell01 kernel: [ 417.570176] [<ffffffff8148ab23>] __driver_attach+0x93/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570178] [<ffffffff8148aa90>] ? __device_attach+0x40/0x40 Mar 24 14:23:04 haswell01 kernel: [ 417.570179] [<ffffffff81488733>] bus_for_each_dev+0x63/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570181] [<ffffffff8148a17e>] driver_attach+0x1e/0x20 Mar 24 14:23:04 haswell01 kernel: [ 417.570183] [<ffffffff81489d60>] bus_add_driver+0x180/0x250 Mar 24 14:23:04 haswell01 kernel: [ 417.570185] [<ffffffffa01be000>] ? 0xffffffffa01bdfff Mar 24 14:23:04 haswell01 kernel: [ 417.570187] [<ffffffff8148b1a4>] driver_register+0x64/0xf0 Mar 24 14:23:04 haswell01 kernel: [ 417.570189] [<ffffffffa01be000>] ? 0xffffffffa01bdfff Mar 24 14:23:04 haswell01 kernel: [ 417.570191] [<ffffffff8139da7c>] __pci_register_driver+0x4c/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570196] [<ffffffffa0136d8a>] drm_pci_init+0x11a/0x130 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570198] [<ffffffffa01be000>] ? 0xffffffffa01bdfff Mar 24 14:23:04 haswell01 kernel: [ 417.570205] [<ffffffffa01be066>] i915_init+0x66/0x68 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570207] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 Mar 24 14:23:04 haswell01 kernel: [ 417.570208] [<ffffffff81058ae3>] ? set_memory_nx+0x43/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570211] [<ffffffff810e091d>] load_module+0x12dd/0x1b40 Mar 24 14:23:04 haswell01 kernel: [ 417.570213] [<ffffffff810dc3a0>] ? store_uevent+0x40/0x40 Mar 24 14:23:04 haswell01 kernel: [ 417.570215] [<ffffffff810e12f6>] SyS_finit_module+0x86/0xb0 Mar 24 14:23:04 haswell01 kernel: [ 417.570217] [<ffffffff81721c7f>] tracesys+0xe1/0xe6 Mar 24 14:23:04 haswell01 kernel: [ 417.570219] ---[ end trace 8cd466c13137554f ]---
How to reproduce: load/reload i915 driver many times (in my case it happens after 2-200 attempts) and you will got a sysfs dup warning and then while unloading driver it will crash (because of malformed connectors list):
Mar 24 14:23:16 haswell01 kernel: [ 429.326174] BUG: unable to handle kernel NULL pointer dereference at 000000000000002f Mar 24 14:23:16 haswell01 kernel: [ 429.326177] IP: [<ffffffff8122de46>] sysfs_remove_file_ns+0x6/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326184] PGD 3f588d067 PUD 406361067 PMD 0 Mar 24 14:23:16 haswell01 kernel: [ 429.326187] Oops: 0000 [#1] SMP Mar 24 14:23:16 haswell01 kernel: [ 429.326189] Modules linked in: i915 video drm_kms_helper drm i2c_algo_bit snd_hda_codec_realtek snd_hda_codec_hdmi bnep rfcomm bluetooth x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec kvm snd_hwdep snd_pcm hid_generic snd_page_alloc crct10dif_pclmul snd_seq_midi crc32_pclmul ghash_clmulni_intel snd_seq_midi_event usbhid snd_rawmidi hid aesni_intel aes_x86_64 lrw gf128mul ppdev glue_helper ablk_helper cryptd snd_seq snd_seq_device snd_timer snd mei_me psmouse lpc_ich soundcore mei mac_hid parport_pc serio_raw nls_iso8859_1 lp parport e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 24 14:23:16 haswell01 kernel: [ 429.326219] CPU: 0 PID: 13302 Comm: dd Tainted: G W 3.13.0-19-generic #39-Ubuntu Mar 24 14:23:16 haswell01 kernel: [ 429.326221] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 24 14:23:16 haswell01 kernel: [ 429.326222] task: ffff8803f5aaafe0 ti: ffff880404d20000 task.ti: ffff880404d20000 Mar 24 14:23:16 haswell01 kernel: [ 429.326224] RIP: 0010:[<ffffffff8122de46>] [<ffffffff8122de46>] sysfs_remove_file_ns+0x6/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326227] RSP: 0018:ffff880404d21d20 EFLAGS: 00010246 Mar 24 14:23:16 haswell01 kernel: [ 429.326228] RAX: 0000000000000000 RBX: ffffffffa0160160 RCX: ffffffffa0159356 Mar 24 14:23:16 haswell01 kernel: [ 429.326230] RDX: 0000000000000000 RSI: ffffffffa0160140 RDI: ffffffffffffffff Mar 24 14:23:16 haswell01 kernel: [ 429.326231] RBP: ffff880404d21d30 R08: ffffffffa0161120 R09: 000000000000fffe Mar 24 14:23:16 haswell01 kernel: [ 429.326232] R10: 0000000000000000 R11: ffffea0010199f80 R12: ffff880407769000 Mar 24 14:23:16 haswell01 kernel: [ 429.326234] R13: ffff880035d46af8 R14: ffff880035d46820 R15: ffffffffffffffed Mar 24 14:23:16 haswell01 kernel: [ 429.326236] FS: 00007f472f4ec740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 Mar 24 14:23:16 haswell01 kernel: [ 429.326238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 24 14:23:16 haswell01 kernel: [ 429.326239] CR2: 000000000000002f CR3: 00000004050d3000 CR4: 00000000001407f0 Mar 24 14:23:16 haswell01 kernel: [ 429.326241] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 24 14:23:16 haswell01 kernel: [ 429.326242] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Mar 24 14:23:16 haswell01 kernel: [ 429.326243] Stack: Mar 24 14:23:16 haswell01 kernel: [ 429.326244] ffff880404d21d30 ffffffff81485a59 ffff880404d21d50 ffffffffa0137eb7 Mar 24 14:23:16 haswell01 kernel: [ 429.326247] ffff880407769000 ffff880035d46800 ffff880404d21d80 ffffffffa05381d0 Mar 24 14:23:16 haswell01 kernel: [ 429.326250] ffff8803f24e4000 ffff880035d46800 ffff8804081b9000 000000000000000c Mar 24 14:23:16 haswell01 kernel: [ 429.326253] Call Trace: Mar 24 14:23:16 haswell01 kernel: [ 429.326257] [<ffffffff81485a59>] ? device_remove_file+0x19/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326267] [<ffffffffa0137eb7>] drm_sysfs_connector_remove+0x57/0x90 [drm] Mar 24 14:23:16 haswell01 kernel: [ 429.326282] [<ffffffffa05381d0>] intel_modeset_cleanup+0xd0/0x100 [i915] Mar 24 14:23:16 haswell01 kernel: [ 429.326288] [<ffffffffa04f95f0>] i915_driver_unload+0x290/0x340 [i915] Mar 24 14:23:16 haswell01 kernel: [ 429.326294] [<ffffffffa01346fc>] drm_dev_unregister+0x2c/0xe0 [drm] Mar 24 14:23:16 haswell01 kernel: [ 429.326299] [<ffffffffa01347eb>] drm_put_dev+0x3b/0x70 [drm] Mar 24 14:23:16 haswell01 kernel: [ 429.326304] [<ffffffffa04f558d>] i915_pci_remove+0x1d/0x20 [i915] Mar 24 14:23:16 haswell01 kernel: [ 429.326307] [<ffffffff8139efbb>] pci_device_remove+0x3b/0xb0 Mar 24 14:23:16 haswell01 kernel: [ 429.326310] [<ffffffff8148a24f>] __device_release_driver+0x7f/0xf0 Mar 24 14:23:16 haswell01 kernel: [ 429.326313] [<ffffffff8148a2e3>] device_release_driver+0x23/0x30 Mar 24 14:23:16 haswell01 kernel: [ 429.326315] [<ffffffff8148905d>] unbind_store+0xbd/0xe0 Mar 24 14:23:16 haswell01 kernel: [ 429.326317] [<ffffffff81488484>] drv_attr_store+0x24/0x40 Mar 24 14:23:16 haswell01 kernel: [ 429.326320] [<ffffffff8122e698>] sysfs_write_file+0x128/0x1c0 Mar 24 14:23:16 haswell01 kernel: [ 429.326323] [<ffffffff811b88c4>] vfs_write+0xb4/0x1f0 Mar 24 14:23:16 haswell01 kernel: [ 429.326325] [<ffffffff811b92f9>] SyS_write+0x49/0xa0 Mar 24 14:23:16 haswell01 kernel: [ 429.326328] [<ffffffff81721c7f>] tracesys+0xe1/0xe6 Mar 24 14:23:16 haswell01 kernel: [ 429.326329] Code: 58 c7 81 e8 7d 98 4e 00 48 83 c4 50 89 d8 5b 41 5c 41 5d 5d c3 bb fe ff ff ff eb e0 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <48> 8b 7f 30 48 8b 36 48 89 e5 e8 bb 22 00 00 5d c3 66 0f 1f 84 Mar 24 14:23:16 haswell01 kernel: [ 429.326351] RIP [<ffffffff8122de46>] sysfs_remove_file_ns+0x6/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326353] RSP <ffff880404d21d20> Mar 24 14:23:16 haswell01 kernel: [ 429.326355] CR2: 000000000000002f Mar 24 14:23:16 haswell01 kernel: [ 429.326356] ---[ end trace 8cd466c131375550 ]---
On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add()
- drm_get_connector_name()
vs.
- anything that generates log messages involving DRM connectors
- drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Yeah, in my recent kerneldoc series I've noticed this too and added FIXME comments. There's a lot more than just those you've pointed out. The problem is that fixing these will be an awful lot of work since you need to add proper cleanup code (calling kfree()) to all the required places.
So a full subsystem wide code audit for every single use-site of these. -Daniel
Mar 24 14:23:04 haswell01 kernel: [ 417.570043] ------------[ cut here ]------------ Mar 24 14:23:04 haswell01 kernel: [ 417.570045] WARNING: CPU: 0 PID: 12700 at /build/buildd/linux-3.13.0/lib/kobject.c:223 kobject_add_internal+0x224/0x330() Mar 24 14:23:04 haswell01 kernel: [ 417.570046] kobject_add_internal failed for card0-VGA-1 with -EEXIST, don't try to register things with the same name in the same directory. Mar 24 14:23:04 haswell01 kernel: [ 417.570047] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_realtek snd_hda_codec_hdmi bnep rfcomm bluetooth x86_pkg_temp_thermal intel_p owerclamp coretemp kvm_intel snd_hda_codec kvm snd_hwdep snd_pcm hid_generic snd_page_alloc crct10dif_pclmul snd_seq_midi crc32_pclmul ghash_clmulni_intel snd_seq_midi_event usbhid snd_rawmidi hid aesni_in tel aes_x86_64 lrw gf128mul ppdev glue_helper ablk_helper cryptd snd_seq snd_seq_device snd_timer snd mei_me psmouse lpc_ich soundcore mei mac_hid parport_pc serio_raw nls_iso8859_1 lp parport e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 24 14:23:04 haswell01 kernel: [ 417.570068] CPU: 0 PID: 12700 Comm: modprobe Tainted: G W 3.13.0-19-generic #39-Ubuntu Mar 24 14:23:04 haswell01 kernel: [ 417.570069] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 24 14:23:04 haswell01 kernel: [ 417.570069] 0000000000000009 ffff8804051295f8 ffffffff81711075 ffff880405129640 Mar 24 14:23:04 haswell01 kernel: [ 417.570071] ffff880405129630 ffffffff810662cd ffff88040776a410 00000000ffffffef Mar 24 14:23:04 haswell01 kernel: [ 417.570074] 0000000000000000 ffff8804048dcc10 ffff880407769000 ffff880405129690 Mar 24 14:23:04 haswell01 kernel: [ 417.570076] Call Trace: Mar 24 14:23:04 haswell01 kernel: [ 417.570078] [<ffffffff81711075>] dump_stack+0x45/0x56 Mar 24 14:23:04 haswell01 kernel: [ 417.570080] [<ffffffff810662cd>] warn_slowpath_common+0x7d/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570081] [<ffffffff8106633c>] warn_slowpath_fmt+0x4c/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570082] [<ffffffff81230083>] ? sysfs_create_dir_ns+0x73/0xc0 Mar 24 14:23:04 haswell01 kernel: [ 417.570084] [<ffffffff8135b9a4>] kobject_add_internal+0x224/0x330 Mar 24 14:23:04 haswell01 kernel: [ 417.570086] [<ffffffff8135bed5>] kobject_add+0x65/0xb0 Mar 24 14:23:04 haswell01 kernel: [ 417.570088] [<ffffffff814874f5>] device_add+0x125/0x640 Mar 24 14:23:04 haswell01 kernel: [ 417.570090] [<ffffffff81487c20>] device_create_groups_vargs+0xe0/0x110 Mar 24 14:23:04 haswell01 kernel: [ 417.570092] [<ffffffff81487cb1>] device_create+0x41/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570097] [<ffffffffa0137fc9>] drm_sysfs_connector_add+0x69/0x230 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570110] [<ffffffffa0549ca1>] intel_hdmi_init_connector+0x111/0x260 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570119] [<ffffffffa0541670>] intel_ddi_init+0x270/0x2a0 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570130] [<ffffffffa0533176>] intel_setup_outputs+0x4c6/0x750 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570139] [<ffffffffa05370f7>] intel_modeset_init+0x607/0x8f0 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570147] [<ffffffffa04f90a4>] i915_driver_load+0xbb4/0xe70 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570153] [<ffffffffa0134cd2>] drm_dev_register+0xa2/0x1e0 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570158] [<ffffffffa0136bc2>] drm_get_pci_dev+0x92/0x140 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570166] [<ffffffffa04f567c>] i915_pci_probe+0x3c/0x90 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570168] [<ffffffff8139e0e5>] local_pci_probe+0x45/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570170] [<ffffffff8139f385>] ? pci_match_device+0xc5/0xd0 Mar 24 14:23:04 haswell01 kernel: [ 417.570172] [<ffffffff8139f4a9>] pci_device_probe+0xd9/0x130 Mar 24 14:23:04 haswell01 kernel: [ 417.570174] [<ffffffff8148a7c5>] driver_probe_device+0x125/0x3b0 Mar 24 14:23:04 haswell01 kernel: [ 417.570176] [<ffffffff8148ab23>] __driver_attach+0x93/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570178] [<ffffffff8148aa90>] ? __device_attach+0x40/0x40 Mar 24 14:23:04 haswell01 kernel: [ 417.570179] [<ffffffff81488733>] bus_for_each_dev+0x63/0xa0 Mar 24 14:23:04 haswell01 kernel: [ 417.570181] [<ffffffff8148a17e>] driver_attach+0x1e/0x20 Mar 24 14:23:04 haswell01 kernel: [ 417.570183] [<ffffffff81489d60>] bus_add_driver+0x180/0x250 Mar 24 14:23:04 haswell01 kernel: [ 417.570185] [<ffffffffa01be000>] ? 0xffffffffa01bdfff Mar 24 14:23:04 haswell01 kernel: [ 417.570187] [<ffffffff8148b1a4>] driver_register+0x64/0xf0 Mar 24 14:23:04 haswell01 kernel: [ 417.570189] [<ffffffffa01be000>] ? 0xffffffffa01bdfff Mar 24 14:23:04 haswell01 kernel: [ 417.570191] [<ffffffff8139da7c>] __pci_register_driver+0x4c/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570196] [<ffffffffa0136d8a>] drm_pci_init+0x11a/0x130 [drm] Mar 24 14:23:04 haswell01 kernel: [ 417.570198] [<ffffffffa01be000>] ? 0xffffffffa01bdfff Mar 24 14:23:04 haswell01 kernel: [ 417.570205] [<ffffffffa01be066>] i915_init+0x66/0x68 [i915] Mar 24 14:23:04 haswell01 kernel: [ 417.570207] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 Mar 24 14:23:04 haswell01 kernel: [ 417.570208] [<ffffffff81058ae3>] ? set_memory_nx+0x43/0x50 Mar 24 14:23:04 haswell01 kernel: [ 417.570211] [<ffffffff810e091d>] load_module+0x12dd/0x1b40 Mar 24 14:23:04 haswell01 kernel: [ 417.570213] [<ffffffff810dc3a0>] ? store_uevent+0x40/0x40 Mar 24 14:23:04 haswell01 kernel: [ 417.570215] [<ffffffff810e12f6>] SyS_finit_module+0x86/0xb0 Mar 24 14:23:04 haswell01 kernel: [ 417.570217] [<ffffffff81721c7f>] tracesys+0xe1/0xe6 Mar 24 14:23:04 haswell01 kernel: [ 417.570219] ---[ end trace 8cd466c13137554f ]---
How to reproduce: load/reload i915 driver many times (in my case it happens after 2-200 attempts) and you will got a sysfs dup warning and then while unloading driver it will crash (because of malformed connectors list):
Mar 24 14:23:16 haswell01 kernel: [ 429.326174] BUG: unable to handle kernel NULL pointer dereference at 000000000000002f Mar 24 14:23:16 haswell01 kernel: [ 429.326177] IP: [<ffffffff8122de46>] sysfs_remove_file_ns+0x6/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326184] PGD 3f588d067 PUD 406361067 PMD 0 Mar 24 14:23:16 haswell01 kernel: [ 429.326187] Oops: 0000 [#1] SMP Mar 24 14:23:16 haswell01 kernel: [ 429.326189] Modules linked in: i915 video drm_kms_helper drm i2c_algo_bit snd_hda_codec_realtek snd_hda_codec_hdmi bnep rfcomm bluetooth x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec kvm snd_hwdep snd_pcm hid_generic snd_page_alloc crct10dif_pclmul snd_seq_midi crc32_pclmul ghash_clmulni_intel snd_seq_midi_event usbhid snd_rawmidi hid aesni_intel aes_x86_64 lrw gf128mul ppdev glue_helper ablk_helper cryptd snd_seq snd_seq_device snd_timer snd mei_me psmouse lpc_ich soundcore mei mac_hid parport_pc serio_raw nls_iso8859_1 lp parport e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 24 14:23:16 haswell01 kernel: [ 429.326219] CPU: 0 PID: 13302 Comm: dd Tainted: G W 3.13.0-19-generic #39-Ubuntu Mar 24 14:23:16 haswell01 kernel: [ 429.326221] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 24 14:23:16 haswell01 kernel: [ 429.326222] task: ffff8803f5aaafe0 ti: ffff880404d20000 task.ti: ffff880404d20000 Mar 24 14:23:16 haswell01 kernel: [ 429.326224] RIP: 0010:[<ffffffff8122de46>] [<ffffffff8122de46>] sysfs_remove_file_ns+0x6/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326227] RSP: 0018:ffff880404d21d20 EFLAGS: 00010246 Mar 24 14:23:16 haswell01 kernel: [ 429.326228] RAX: 0000000000000000 RBX: ffffffffa0160160 RCX: ffffffffa0159356 Mar 24 14:23:16 haswell01 kernel: [ 429.326230] RDX: 0000000000000000 RSI: ffffffffa0160140 RDI: ffffffffffffffff Mar 24 14:23:16 haswell01 kernel: [ 429.326231] RBP: ffff880404d21d30 R08: ffffffffa0161120 R09: 000000000000fffe Mar 24 14:23:16 haswell01 kernel: [ 429.326232] R10: 0000000000000000 R11: ffffea0010199f80 R12: ffff880407769000 Mar 24 14:23:16 haswell01 kernel: [ 429.326234] R13: ffff880035d46af8 R14: ffff880035d46820 R15: ffffffffffffffed Mar 24 14:23:16 haswell01 kernel: [ 429.326236] FS: 00007f472f4ec740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 Mar 24 14:23:16 haswell01 kernel: [ 429.326238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 24 14:23:16 haswell01 kernel: [ 429.326239] CR2: 000000000000002f CR3: 00000004050d3000 CR4: 00000000001407f0 Mar 24 14:23:16 haswell01 kernel: [ 429.326241] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 24 14:23:16 haswell01 kernel: [ 429.326242] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Mar 24 14:23:16 haswell01 kernel: [ 429.326243] Stack: Mar 24 14:23:16 haswell01 kernel: [ 429.326244] ffff880404d21d30 ffffffff81485a59 ffff880404d21d50 ffffffffa0137eb7 Mar 24 14:23:16 haswell01 kernel: [ 429.326247] ffff880407769000 ffff880035d46800 ffff880404d21d80 ffffffffa05381d0 Mar 24 14:23:16 haswell01 kernel: [ 429.326250] ffff8803f24e4000 ffff880035d46800 ffff8804081b9000 000000000000000c Mar 24 14:23:16 haswell01 kernel: [ 429.326253] Call Trace: Mar 24 14:23:16 haswell01 kernel: [ 429.326257] [<ffffffff81485a59>] ? device_remove_file+0x19/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326267] [<ffffffffa0137eb7>] drm_sysfs_connector_remove+0x57/0x90 [drm] Mar 24 14:23:16 haswell01 kernel: [ 429.326282] [<ffffffffa05381d0>] intel_modeset_cleanup+0xd0/0x100 [i915] Mar 24 14:23:16 haswell01 kernel: [ 429.326288] [<ffffffffa04f95f0>] i915_driver_unload+0x290/0x340 [i915] Mar 24 14:23:16 haswell01 kernel: [ 429.326294] [<ffffffffa01346fc>] drm_dev_unregister+0x2c/0xe0 [drm] Mar 24 14:23:16 haswell01 kernel: [ 429.326299] [<ffffffffa01347eb>] drm_put_dev+0x3b/0x70 [drm] Mar 24 14:23:16 haswell01 kernel: [ 429.326304] [<ffffffffa04f558d>] i915_pci_remove+0x1d/0x20 [i915] Mar 24 14:23:16 haswell01 kernel: [ 429.326307] [<ffffffff8139efbb>] pci_device_remove+0x3b/0xb0 Mar 24 14:23:16 haswell01 kernel: [ 429.326310] [<ffffffff8148a24f>] __device_release_driver+0x7f/0xf0 Mar 24 14:23:16 haswell01 kernel: [ 429.326313] [<ffffffff8148a2e3>] device_release_driver+0x23/0x30 Mar 24 14:23:16 haswell01 kernel: [ 429.326315] [<ffffffff8148905d>] unbind_store+0xbd/0xe0 Mar 24 14:23:16 haswell01 kernel: [ 429.326317] [<ffffffff81488484>] drv_attr_store+0x24/0x40 Mar 24 14:23:16 haswell01 kernel: [ 429.326320] [<ffffffff8122e698>] sysfs_write_file+0x128/0x1c0 Mar 24 14:23:16 haswell01 kernel: [ 429.326323] [<ffffffff811b88c4>] vfs_write+0xb4/0x1f0 Mar 24 14:23:16 haswell01 kernel: [ 429.326325] [<ffffffff811b92f9>] SyS_write+0x49/0xa0 Mar 24 14:23:16 haswell01 kernel: [ 429.326328] [<ffffffff81721c7f>] tracesys+0xe1/0xe6 Mar 24 14:23:16 haswell01 kernel: [ 429.326329] Code: 58 c7 81 e8 7d 98 4e 00 48 83 c4 50 89 d8 5b 41 5c 41 5d 5d c3 bb fe ff ff ff eb e0 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <48> 8b 7f 30 48 8b 36 48 89 e5 e8 bb 22 00 00 5d c3 66 0f 1f 84 Mar 24 14:23:16 haswell01 kernel: [ 429.326351] RIP [<ffffffff8122de46>] sysfs_remove_file_ns+0x6/0x20 Mar 24 14:23:16 haswell01 kernel: [ 429.326353] RSP <ffff880404d21d20> Mar 24 14:23:16 haswell01 kernel: [ 429.326355] CR2: 000000000000002f Mar 24 14:23:16 haswell01 kernel: [ 429.326356] ---[ end trace 8cd466c131375550 ]---
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Hello, Daniel,
Thank you for response. I've got a couple questions for you:
0. What do you think about making integration test with continuous reloading i915 driver and X server (with FLR between iteration)? Simplified example for ubuntu (root required):
#!/bin/bash service lightdm stop rmmod snd_hda_intel echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000:00:02.0/driver/unbind rmmod i915 echo 1 > /sys/bus/pci/devices/0000:00:02.0/reset modprobe i915 service lightdm start
This will uncover next problems: * sysfs add/remove i2c connectors (parent/child warning) * drm static buffer races * NX bit violation crash (see dump below) * and bunch of DMAR problems
1. Could you point me out git/branch with FIXME comments?
2. About kfree() problem: what if put string buffer onto stack of caller for drm_get_connector_name and drm_get_encoder_name? It will be auto-removed and there is will be the patch about adding new param for these functions. (yes the patch will be big and awful to read)
======================= NX bit violation crash ======================================== Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour dummy device 80x25 Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: unregistered panic notifier Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not supported: please upgradewaiting module removal not supported: please upgrade Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 20060810 Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel paging request at ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 PMD 404902063 PTE 80000003eb27f163 Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: modprobe Not tainted 3.13.6 #10 Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: ffff880406394000 task.ti: ffff880406394000 Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 0010:[<ffff8803eb27fcb0>] [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0 EFLAGS: 00010282 Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: ffff8803f6016000 RCX: ffffffffa0238630 Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: ffffffffa03ba3c4 RDI: ffff8803f6016000 Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 0000000000000000 R09: ffff88041ea172e0 Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 0000000000000000 R12: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: ffffffffa03dd100 R15: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS: 00007f3f392d0740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 00000003e7358000 CR4: 00000000001407f0 Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack: Mar 20 21:53:39 haswell01 kernel: [13457.334052] ffffffffa0213cb2 ffff8803f6016000 ffff880408199000 ffff880408199098 Mar 20 21:53:39 haswell01 kernel: [13457.334054] ffff880406395b60 ffffffffa0215b92 0000000000000004 ffff880408199140 Mar 20 21:53:39 haswell01 kernel: [13457.334057] ffffffffa03b9920 ffff880408199000 0000000000000000 ffffffffa03dd000 Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace: Mar 20 21:53:39 haswell01 kernel: [13457.334067] [<ffffffffa0213cb2>] ? drm_dev_register+0xa2/0x1e0 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334073] [<ffffffffa0215b92>] drm_get_pci_dev+0x92/0x140 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334082] [<ffffffffa033d67c>] i915_pci_probe+0x3c/0x90 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334086] [<ffffffff813982b5>] local_pci_probe+0x45/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334088] [<ffffffff81399555>] ? pci_match_device+0xc5/0xd0 Mar 20 21:53:39 haswell01 kernel: [13457.334090] [<ffffffff81399679>] pci_device_probe+0xd9/0x130 Mar 20 21:53:39 haswell01 kernel: [13457.334093] [<ffffffff81484795>] driver_probe_device+0x125/0x3b0 Mar 20 21:53:39 haswell01 kernel: [13457.334095] [<ffffffff81484af3>] __driver_attach+0x93/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334098] [<ffffffff81484a60>] ? __device_attach+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334100] [<ffffffff81482703>] bus_for_each_dev+0x63/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334102] [<ffffffff8148414e>] driver_attach+0x1e/0x20 Mar 20 21:53:39 haswell01 kernel: [13457.334104] [<ffffffff81483d30>] bus_add_driver+0x180/0x250 Mar 20 21:53:39 haswell01 kernel: [13457.334106] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334109] [<ffffffff81485174>] driver_register+0x64/0xf0 Mar 20 21:53:39 haswell01 kernel: [13457.334110] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334113] [<ffffffff81397c4c>] __pci_register_driver+0x4c/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334117] [<ffffffffa0215d5a>] drm_pci_init+0x11a/0x130 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334119] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334127] [<ffffffffa03fe066>] i915_init+0x66/0x68 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334130] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 Mar 20 21:53:39 haswell01 kernel: [13457.334133] [<ffffffff810589e3>] ? set_memory_nx+0x43/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334137] [<ffffffff810e0630>] load_module+0x2050/0x26f0 Mar 20 21:53:39 haswell01 kernel: [13457.334139] [<ffffffff810dbfa0>] ? store_uevent+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334141] [<ffffffff810e0e46>] SyS_finit_module+0x86/0xb0 Mar 20 21:53:39 haswell01 kernel: [13457.334144] [<ffffffff8171823f>] tracesys+0xe1/0xe6 Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff ff ff ff ff 7f 00 Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334166] RSP <ffff880406395af0> Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace e2598b1fc83a65fd ]---
________________________________________ From: Daniel Vetter daniel.vetter@ffwll.ch on behalf of Daniel Vetter daniel@ffwll.ch Sent: Tuesday, March 25, 2014 12:31 AM To: Dmitry Malkin Cc: dri-devel@lists.freedesktop.org Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add()
- drm_get_connector_name()
vs.
- anything that generates log messages involving DRM connectors
- drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Yeah, in my recent kerneldoc series I've noticed this too and added FIXME comments. There's a lot more than just those you've pointed out. The problem is that fixing these will be an awful lot of work since you need to add proper cleanup code (calling kfree()) to all the required places.
So a full subsystem wide code audit for every single use-site of these. -Daniel
On Tue, Mar 25, 2014 at 08:08:23AM +0000, Dmitry Malkin wrote:
Hello, Daniel,
Thank you for response. I've got a couple questions for you:
- What do you think about making integration test with continuous reloading i915 driver and X server (with FLR between iteration)?
Simplified example for ubuntu (root required):
Module reloading is know to be horribly racy atm. Don't do that in any kind of stress-test situation before fixing up piles of bugs ;-)
Will mean: You'll uncover at _lot_ more than just this. Patches for all this highly welcome though.
Thanks, Daniel
#!/bin/bash service lightdm stop rmmod snd_hda_intel echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000:00:02.0/driver/unbind rmmod i915 echo 1 > /sys/bus/pci/devices/0000:00:02.0/reset modprobe i915 service lightdm start
This will uncover next problems:
- sysfs add/remove i2c connectors (parent/child warning)
- drm static buffer races
- NX bit violation crash (see dump below)
- and bunch of DMAR problems
Could you point me out git/branch with FIXME comments?
About kfree() problem: what if put string buffer onto stack of caller for drm_get_connector_name and drm_get_encoder_name?
It will be auto-removed and there is will be the patch about adding new param for these functions. (yes the patch will be big and awful to read)
======================= NX bit violation crash ======================================== Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour dummy device 80x25 Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: unregistered panic notifier Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not supported: please upgradewaiting module removal not supported: please upgrade Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 20060810 Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel paging request at ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 PMD 404902063 PTE 80000003eb27f163 Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: modprobe Not tainted 3.13.6 #10 Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: ffff880406394000 task.ti: ffff880406394000 Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 0010:[<ffff8803eb27fcb0>] [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0 EFLAGS: 00010282 Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: ffff8803f6016000 RCX: ffffffffa0238630 Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: ffffffffa03ba3c4 RDI: ffff8803f6016000 Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 0000000000000000 R09: ffff88041ea172e0 Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 0000000000000000 R12: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: ffffffffa03dd100 R15: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS: 00007f3f392d0740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 00000003e7358000 CR4: 00000000001407f0 Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack: Mar 20 21:53:39 haswell01 kernel: [13457.334052] ffffffffa0213cb2 ffff8803f6016000 ffff880408199000 ffff880408199098 Mar 20 21:53:39 haswell01 kernel: [13457.334054] ffff880406395b60 ffffffffa0215b92 0000000000000004 ffff880408199140 Mar 20 21:53:39 haswell01 kernel: [13457.334057] ffffffffa03b9920 ffff880408199000 0000000000000000 ffffffffa03dd000 Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace: Mar 20 21:53:39 haswell01 kernel: [13457.334067] [<ffffffffa0213cb2>] ? drm_dev_register+0xa2/0x1e0 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334073] [<ffffffffa0215b92>] drm_get_pci_dev+0x92/0x140 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334082] [<ffffffffa033d67c>] i915_pci_probe+0x3c/0x90 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334086] [<ffffffff813982b5>] local_pci_probe+0x45/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334088] [<ffffffff81399555>] ? pci_match_device+0xc5/0xd0 Mar 20 21:53:39 haswell01 kernel: [13457.334090] [<ffffffff81399679>] pci_device_probe+0xd9/0x130 Mar 20 21:53:39 haswell01 kernel: [13457.334093] [<ffffffff81484795>] driver_probe_device+0x125/0x3b0 Mar 20 21:53:39 haswell01 kernel: [13457.334095] [<ffffffff81484af3>] __driver_attach+0x93/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334098] [<ffffffff81484a60>] ? __device_attach+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334100] [<ffffffff81482703>] bus_for_each_dev+0x63/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334102] [<ffffffff8148414e>] driver_attach+0x1e/0x20 Mar 20 21:53:39 haswell01 kernel: [13457.334104] [<ffffffff81483d30>] bus_add_driver+0x180/0x250 Mar 20 21:53:39 haswell01 kernel: [13457.334106] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334109] [<ffffffff81485174>] driver_register+0x64/0xf0 Mar 20 21:53:39 haswell01 kernel: [13457.334110] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334113] [<ffffffff81397c4c>] __pci_register_driver+0x4c/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334117] [<ffffffffa0215d5a>] drm_pci_init+0x11a/0x130 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334119] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334127] [<ffffffffa03fe066>] i915_init+0x66/0x68 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334130] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 Mar 20 21:53:39 haswell01 kernel: [13457.334133] [<ffffffff810589e3>] ? set_memory_nx+0x43/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334137] [<ffffffff810e0630>] load_module+0x2050/0x26f0 Mar 20 21:53:39 haswell01 kernel: [13457.334139] [<ffffffff810dbfa0>] ? store_uevent+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334141] [<ffffffff810e0e46>] SyS_finit_module+0x86/0xb0 Mar 20 21:53:39 haswell01 kernel: [13457.334144] [<ffffffff8171823f>] tracesys+0xe1/0xe6 Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff ff ff ff ff 7f 00 Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334166] RSP <ffff880406395af0> Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace e2598b1fc83a65fd ]---
From: Daniel Vetter daniel.vetter@ffwll.ch on behalf of Daniel Vetter daniel@ffwll.ch Sent: Tuesday, March 25, 2014 12:31 AM To: Dmitry Malkin Cc: dri-devel@lists.freedesktop.org Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add()
- drm_get_connector_name()
vs.
- anything that generates log messages involving DRM connectors
- drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Yeah, in my recent kerneldoc series I've noticed this too and added FIXME comments. There's a lot more than just those you've pointed out. The problem is that fixing these will be an awful lot of work since you need to add proper cleanup code (calling kfree()) to all the required places.
So a full subsystem wide code audit for every single use-site of these. -Daniel
Hi, will you accept bash scripts for reloading driver/X/FLR for intel-gpu-tools to uncover exists and future bugs besides those patches? ________________________________________ From: Daniel Vetter daniel.vetter@ffwll.ch on behalf of Daniel Vetter daniel@ffwll.ch Sent: Tuesday, March 25, 2014 2:43 PM To: Dmitry Malkin Cc: Daniel Vetter; dri-devel@lists.freedesktop.org Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
On Tue, Mar 25, 2014 at 08:08:23AM +0000, Dmitry Malkin wrote:
Hello, Daniel,
Thank you for response. I've got a couple questions for you:
- What do you think about making integration test with continuous reloading i915 driver and X server (with FLR between iteration)?
Simplified example for ubuntu (root required):
Module reloading is know to be horribly racy atm. Don't do that in any kind of stress-test situation before fixing up piles of bugs ;-)
Will mean: You'll uncover at _lot_ more than just this. Patches for all this highly welcome though.
Thanks, Daniel
#!/bin/bash service lightdm stop rmmod snd_hda_intel echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000:00:02.0/driver/unbind rmmod i915 echo 1 > /sys/bus/pci/devices/0000:00:02.0/reset modprobe i915 service lightdm start
This will uncover next problems:
- sysfs add/remove i2c connectors (parent/child warning)
- drm static buffer races
- NX bit violation crash (see dump below)
- and bunch of DMAR problems
Could you point me out git/branch with FIXME comments?
About kfree() problem: what if put string buffer onto stack of caller for drm_get_connector_name and drm_get_encoder_name?
It will be auto-removed and there is will be the patch about adding new param for these functions. (yes the patch will be big and awful to read)
======================= NX bit violation crash ======================================== Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour dummy device 80x25 Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: unregistered panic notifier Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not supported: please upgradewaiting module removal not supported: please upgrade Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 20060810 Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel paging request at ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 PMD 404902063 PTE 80000003eb27f163 Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: modprobe Not tainted 3.13.6 #10 Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: ffff880406394000 task.ti: ffff880406394000 Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 0010:[<ffff8803eb27fcb0>] [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0 EFLAGS: 00010282 Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: ffff8803f6016000 RCX: ffffffffa0238630 Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: ffffffffa03ba3c4 RDI: ffff8803f6016000 Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 0000000000000000 R09: ffff88041ea172e0 Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 0000000000000000 R12: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: ffffffffa03dd100 R15: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS: 00007f3f392d0740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 00000003e7358000 CR4: 00000000001407f0 Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack: Mar 20 21:53:39 haswell01 kernel: [13457.334052] ffffffffa0213cb2 ffff8803f6016000 ffff880408199000 ffff880408199098 Mar 20 21:53:39 haswell01 kernel: [13457.334054] ffff880406395b60 ffffffffa0215b92 0000000000000004 ffff880408199140 Mar 20 21:53:39 haswell01 kernel: [13457.334057] ffffffffa03b9920 ffff880408199000 0000000000000000 ffffffffa03dd000 Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace: Mar 20 21:53:39 haswell01 kernel: [13457.334067] [<ffffffffa0213cb2>] ? drm_dev_register+0xa2/0x1e0 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334073] [<ffffffffa0215b92>] drm_get_pci_dev+0x92/0x140 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334082] [<ffffffffa033d67c>] i915_pci_probe+0x3c/0x90 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334086] [<ffffffff813982b5>] local_pci_probe+0x45/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334088] [<ffffffff81399555>] ? pci_match_device+0xc5/0xd0 Mar 20 21:53:39 haswell01 kernel: [13457.334090] [<ffffffff81399679>] pci_device_probe+0xd9/0x130 Mar 20 21:53:39 haswell01 kernel: [13457.334093] [<ffffffff81484795>] driver_probe_device+0x125/0x3b0 Mar 20 21:53:39 haswell01 kernel: [13457.334095] [<ffffffff81484af3>] __driver_attach+0x93/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334098] [<ffffffff81484a60>] ? __device_attach+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334100] [<ffffffff81482703>] bus_for_each_dev+0x63/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334102] [<ffffffff8148414e>] driver_attach+0x1e/0x20 Mar 20 21:53:39 haswell01 kernel: [13457.334104] [<ffffffff81483d30>] bus_add_driver+0x180/0x250 Mar 20 21:53:39 haswell01 kernel: [13457.334106] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334109] [<ffffffff81485174>] driver_register+0x64/0xf0 Mar 20 21:53:39 haswell01 kernel: [13457.334110] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334113] [<ffffffff81397c4c>] __pci_register_driver+0x4c/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334117] [<ffffffffa0215d5a>] drm_pci_init+0x11a/0x130 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334119] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334127] [<ffffffffa03fe066>] i915_init+0x66/0x68 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334130] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 Mar 20 21:53:39 haswell01 kernel: [13457.334133] [<ffffffff810589e3>] ? set_memory_nx+0x43/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334137] [<ffffffff810e0630>] load_module+0x2050/0x26f0 Mar 20 21:53:39 haswell01 kernel: [13457.334139] [<ffffffff810dbfa0>] ? store_uevent+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334141] [<ffffffff810e0e46>] SyS_finit_module+0x86/0xb0 Mar 20 21:53:39 haswell01 kernel: [13457.334144] [<ffffffff8171823f>] tracesys+0xe1/0xe6 Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff ff ff ff ff 7f 00 Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334166] RSP <ffff880406395af0> Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace e2598b1fc83a65fd ]---
From: Daniel Vetter daniel.vetter@ffwll.ch on behalf of Daniel Vetter daniel@ffwll.ch Sent: Tuesday, March 25, 2014 12:31 AM To: Dmitry Malkin Cc: dri-devel@lists.freedesktop.org Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add()
- drm_get_connector_name()
vs.
- anything that generates log messages involving DRM connectors
- drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Yeah, in my recent kerneldoc series I've noticed this too and added FIXME comments. There's a lot more than just those you've pointed out. The problem is that fixing these will be an awful lot of work since you need to add proper cleanup code (calling kfree()) to all the required places.
So a full subsystem wide code audit for every single use-site of these. -Daniel
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, Mar 25, 2014 at 11:23:46AM +0000, Dmitry Malkin wrote:
Hi, will you accept bash scripts for reloading driver/X/FLR for intel-gpu-tools to uncover exists and future bugs besides those patches?
i-g-t/tests/drv_module_reload we have already, so basic coverage is there. It it's run in our nightly test runs on drm-intel-nightly.
But that scripts tries very hard not to cause a race so more fancy stuff is still needed. But until we have a more solid driver I don't think there's much point for crazier scripts, it'll just cause headaches. -Daniel
From: Daniel Vetter daniel.vetter@ffwll.ch on behalf of Daniel Vetter daniel@ffwll.ch Sent: Tuesday, March 25, 2014 2:43 PM To: Dmitry Malkin Cc: Daniel Vetter; dri-devel@lists.freedesktop.org Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
On Tue, Mar 25, 2014 at 08:08:23AM +0000, Dmitry Malkin wrote:
Hello, Daniel,
Thank you for response. I've got a couple questions for you:
- What do you think about making integration test with continuous reloading i915 driver and X server (with FLR between iteration)?
Simplified example for ubuntu (root required):
Module reloading is know to be horribly racy atm. Don't do that in any kind of stress-test situation before fixing up piles of bugs ;-)
Will mean: You'll uncover at _lot_ more than just this. Patches for all this highly welcome though.
Thanks, Daniel
#!/bin/bash service lightdm stop rmmod snd_hda_intel echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000:00:02.0/driver/unbind rmmod i915 echo 1 > /sys/bus/pci/devices/0000:00:02.0/reset modprobe i915 service lightdm start
This will uncover next problems:
- sysfs add/remove i2c connectors (parent/child warning)
- drm static buffer races
- NX bit violation crash (see dump below)
- and bunch of DMAR problems
Could you point me out git/branch with FIXME comments?
About kfree() problem: what if put string buffer onto stack of caller for drm_get_connector_name and drm_get_encoder_name?
It will be auto-removed and there is will be the patch about adding new param for these functions. (yes the patch will be big and awful to read)
======================= NX bit violation crash ======================================== Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour dummy device 80x25 Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: unregistered panic notifier Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not supported: please upgrade Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not supported: please upgradewaiting module removal not supported: please upgrade Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 20060810 Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel paging request at ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 PMD 404902063 PTE 80000003eb27f163 Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: video] Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: modprobe Not tainted 3.13.6 #10 Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: ffff880406394000 task.ti: ffff880406394000 Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 0010:[<ffff8803eb27fcb0>] [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0 EFLAGS: 00010282 Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: ffff8803f6016000 RCX: ffffffffa0238630 Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: ffffffffa03ba3c4 RDI: ffff8803f6016000 Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 0000000000000000 R09: ffff88041ea172e0 Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 0000000000000000 R12: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: ffffffffa03dd100 R15: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS: 00007f3f392d0740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 00000003e7358000 CR4: 00000000001407f0 Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack: Mar 20 21:53:39 haswell01 kernel: [13457.334052] ffffffffa0213cb2 ffff8803f6016000 ffff880408199000 ffff880408199098 Mar 20 21:53:39 haswell01 kernel: [13457.334054] ffff880406395b60 ffffffffa0215b92 0000000000000004 ffff880408199140 Mar 20 21:53:39 haswell01 kernel: [13457.334057] ffffffffa03b9920 ffff880408199000 0000000000000000 ffffffffa03dd000 Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace: Mar 20 21:53:39 haswell01 kernel: [13457.334067] [<ffffffffa0213cb2>] ? drm_dev_register+0xa2/0x1e0 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334073] [<ffffffffa0215b92>] drm_get_pci_dev+0x92/0x140 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334082] [<ffffffffa033d67c>] i915_pci_probe+0x3c/0x90 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334086] [<ffffffff813982b5>] local_pci_probe+0x45/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334088] [<ffffffff81399555>] ? pci_match_device+0xc5/0xd0 Mar 20 21:53:39 haswell01 kernel: [13457.334090] [<ffffffff81399679>] pci_device_probe+0xd9/0x130 Mar 20 21:53:39 haswell01 kernel: [13457.334093] [<ffffffff81484795>] driver_probe_device+0x125/0x3b0 Mar 20 21:53:39 haswell01 kernel: [13457.334095] [<ffffffff81484af3>] __driver_attach+0x93/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334098] [<ffffffff81484a60>] ? __device_attach+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334100] [<ffffffff81482703>] bus_for_each_dev+0x63/0xa0 Mar 20 21:53:39 haswell01 kernel: [13457.334102] [<ffffffff8148414e>] driver_attach+0x1e/0x20 Mar 20 21:53:39 haswell01 kernel: [13457.334104] [<ffffffff81483d30>] bus_add_driver+0x180/0x250 Mar 20 21:53:39 haswell01 kernel: [13457.334106] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334109] [<ffffffff81485174>] driver_register+0x64/0xf0 Mar 20 21:53:39 haswell01 kernel: [13457.334110] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334113] [<ffffffff81397c4c>] __pci_register_driver+0x4c/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334117] [<ffffffffa0215d5a>] drm_pci_init+0x11a/0x130 [drm] Mar 20 21:53:39 haswell01 kernel: [13457.334119] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff Mar 20 21:53:39 haswell01 kernel: [13457.334127] [<ffffffffa03fe066>] i915_init+0x66/0x68 [i915] Mar 20 21:53:39 haswell01 kernel: [13457.334130] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 Mar 20 21:53:39 haswell01 kernel: [13457.334133] [<ffffffff810589e3>] ? set_memory_nx+0x43/0x50 Mar 20 21:53:39 haswell01 kernel: [13457.334137] [<ffffffff810e0630>] load_module+0x2050/0x26f0 Mar 20 21:53:39 haswell01 kernel: [13457.334139] [<ffffffff810dbfa0>] ? store_uevent+0x40/0x40 Mar 20 21:53:39 haswell01 kernel: [13457.334141] [<ffffffff810e0e46>] SyS_finit_module+0x86/0xb0 Mar 20 21:53:39 haswell01 kernel: [13457.334144] [<ffffffff8171823f>] tracesys+0xe1/0xe6 Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff ff ff ff ff 7f 00 Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334166] RSP <ffff880406395af0> Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0 Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace e2598b1fc83a65fd ]---
From: Daniel Vetter daniel.vetter@ffwll.ch on behalf of Daniel Vetter daniel@ffwll.ch Sent: Tuesday, March 25, 2014 12:31 AM To: Dmitry Malkin Cc: dri-devel@lists.freedesktop.org Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race
On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add()
- drm_get_connector_name()
vs.
- anything that generates log messages involving DRM connectors
- drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Yeah, in my recent kerneldoc series I've noticed this too and added FIXME comments. There's a lot more than just those you've pointed out. The problem is that fixing these will be an awful lot of work since you need to add proper cleanup code (calling kfree()) to all the required places.
So a full subsystem wide code audit for every single use-site of these. -Daniel
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Mon, 24 Mar 2014, Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote:
Hello guys,
I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, and at some point I've found a non-deterministic kernel crash with a highly-variable iteration dependency -- 2 to 200 driver reload iterations.
The apparent race is over the shared internal string buffer in drm_get_connector_name(). It is mostly harmless, due to its results being mostly used for log output, but in at least one case -- drm_sysfs_connector_add() -- this leads to a more critical error.
Race scenario:
- drm_sysfs_connector_add()
- drm_get_connector_name()
vs.
- anything that generates log messages involving DRM connectors
- drm_get_connector_name()
(and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name.
Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs)
Yeah, in my recent kerneldoc series I've noticed this too and added FIXME comments. There's a lot more than just those you've pointed out. The problem is that fixing these will be an awful lot of work since you need to add proper cleanup code (calling kfree()) to all the required places.
So a full subsystem wide code audit for every single use-site of these.
It would be much easier and much much less error prone to add a name field (either a fixed size or kmalloced buffer) in drm_connector and drm_encoder structs, and initialize them in connector/encoder init. AFAICT the name does not change during the objects' lifetime. None of the use-sites would need to be changed.
The question is whether (number of connectors + number of encoders) * 32 bytes is too high a price to pay for the implementation simplicity. I think not.
Another alternative to returning kmalloced strings is:
diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c index 16ca28ed5ee8..85c8ed191daa 100644 --- a/drivers/gpu/drm/drm_crtc.c +++ b/drivers/gpu/drm/drm_crtc.c @@ -277,11 +277,10 @@ EXPORT_SYMBOL(drm_get_encoder_name); * * FIXME: This isn't really multithreading safe. */ -const char *drm_get_connector_name(const struct drm_connector *connector) +const char *drm_get_connector_name(char *buf, size_t bufsize, + const struct drm_connector *connector) { - static char buf[32]; - - snprintf(buf, 32, "%s-%d", + snprintf(buf, bufsize, "%s-%d", drm_connector_enum_list[connector->connector_type].name, connector->connector_type_id); return buf;
So the callers may use the kind of buffers they want. Something like this may be needed for drm_get_format_name() anyway.
BR, Jani.
dri-devel@lists.freedesktop.org