At least on an IBM Power 720, this check passes, but several piglit tests will reliably trigger GPU resets due to the ring buffer pointers not being updated. There's probably a better way to limit this to just affected machines though.
Signed-off-by: Adam Jackson ajax@redhat.com --- drivers/gpu/drm/radeon/r600_cp.c | 7 +++++++ drivers/gpu/drm/radeon/radeon_cp.c | 7 +++++++ drivers/gpu/drm/radeon/radeon_device.c | 4 ++-- drivers/gpu/drm/radeon/radeon_drv.c | 2 +- 4 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/radeon/r600_cp.c b/drivers/gpu/drm/radeon/r600_cp.c index 1c51c08..ef28532 100644 --- a/drivers/gpu/drm/radeon/r600_cp.c +++ b/drivers/gpu/drm/radeon/r600_cp.c @@ -552,6 +552,13 @@ static void r600_test_writeback(drm_radeon_private_t *dev_priv) dev_priv->writeback_works = 0; DRM_INFO("writeback test failed\n"); } +#if defined(__ppc__) || defined(__ppc64__) + /* the test might succeed on ppc, but it's usually not reliable */ + if (radeon_no_wb == -1) { + radeon_no_wb = 1; + DRM_INFO("not trusting writeback test due to arch quirk\n"); + } +#endif if (radeon_no_wb == 1) { dev_priv->writeback_works = 0; DRM_INFO("writeback forced off\n"); diff --git a/drivers/gpu/drm/radeon/radeon_cp.c b/drivers/gpu/drm/radeon/radeon_cp.c index efc4f64..a967b33 100644 --- a/drivers/gpu/drm/radeon/radeon_cp.c +++ b/drivers/gpu/drm/radeon/radeon_cp.c @@ -892,6 +892,13 @@ static void radeon_test_writeback(drm_radeon_private_t * dev_priv) dev_priv->writeback_works = 0; DRM_INFO("writeback test failed\n"); } +#if defined(__ppc__) || defined(__ppc64__) + /* the test might succeed on ppc, but it's usually not reliable */ + if (radeon_no_wb == -1) { + radeon_no_wb = 1; + DRM_INFO("not trusting writeback test due to arch quirk\n"); + } +#endif if (radeon_no_wb == 1) { dev_priv->writeback_works = 0; DRM_INFO("writeback forced off\n"); diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 1899738..524046e 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -322,8 +322,8 @@ int radeon_wb_init(struct radeon_device *rdev) /* disable event_write fences */ rdev->wb.use_event = false; /* disabled via module param */ - if (radeon_no_wb == 1) { - rdev->wb.enabled = false; + if (radeon_no_wb != -1) { + rdev->wb.enabled = !!radeon_no_wb; } else { if (rdev->flags & RADEON_IS_AGP) { /* often unreliable on AGP */ diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c index 094e7e5..04809d4 100644 --- a/drivers/gpu/drm/radeon/radeon_drv.c +++ b/drivers/gpu/drm/radeon/radeon_drv.c @@ -146,7 +146,7 @@ static inline void radeon_register_atpx_handler(void) {} static inline void radeon_unregister_atpx_handler(void) {} #endif
-int radeon_no_wb; +int radeon_no_wb = -1; int radeon_modeset = -1; int radeon_dynclks = -1; int radeon_r4xx_atom = 0;
On Mon, Jun 17, 2013 at 10:06 AM, Adam Jackson ajax@redhat.com wrote:
At least on an IBM Power 720, this check passes, but several piglit tests will reliably trigger GPU resets due to the ring buffer pointers not being updated. There's probably a better way to limit this to just affected machines though.
What radeon chips are you seeing this on? wb is more or less required on r6xx and newer and I'm not sure those generations will even work properly without writeback enabled these days. We force it to always be enabled on APUs and NI and newer asics. With KMS, wb encompasses more than just rptr writeback; it covers pretty much everything involving the GPU writing any status information to memory.
Alex
Signed-off-by: Adam Jackson ajax@redhat.com
drivers/gpu/drm/radeon/r600_cp.c | 7 +++++++ drivers/gpu/drm/radeon/radeon_cp.c | 7 +++++++ drivers/gpu/drm/radeon/radeon_device.c | 4 ++-- drivers/gpu/drm/radeon/radeon_drv.c | 2 +- 4 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/radeon/r600_cp.c b/drivers/gpu/drm/radeon/r600_cp.c index 1c51c08..ef28532 100644 --- a/drivers/gpu/drm/radeon/r600_cp.c +++ b/drivers/gpu/drm/radeon/r600_cp.c @@ -552,6 +552,13 @@ static void r600_test_writeback(drm_radeon_private_t *dev_priv) dev_priv->writeback_works = 0; DRM_INFO("writeback test failed\n"); } +#if defined(__ppc__) || defined(__ppc64__)
/* the test might succeed on ppc, but it's usually not reliable */
if (radeon_no_wb == -1) {
radeon_no_wb = 1;
DRM_INFO("not trusting writeback test due to arch quirk\n");
}
+#endif if (radeon_no_wb == 1) { dev_priv->writeback_works = 0; DRM_INFO("writeback forced off\n"); diff --git a/drivers/gpu/drm/radeon/radeon_cp.c b/drivers/gpu/drm/radeon/radeon_cp.c index efc4f64..a967b33 100644 --- a/drivers/gpu/drm/radeon/radeon_cp.c +++ b/drivers/gpu/drm/radeon/radeon_cp.c @@ -892,6 +892,13 @@ static void radeon_test_writeback(drm_radeon_private_t * dev_priv) dev_priv->writeback_works = 0; DRM_INFO("writeback test failed\n"); } +#if defined(__ppc__) || defined(__ppc64__)
/* the test might succeed on ppc, but it's usually not reliable */
if (radeon_no_wb == -1) {
radeon_no_wb = 1;
DRM_INFO("not trusting writeback test due to arch quirk\n");
}
+#endif if (radeon_no_wb == 1) { dev_priv->writeback_works = 0; DRM_INFO("writeback forced off\n"); diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 1899738..524046e 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -322,8 +322,8 @@ int radeon_wb_init(struct radeon_device *rdev) /* disable event_write fences */ rdev->wb.use_event = false; /* disabled via module param */
if (radeon_no_wb == 1) {
rdev->wb.enabled = false;
if (radeon_no_wb != -1) {
rdev->wb.enabled = !!radeon_no_wb; } else { if (rdev->flags & RADEON_IS_AGP) { /* often unreliable on AGP */
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c index 094e7e5..04809d4 100644 --- a/drivers/gpu/drm/radeon/radeon_drv.c +++ b/drivers/gpu/drm/radeon/radeon_drv.c @@ -146,7 +146,7 @@ static inline void radeon_register_atpx_handler(void) {} static inline void radeon_unregister_atpx_handler(void) {} #endif
-int radeon_no_wb; +int radeon_no_wb = -1; int radeon_modeset = -1; int radeon_dynclks = -1; int radeon_r4xx_atom = 0; -- 1.8.2.1
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, 2013-06-17 at 11:04 -0400, Alex Deucher wrote:
On Mon, Jun 17, 2013 at 10:06 AM, Adam Jackson ajax@redhat.com wrote:
At least on an IBM Power 720, this check passes, but several piglit tests will reliably trigger GPU resets due to the ring buffer pointers not being updated. There's probably a better way to limit this to just affected machines though.
What radeon chips are you seeing this on? wb is more or less required on r6xx and newer and I'm not sure those generations will even work properly without writeback enabled these days. We force it to always be enabled on APUs and NI and newer asics. With KMS, wb encompasses more than just rptr writeback; it covers pretty much everything involving the GPU writing any status information to memory.
FirePro 2270, at least. Booting with no_wb=1, piglit runs to completion with no GPU resets or IB submission failures. Booting without no_wb, the following piglits go from pass to fail, all complaining that the kernel rejected CS:
no-wb wb (info) (info) All 7598/9587 7403/9553 glean 362/385 361/385 pointAtten pass fail texture_srgb pass fail shaders 477/533 441/533 glsl-algebraic-add-add-1 pass fail glsl-algebraic-add-add-2 pass fail glsl-algebraic-add-add-3 pass fail glsl-algebraic-sub-zero-3 pass fail glsl-algebraic-sub-zero-4 pass fail glsl-complex-subscript pass fail glsl-copy-propagation-if-1 pass fail glsl-copy-propagation-if-2 pass fail glsl-copy-propagation-if-3 pass fail glsl-copy-propagation-vector-indexing pass fail glsl-fs-atan-2 pass fail glsl-fs-dot-vec2-2 pass fail glsl-fs-log2 pass fail glsl-fs-main-return pass fail glsl-fs-max-3 pass fail glsl-fs-min-2 pass fail glsl-fs-min-3 pass fail glsl-fs-statevar-call pass fail glsl-fs-struct-equal pass fail glsl-function-chain16 pass fail glsl-implicit-conversion-02 pass fail glsl-inout-struct-01 pass fail glsl-inout-struct-02 pass fail glsl-link-varying-TexCoord pass fail glsl-link-varyings-2 pass fail glsl-uniform-initializer-4 pass fail glsl-uniform-initializer-6 pass fail glsl-uniform-initializer-7 pass fail glsl-vs-abs-neg-with-intermediate pass fail glsl-vs-clamp-1 pass fail glsl-vs-deadcode-1 pass fail glsl-vs-deadcode-2 pass fail glsl-vs-f2b pass fail glsl-vs-position-outval pass fail link-uniform-array-size pass fail loopfunc pass fail spec 5921/7801 5763/7767 !OpenGL 1.1 118/229 101/219 depthstencil-default_fb-clear pass fail getteximage-simple pass fail texwrap formats 27/50 12/40 GL_LUMINANCE12 pass fail GL_LUMINANCE16 pass fail GL_LUMINANCE4 pass fail GL_LUMINANCE4_ALPHA4 pass fail GL_LUMINANCE8 pass fail !OpenGL 1.4 11/13 9/13 triangle-rasterization pass fail triangle-rasterization-fbo pass fail ARB_depth_buffer_float 9/63 6/63 fbo-depth-GL_DEPTH32F_STENCIL8-clear pass fail fbo-depth-GL_DEPTH_COMPONENT32F-clear pass fail fbo-stencil-GL_DEPTH32F_STENCIL8-clear pass fail ARB_depth_texture 11/53 10/53 fbo-depth-GL_DEPTH_COMPONENT32-clear pass fail ARB_fragment_program 23/28 22/28 kil-swizzle pass fail ARB_framebuffer_object 14/26 13/26 fbo-deriv pass fail ARB_framebuffer_sRGB 80/81 65/81 blit renderbuffer linear downsample disabled pass fail blit renderbuffer linear msaa enabled pass fail blit renderbuffer linear single_sampled enabled pass fail blit renderbuffer linear_to_srgb scaled disabled pass fail blit renderbuffer srgb msaa disabled pass fail blit renderbuffer srgb msaa enabled pass fail blit renderbuffer srgb scaled enabled pass fail blit renderbuffer srgb single_sampled enabled pass fail blit renderbuffer srgb upsample enabled pass fail blit renderbuffer srgb_to_linear msaa disabled pass fail blit renderbuffer srgb_to_linear msaa enabled pass fail blit texture linear scaled enabled pass fail blit texture linear_to_srgb upsample disabled pass fail blit texture srgb msaa disabled pass fail blit texture srgb msaa enabled pass fail ARB_map_buffer_range 8/8 6/8 MAP_INVALIDATE_RANGE_BIT decrement-offset pass fail MAP_INVALIDATE_RANGE_BIT offset=0 pass fail ARB_sampler_objects 1/4 0/4 sampler-objects pass fail ARB_shading_language_packing 10/10 7/10 execution 10/10 7/10 built-in-functions 10/10 7/10 const-packSnorm4x8 pass fail const-unpackSnorm2x16 pass fail const-unpackUnorm2x16 pass fail ARB_texture_compression 25/40 22/40 texwrap formats bordercolor-swizzled 3/6 0/6 GL_COMPRESSED_INTENSITY, swizzled, border color only pass fail GL_COMPRESSED_LUMINANCE, swizzled, border color only pass fail GL_COMPRESSED_LUMINANCE_ALPHA, swizzled, border color pass fail only ARB_texture_cube_map 7/8 6/8 cubemap pass fail ARB_texture_rectangle 11/21 10/21 texrect_simple_arb_texrect pass fail ARB_texture_rg 42/113 19/97 texwrap formats-int 24/28 0/12 GL_R16UI pass fail GL_R32UI pass fail GL_R8I pass fail GL_R8UI pass fail GL_RG16UI pass fail GL_RG32UI pass fail GL_RG8I pass fail GL_RG8UI pass fail ARB_texture_storage 1/1 0/1 texture-storage pass fail ARB_timer_query 1/3 0/3 query GL_TIMESTAMP pass fail ATI_draw_buffers 3/3 2/3 arbfp-no-option pass fail EXT_framebuffer_multisample 267/287 259/287 bitmap 4 pass fail bitmap 6 pass fail bitmap 8 pass fail clear 2 depth pass fail clear 6 color pass fail clear 6 depth pass fail clear 6 stencil pass fail draw-buffers-alpha-to-one 4 pass fail EXT_framebuffer_object 175/289 173/289 fbo-fragcoord2 pass fail fbo-generatemipmap-filtering pass fail fbo-generatemipmap-formats 2/76 1/76 GL_LUMINANCE4_ALPHA4 pass fail fbo-readpixels-depth-formats 14/24 16/24 GL_DEPTH24_STENCIL8_EXT 3/4 2/4 GL_UNSIGNED_INT pass fail fbo-stencil-GL_STENCIL_INDEX16-clear pass fail EXT_packed_depth_stencil 13/58 11/58 fbo-depthstencil-GL_DEPTH24_STENCIL8-clear pass fail fbo-stencil-GL_DEPTH24_STENCIL8-clear pass fail EXT_texture_compression_rgtc 31/39 11/31 fbo-generatemipmap-formats 8/8 4/8 GL_COMPRESSED_RED pass fail GL_COMPRESSED_RED_GREEN_RGTC2_EXT pass fail GL_COMPRESSED_RED_RGTC1_EXT pass fail GL_COMPRESSED_RG pass fail texwrap formats 12/12 0/4 GL_COMPRESSED_RED_RGTC1 pass fail GL_COMPRESSED_RG_RGTC2 pass fail GL_COMPRESSED_SIGNED_RED_RGTC1 pass fail GL_COMPRESSED_SIGNED_RG_RGTC2 pass fail texwrap formats bordercolor 4/4 0/4 GL_COMPRESSED_RED_RGTC1, border color only pass fail GL_COMPRESSED_RG_RGTC2, border color only pass fail GL_COMPRESSED_SIGNED_RED_RGTC1, border color only pass fail GL_COMPRESSED_SIGNED_RG_RGTC2, border color only pass fail EXT_transform_feedback 45/279 44/279 discard-bitmap pass fail glsl-1.10 1540/1636 1529/1636 execution 1295/1391 1286/1391 fs-inline-notequal pass fail fs-saturate-pow pass fail maximums 12/12 8/12 gl_MaxCombinedTextureImageUnits pass fail gl_MaxFragmentUniformComponents pass fail gl_MaxTextureImageUnits pass fail gl_MaxVertexAttribs pass fail vs-mat2-array-assignment pass fail vs-saturate-exp2 pass fail vs-saturate-pow pass fail linker 18/18 16/18 access-builtin-global-from-fn-unknown-to-main pass fail override-builtin-uniform-01 pass fail glsl-1.20 2124/2183 2097/2183 execution 788/845 761/845 maximums 12/12 7/12 gl_MaxClipPlanes pass fail gl_MaxCombinedTextureImageUnits pass fail gl_MaxTextureCoords pass fail gl_MaxVaryingFloats pass fail gl_MaxVertexTextureImageUnits pass fail uniform-initializer 64/64 44/64 fs-bool-set-by-other-stage pass fail fs-float-array pass fail fs-float-from-const pass fail fs-float-set-by-other-stage pass fail fs-int pass fail fs-int-from-const pass fail fs-int-set-by-other-stage pass fail fs-mat3-set-by-other-stage pass fail fs-mat4-from-const pass fail fs-mat4-set-by-API pass fail fs-mat4-set-by-other-stage pass fail fs-structure-array pass fail vs-bool-from-const pass fail vs-float pass fail vs-float-array pass fail vs-mat2 pass fail vs-mat3 pass fail vs-mat4 pass fail vs-mat4-from-const pass fail vs-mat4-set-by-API pass fail vs-all-equal-bool-array pass fail vs-assign-varied-struct pass fail glsl-1.30 468/604 455/604 execution 465/601 452/601 clipping 20/20 18/20 fs-clip-distance-interpolated pass fail vs-clip-distance-explicitly-sized pass fail fs-discard-exit-1 pass fail fs-discard-exit-2 pass fail fs-increment-int pass fail fs-multiply-const-ivec4 pass fail maximums 13/13 9/13 gl_MaxClipPlanes pass fail gl_MaxCombinedTextureImageUnits pass fail gl_MaxTextureImageUnits pass fail gl_MaxVaryingFloats pass fail qualifiers 1/1 0/1 vs-out-conversion-ivec4-to-vec4 pass fail uniform-initializer 8/8 6/8 fs-uint-from-const pass fail vs-uint pass fail
- ajax
On Mon, Jun 17, 2013 at 12:07 PM, Adam Jackson ajax@redhat.com wrote:
On Mon, 2013-06-17 at 11:04 -0400, Alex Deucher wrote:
On Mon, Jun 17, 2013 at 10:06 AM, Adam Jackson ajax@redhat.com wrote:
At least on an IBM Power 720, this check passes, but several piglit tests will reliably trigger GPU resets due to the ring buffer pointers not being updated. There's probably a better way to limit this to just affected machines though.
What radeon chips are you seeing this on? wb is more or less required on r6xx and newer and I'm not sure those generations will even work properly without writeback enabled these days. We force it to always be enabled on APUs and NI and newer asics. With KMS, wb encompasses more than just rptr writeback; it covers pretty much everything involving the GPU writing any status information to memory.
FirePro 2270, at least. Booting with no_wb=1, piglit runs to completion with no GPU resets or IB submission failures. Booting without no_wb, the following piglits go from pass to fail, all complaining that the kernel rejected CS:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
Alex
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
Can you guys do something like that ? Then we need the analyzer on and/or the lab guys to look at the fabric trace & PHB trace.
Ben.
On 11/07/2013 08:29 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
I can look into that.
Thanks,
Kleber
Can you guys do something like that ? Then we need the analyzer on and/or the lab guys to look at the fabric trace & PHB trace.
Ben.
On Fri, 2013-11-08 at 11:43 -0200, Kleber Sacilotto de Souza wrote:
On 11/07/2013 08:29 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
I can look into that.
Any news ?
Ben.
Thanks,
Kleber
Can you guys do something like that ? Then we need the analyzer on and/or the lab guys to look at the fabric trace & PHB trace.
Ben.
On 11/24/2013 09:15 PM, Benjamin Herrenschmidt wrote:
On Fri, 2013-11-08 at 11:43 -0200, Kleber Sacilotto de Souza wrote:
On 11/07/2013 08:29 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
I can look into that.
Any news ?
Only now I've got the bandwidth to actually take a look at this. I will start to write some code and let you guys know of the results.
Thanks,
Kleber
Ben.
Thanks,
Kleber
Can you guys do something like that ? Then we need the analyzer on and/or the lab guys to look at the fabric trace & PHB trace.
Ben.
On 11/25/2013 10:11 PM, Kleber Sacilotto de Souza wrote:
On 11/24/2013 09:15 PM, Benjamin Herrenschmidt wrote:
On Fri, 2013-11-08 at 11:43 -0200, Kleber Sacilotto de Souza wrote:
On 11/07/2013 08:29 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
I was not able to reproduce the issue with this method, even after a weekend run.
However, doing some more investigation it seems the problem is here, where we read the ring rptr:
u32 radeon_ring_generic_get_rptr(struct radeon_device *rdev, struct radeon_ring *ring) { u32 rptr;
if (rdev->wb.enabled) rptr = le32_to_cpu(rdev->wb.wb[ring->rptr_offs/4]); else rptr = RREG32(ring->rptr_reg);
return rptr; }
I realized that the DMA'ed rptr value has always the opposite byte order from the MMIO value. Since RREG32 already returns the register value on the CPU byte order, it seems we don't need to byte-swap the DMA'ed value. If I remove the le32_to_cpu() call and use the DMA'ed value directly, I don't get the IB scheduling failures and piglit results are the same as with writeback disabled.
Is the adapter chipset swapping the bytes before doing the DMA to a big-endian host?
On Wed, Dec 4, 2013 at 5:16 PM, Kleber Sacilotto de Souza klebers@linux.vnet.ibm.com wrote:
On 11/25/2013 10:11 PM, Kleber Sacilotto de Souza wrote:
On 11/24/2013 09:15 PM, Benjamin Herrenschmidt wrote:
On Fri, 2013-11-08 at 11:43 -0200, Kleber Sacilotto de Souza wrote:
On 11/07/2013 08:29 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
Weird. I wonder if there is an issue with cache snoops on PPC. We currently use the gart in cached mode (GPU snoops CPU cache) with cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
I was not able to reproduce the issue with this method, even after a weekend run.
However, doing some more investigation it seems the problem is here, where we read the ring rptr:
u32 radeon_ring_generic_get_rptr(struct radeon_device *rdev, struct radeon_ring *ring) { u32 rptr;
if (rdev->wb.enabled) rptr = le32_to_cpu(rdev->wb.wb[ring->rptr_offs/4]); else rptr = RREG32(ring->rptr_reg); return rptr;
}
I realized that the DMA'ed rptr value has always the opposite byte order from the MMIO value. Since RREG32 already returns the register value on the CPU byte order, it seems we don't need to byte-swap the DMA'ed value. If I remove the le32_to_cpu() call and use the DMA'ed value directly, I don't get the IB scheduling failures and piglit results are the same as with writeback disabled.
Is the adapter chipset swapping the bytes before doing the DMA to a big-endian host?
Setting CP_RB_CNTL.BUF_SWAP causes the CP to use the selected byte swapping for just about everything accessed by the CP (rptr writeback, indirect buffers, etc.). Looks like the DMA ring supports and enables rptr writeback as well (DMA_RB_CNTL.DMA_RPTR_WRITEBACK_SWAP_ENABLE) so I think we can drop the swapping of the rptr writeback.
Alex
-- Kleber Sacilotto de Souza IBM Linux Technology Center
On Wed, Dec 4, 2013 at 6:56 PM, Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Dec 4, 2013 at 5:16 PM, Kleber Sacilotto de Souza klebers@linux.vnet.ibm.com wrote:
On 11/25/2013 10:11 PM, Kleber Sacilotto de Souza wrote:
On 11/24/2013 09:15 PM, Benjamin Herrenschmidt wrote:
On Fri, 2013-11-08 at 11:43 -0200, Kleber Sacilotto de Souza wrote:
On 11/07/2013 08:29 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-17 at 18:57 -0400, Alex Deucher wrote:
> Weird. I wonder if there is an issue with cache snoops on PPC. We > currently use the gart in cached mode (GPU snoops CPU cache) with > cached pages. I wonder if we need to use uncached pages on PPC.
There is no such issue and no known bugs with DMA writes on those PCIe host bridges (and they do get hammered pretty bad here).
This needs further investigation by the lab/hw guys to find out what's actually happening on the bus and the host bridge.
Thadeu, Kleber: Jerome suggested writing a test case in userspace that continuously writes to a spare scratch register (thus triggering the corresponding writeback DMA) and checks the memory location to compare the writeback value (using a debugfs file for example, or mmap).
I was not able to reproduce the issue with this method, even after a weekend run.
However, doing some more investigation it seems the problem is here, where we read the ring rptr:
u32 radeon_ring_generic_get_rptr(struct radeon_device *rdev, struct radeon_ring *ring) { u32 rptr;
if (rdev->wb.enabled) rptr = le32_to_cpu(rdev->wb.wb[ring->rptr_offs/4]); else rptr = RREG32(ring->rptr_reg); return rptr;
}
I realized that the DMA'ed rptr value has always the opposite byte order from the MMIO value. Since RREG32 already returns the register value on the CPU byte order, it seems we don't need to byte-swap the DMA'ed value. If I remove the le32_to_cpu() call and use the DMA'ed value directly, I don't get the IB scheduling failures and piglit results are the same as with writeback disabled.
Is the adapter chipset swapping the bytes before doing the DMA to a big-endian host?
Setting CP_RB_CNTL.BUF_SWAP causes the CP to use the selected byte swapping for just about everything accessed by the CP (rptr writeback, indirect buffers, etc.). Looks like the DMA ring supports and enables rptr writeback as well (DMA_RB_CNTL.DMA_RPTR_WRITEBACK_SWAP_ENABLE) so I think we can drop the swapping of the rptr writeback.
Obvious patch attached.
Alex
Alex
-- Kleber Sacilotto de Souza IBM Linux Technology Center
On Wed, 2013-12-04 at 19:05 -0500, Alex Deucher wrote:
Setting CP_RB_CNTL.BUF_SWAP causes the CP to use the selected byte swapping for just about everything accessed by the CP (rptr writeback, indirect buffers, etc.). Looks like the DMA ring supports and enables rptr writeback as well (DMA_RB_CNTL.DMA_RPTR_WRITEBACK_SWAP_ENABLE) so I think we can drop the swapping of the rptr writeback.
Obvious patch attached.
This works all the way back to r300 ?
Cheers, Ben.
On Don, 2013-12-05 at 12:39 +1100, Benjamin Herrenschmidt wrote:
On Wed, 2013-12-04 at 19:05 -0500, Alex Deucher wrote:
Setting CP_RB_CNTL.BUF_SWAP causes the CP to use the selected byte swapping for just about everything accessed by the CP (rptr writeback, indirect buffers, etc.). Looks like the DMA ring supports and enables rptr writeback as well (DMA_RB_CNTL.DMA_RPTR_WRITEBACK_SWAP_ENABLE) so I think we can drop the swapping of the rptr writeback.
Obvious patch attached.
This works all the way back to r300 ?
I don't think so, as I have writeback working without this patch on the RV350 in this PowerBook. So I think this function needs to be split, probably between R600 and older.
Also, there's more code at least potentially affected by this, e.g. in:
* cik_compute_ring_get_rptr(), cik_compute_ring_get_wptr(), cik_compute_ring_set_wptr(), cik_get_ih_wptr() * si_get_ih_wptr() * evergreen_get_ih_wptr() * r600_get_ih_wptr() * radeon_fence_write(), radeon_fence_read() * radeon_ring_backup()
On Thu, 2013-12-05 at 11:29 +0900, Michel Dänzer wrote:
On Don, 2013-12-05 at 12:39 +1100, Benjamin Herrenschmidt wrote:
On Wed, 2013-12-04 at 19:05 -0500, Alex Deucher wrote:
Setting CP_RB_CNTL.BUF_SWAP causes the CP to use the selected byte swapping for just about everything accessed by the CP (rptr writeback, indirect buffers, etc.). Looks like the DMA ring supports and enables rptr writeback as well (DMA_RB_CNTL.DMA_RPTR_WRITEBACK_SWAP_ENABLE) so I think we can drop the swapping of the rptr writeback.
Obvious patch attached.
This works all the way back to r300 ?
I don't think so, as I have writeback working without this patch on the RV350 in this PowerBook. So I think this function needs to be split, probably between R600 and older.
Or can we tell it to not swap (setup code) and continue doing le32_to_cpu in the kernel ? Sounds better to me.
Also, there's more code at least potentially affected by this, e.g. in:
* cik_compute_ring_get_rptr(), cik_compute_ring_get_wptr(), cik_compute_ring_set_wptr(), cik_get_ih_wptr() * si_get_ih_wptr() * evergreen_get_ih_wptr() * r600_get_ih_wptr() * radeon_fence_write(), radeon_fence_read() * radeon_ring_backup()
Yeah I'd say just don't swap, write LE to memory and let the kernel use the right leXX_to_cpu.
HW swapping is evil :-)
Cheers, Ben.
On Wed, Dec 4, 2013 at 11:06 PM, Benjamin Herrenschmidt benh@kernel.crashing.org wrote:
On Thu, 2013-12-05 at 11:29 +0900, Michel Dänzer wrote:
On Don, 2013-12-05 at 12:39 +1100, Benjamin Herrenschmidt wrote:
On Wed, 2013-12-04 at 19:05 -0500, Alex Deucher wrote:
Setting CP_RB_CNTL.BUF_SWAP causes the CP to use the selected byte swapping for just about everything accessed by the CP (rptr writeback, indirect buffers, etc.). Looks like the DMA ring supports and enables rptr writeback as well (DMA_RB_CNTL.DMA_RPTR_WRITEBACK_SWAP_ENABLE) so I think we can drop the swapping of the rptr writeback.
Obvious patch attached.
This works all the way back to r300 ?
I don't think so, as I have writeback working without this patch on the RV350 in this PowerBook. So I think this function needs to be split, probably between R600 and older.
Or can we tell it to not swap (setup code) and continue doing le32_to_cpu in the kernel ? Sounds better to me.
Also, there's more code at least potentially affected by this, e.g. in:
* cik_compute_ring_get_rptr(), cik_compute_ring_get_wptr(), cik_compute_ring_set_wptr(), cik_get_ih_wptr() * si_get_ih_wptr() * evergreen_get_ih_wptr() * r600_get_ih_wptr() * radeon_fence_write(), radeon_fence_read() * radeon_ring_backup()
Yeah I'd say just don't swap, write LE to memory and let the kernel use the right leXX_to_cpu.
HW swapping is evil :-)
Well, we'd need to start swapping indirect buffers and the ring as well then which would get tricky as the CP at least does not have separate swapping controls for different things. Probably easier to fix up as appropriate for different asic families. We have function pointers for the rtpr and wptr fetchers now, so we can do this pretty cleanly.
Alex
Cheers, Ben.
On 12/05/2013 12:42 PM, Alex Deucher wrote:
Well, we'd need to start swapping indirect buffers and the ring as well then which would get tricky as the CP at least does not have separate swapping controls for different things. Probably easier to fix up as appropriate for different asic families. We have function pointers for the rtpr and wptr fetchers now, so we can do this pretty cleanly.
Alex
Alex,
Are you going to send a patch to fix this?
If not, I don't have the knowledge of which asic families will need this fix, but if you inform me what needs to be done I can write the patch.
Thanks,
On Fri, Dec 6, 2013 at 8:58 AM, Kleber Sacilotto de Souza klebers@linux.vnet.ibm.com wrote:
On 12/05/2013 12:42 PM, Alex Deucher wrote:
Well, we'd need to start swapping indirect buffers and the ring as well then which would get tricky as the CP at least does not have separate swapping controls for different things. Probably easier to fix up as appropriate for different asic families. We have function pointers for the rtpr and wptr fetchers now, so we can do this pretty cleanly.
Alex
Alex,
Are you going to send a patch to fix this?
If not, I don't have the knowledge of which asic families will need this fix, but if you inform me what needs to be done I can write the patch.
I'll try and send a patch out in the next few days.
Alex
Thanks,
-- Kleber Sacilotto de Souza IBM Linux Technology Center
On Fri, Dec 6, 2013 at 10:59 AM, Alex Deucher alexdeucher@gmail.com wrote:
On Fri, Dec 6, 2013 at 8:58 AM, Kleber Sacilotto de Souza klebers@linux.vnet.ibm.com wrote:
On 12/05/2013 12:42 PM, Alex Deucher wrote:
Well, we'd need to start swapping indirect buffers and the ring as well then which would get tricky as the CP at least does not have separate swapping controls for different things. Probably easier to fix up as appropriate for different asic families. We have function pointers for the rtpr and wptr fetchers now, so we can do this pretty cleanly.
Alex
Alex,
Are you going to send a patch to fix this?
If not, I don't have the knowledge of which asic families will need this fix, but if you inform me what needs to be done I can write the patch.
I'll try and send a patch out in the next few days.
Patch attached. Compile tested only at the moment.
Alex
Alex
Thanks,
-- Kleber Sacilotto de Souza IBM Linux Technology Center
On Mon, 2013-12-09 at 19:48 -0500, Alex Deucher wrote:
-u32 cik_compute_ring_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+u32 cik_compute_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ u32 wptr;
if (rdev->wb.enabled) {
wptr = le32_to_cpu(rdev->wb.wb[ring->wptr_offs/4]);
wptr = rdev->wb.wb[ring->wptr_offs/4]; } else { mutex_lock(&rdev->srbm_mutex); cik_srbm_select(rdev, ring->me, ring->pipe,
ring->queue, 0); @@ -4053,8 +4081,8 @@ u32 cik_compute_ring_get_wptr(struct radeon_device *rdev, return wptr; }
-void cik_compute_ring_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+void cik_compute_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ rdev->wb.wb[ring->wptr_offs/4] = cpu_to_le32(ring->wptr);
I think this cpu_to_le32() needs to be dropped as well to match cik_compute_ring_get_wptr().
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index b1f990d..e7c02a7 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -779,13 +779,11 @@ struct radeon_ring { volatile uint32_t *ring; unsigned rptr; unsigned rptr_offs;
unsigned rptr_reg; unsigned rptr_save_reg; u64 next_rptr_gpu_addr; volatile u32 *next_rptr_cpu_addr; unsigned wptr; unsigned wptr_old;
unsigned wptr_reg;
What's the motivation for removing these? Seems like keeping them would allow keeping this patch and the resulting code smaller (fewer function variants) and cleaner (no if/else/... for the register offsets in the variants).
On Mon, Dec 9, 2013 at 9:20 PM, Michel Dänzer michel@daenzer.net wrote:
On Mon, 2013-12-09 at 19:48 -0500, Alex Deucher wrote:
-u32 cik_compute_ring_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+u32 cik_compute_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ u32 wptr;
if (rdev->wb.enabled) {
wptr = le32_to_cpu(rdev->wb.wb[ring->wptr_offs/4]);
wptr = rdev->wb.wb[ring->wptr_offs/4]; } else { mutex_lock(&rdev->srbm_mutex); cik_srbm_select(rdev, ring->me, ring->pipe,
ring->queue, 0); @@ -4053,8 +4081,8 @@ u32 cik_compute_ring_get_wptr(struct radeon_device *rdev, return wptr; }
-void cik_compute_ring_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+void cik_compute_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ rdev->wb.wb[ring->wptr_offs/4] = cpu_to_le32(ring->wptr);
I think this cpu_to_le32() needs to be dropped as well to match cik_compute_ring_get_wptr().
whoops, yeah, missed that one.
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index b1f990d..e7c02a7 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -779,13 +779,11 @@ struct radeon_ring { volatile uint32_t *ring; unsigned rptr; unsigned rptr_offs;
unsigned rptr_reg; unsigned rptr_save_reg; u64 next_rptr_gpu_addr; volatile u32 *next_rptr_cpu_addr; unsigned wptr; unsigned wptr_old;
unsigned wptr_reg;
What's the motivation for removing these? Seems like keeping them would allow keeping this patch and the resulting code smaller (fewer function variants) and cleaner (no if/else/... for the register offsets in the variants).
I was trying to remove the asic specific info out of the ring struct and into the asic specific functions.
Alex
On Tue, Dec 10, 2013 at 10:04 AM, Alex Deucher alexdeucher@gmail.com wrote:
On Mon, Dec 9, 2013 at 9:20 PM, Michel Dänzer michel@daenzer.net wrote:
On Mon, 2013-12-09 at 19:48 -0500, Alex Deucher wrote:
-u32 cik_compute_ring_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+u32 cik_compute_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ u32 wptr;
if (rdev->wb.enabled) {
wptr = le32_to_cpu(rdev->wb.wb[ring->wptr_offs/4]);
wptr = rdev->wb.wb[ring->wptr_offs/4]; } else { mutex_lock(&rdev->srbm_mutex); cik_srbm_select(rdev, ring->me, ring->pipe,
ring->queue, 0); @@ -4053,8 +4081,8 @@ u32 cik_compute_ring_get_wptr(struct radeon_device *rdev, return wptr; }
-void cik_compute_ring_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+void cik_compute_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ rdev->wb.wb[ring->wptr_offs/4] = cpu_to_le32(ring->wptr);
I think this cpu_to_le32() needs to be dropped as well to match cik_compute_ring_get_wptr().
whoops, yeah, missed that one.
Updated patch attached.
Alex
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index b1f990d..e7c02a7 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -779,13 +779,11 @@ struct radeon_ring { volatile uint32_t *ring; unsigned rptr; unsigned rptr_offs;
unsigned rptr_reg; unsigned rptr_save_reg; u64 next_rptr_gpu_addr; volatile u32 *next_rptr_cpu_addr; unsigned wptr; unsigned wptr_old;
unsigned wptr_reg;
What's the motivation for removing these? Seems like keeping them would allow keeping this patch and the resulting code smaller (fewer function variants) and cleaner (no if/else/... for the register offsets in the variants).
I was trying to remove the asic specific info out of the ring struct and into the asic specific functions.
Alex
On 12/10/2013 01:12 PM, Alex Deucher wrote:
On Tue, Dec 10, 2013 at 10:04 AM, Alex Deucher alexdeucher@gmail.com wrote:
On Mon, Dec 9, 2013 at 9:20 PM, Michel Dänzer michel@daenzer.net wrote:
On Mon, 2013-12-09 at 19:48 -0500, Alex Deucher wrote:
-u32 cik_compute_ring_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+u32 cik_compute_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ u32 wptr;
if (rdev->wb.enabled) {
wptr = le32_to_cpu(rdev->wb.wb[ring->wptr_offs/4]);
wptr = rdev->wb.wb[ring->wptr_offs/4]; } else { mutex_lock(&rdev->srbm_mutex); cik_srbm_select(rdev, ring->me, ring->pipe,
ring->queue, 0); @@ -4053,8 +4081,8 @@ u32 cik_compute_ring_get_wptr(struct radeon_device *rdev, return wptr; }
-void cik_compute_ring_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+void cik_compute_set_wptr(struct radeon_device *rdev,
{ rdev->wb.wb[ring->wptr_offs/4] = cpu_to_le32(ring->wptr);struct radeon_ring *ring)
I think this cpu_to_le32() needs to be dropped as well to match cik_compute_ring_get_wptr().
whoops, yeah, missed that one.
Updated patch attached.
Alex
Hi, Alex.
Has this patch been accepted upstream? I've tested it successfully on a ppc64 system with a FirePro 2270 adapter.
Thanks,
On Thu, Jan 2, 2014 at 3:54 PM, Kleber Sacilotto de Souza klebers@linux.vnet.ibm.com wrote:
On 12/10/2013 01:12 PM, Alex Deucher wrote:
On Tue, Dec 10, 2013 at 10:04 AM, Alex Deucher alexdeucher@gmail.com wrote:
On Mon, Dec 9, 2013 at 9:20 PM, Michel Dänzer michel@daenzer.net wrote:
On Mon, 2013-12-09 at 19:48 -0500, Alex Deucher wrote:
-u32 cik_compute_ring_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+u32 cik_compute_get_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
{ u32 wptr;
if (rdev->wb.enabled) {
wptr = le32_to_cpu(rdev->wb.wb[ring->wptr_offs/4]);
wptr = rdev->wb.wb[ring->wptr_offs/4]; } else { mutex_lock(&rdev->srbm_mutex); cik_srbm_select(rdev, ring->me, ring->pipe,
ring->queue, 0); @@ -4053,8 +4081,8 @@ u32 cik_compute_ring_get_wptr(struct radeon_device *rdev, return wptr; }
-void cik_compute_ring_set_wptr(struct radeon_device *rdev,
struct radeon_ring *ring)
+void cik_compute_set_wptr(struct radeon_device *rdev,
{ rdev->wb.wb[ring->wptr_offs/4] = cpu_to_le32(ring->wptr);struct radeon_ring *ring)
I think this cpu_to_le32() needs to be dropped as well to match cik_compute_ring_get_wptr().
whoops, yeah, missed that one.
Updated patch attached.
Alex
Hi, Alex.
Has this patch been accepted upstream? I've tested it successfully on a ppc64 system with a FirePro 2270 adapter.
Not yet, I was planning to upstream it for 3.14 unless there are objections. http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.14-wip
Alex
Thanks,
-- Kleber Sacilotto de Souza IBM Linux Technology Center
dri-devel@lists.freedesktop.org