https://bugs.freedesktop.org/show_bug.cgi?id=100105
Bug ID: 100105
Summary: Make Theano OpenCL support work on Clover and RadeonSI
Product: Mesa
Version: git
Hardware: Other
OS: All
Status: NEW
Severity: major
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: vedran@miletic.net
QA Contact: dri-devel@lists.freedesktop.org
$ DEVICE="opencl0:0" python -c "import pygpu;pygpu.test()"
pygpu is installed in /usr/lib64/python2.7/site-packages/pygpu-0.6.2-py2.7-linux-x86_64.egg/pygpu
NumPy version 1.11.2
NumPy relaxed strides checking option: False
NumPy is installed in /usr/lib64/python2.7/site-packages/numpy
Python version 2.7.13 (default, Jan 12 2017, 17:59:37) [GCC 6.3.1 20161221 (Red Hat 6.3.1-1)]
nose version 1.3.7
*** Testing for AMD FIJI (DRM 3.8.0 / 4.9.13-200.fc25.x86_64, LLVM 5.0.0)
========================================================
AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = AMD FIJI (DRM 3.8.0 / 4.9.13-200.fc25.x86_64, LLVM 5.0.0)
error = -43
memory pattern = Register accumulation based swap, computing kernel generator
Subproblem dimensions: dims[0].itemY = 32, dims[0].itemX = 32, dims[0].y = 32, dims[0].x = 32, dims[0].bwidth = 64; dims[1].itemY = 4, dims[1].itemX = 4, dims[1].y = 4, dims[1].x = 4, dims[1].bwidth = 8;
Parallelism granularity: pgran->wgDim = 1, pgran->wgSize[0] = 64, pgran->wgSize[1] = 1, pgran->wfSize = 64
Kernel extra flags: 369130144
Source:
#ifdef DOUBLE_PRECISION
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
#endif

__kernel void Sdot_kernel( __global float *_X, __global float *_Y, __global float *scratchBuff,
                           uint N, uint offx, int incx, uint offy, int incy, int doConj )
{
    __global float *X = _X + offx;
    __global float *Y = _Y + offy;
    float dotP = (float) 0.0;

    if ( incx < 0 ) {
        X = X + (N - 1) * abs(incx);
    }
    if ( incy < 0 ) {
        Y = Y + (N - 1) * abs(incy);
    }

    int gOffset;
    for( gOffset = (get_global_id(0) * 4); (gOffset + 4 - 1) < N; gOffset += (get_global_size(0) * 4) )
    {
        float4 vReg1, vReg2, res;

#ifdef INCX_NONUNITY
        vReg1 = (float4)( (X + (gOffset*incx))[0 + (incx * 0)],
                          (X + (gOffset*incx))[0 + (incx * 1)],
                          (X + (gOffset*incx))[0 + (incx * 2)],
                          (X + (gOffset*incx))[0 + (incx * 3)] );
#else
        vReg1 = vload4( 0, (__global float *) (X + gOffset) );
#endif

#ifdef INCY_NONUNITY
        vReg2 = (float4)( (Y + (gOffset*incy))[0 + (incy * 0)],
                          (Y + (gOffset*incy))[0 + (incy * 1)],
                          (Y + (gOffset*incy))[0 + (incy * 2)],
                          (Y + (gOffset*incy))[0 + (incy * 3)] );
#else
        vReg2 = vload4( 0, (__global float *) (Y + gOffset) );
#endif

        res = vReg1 * vReg2;
        dotP += res.S0 + res.S1 + res.S2 + res.S3;   // Add-up elements in the vector to give a scalar
    }

    // Loop for the last thread to handle the tail part of the vector
    // Using the same gOffset used above
    for( ; gOffset < N; gOffset++ )
    {
        float sReg1, sReg2, res;
        sReg1 = X[gOffset * incx];
        sReg2 = Y[gOffset * incy];

        res = sReg1 * sReg2;
        dotP = dotP + res;
    }

    // Note: this has to be called outside any if-conditions, because REDUCTION uses barrier
    // dotP of work-item 0 will have the final reduced item of the work-group
    __local float viraW[64];
    uint kFbwL = get_local_id(0);
    viraW[kFbwL] = dotP;
    barrier(CLK_LOCAL_MEM_FENCE);

    if( kFbwL < 32 ) {
        viraW[kFbwL] = viraW[kFbwL] + viraW[kFbwL + 32];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if( kFbwL < 16 ) {
        viraW[kFbwL] = viraW[kFbwL] + viraW[kFbwL + 16];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if( kFbwL < 8 ) {
        viraW[kFbwL] = viraW[kFbwL] + viraW[kFbwL + 8];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if( kFbwL < 4 ) {
        viraW[kFbwL] = viraW[kFbwL] + viraW[kFbwL + 4];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if( kFbwL < 2 ) {
        viraW[kFbwL] = viraW[kFbwL] + viraW[kFbwL + 2];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if( kFbwL == 0 ) {
        dotP = viraW[0] + viraW[1];
    }

    if( (get_local_id(0)) == 0 ) {
        scratchBuff[ get_group_id(0) ] = dotP;
    }
}
--------------------------------------------------------
Build log:
========================================================
Segmentation fault (core dumped)
https://bugs.freedesktop.org/show_bug.cgi?id=100105
Vedran Miletić vedran@miletic.net changed:
           What            |Removed                     |Added
----------------------------------------------------------------------------
           Blocks          |                            |99553
           URL             |                            |http://deeplearning.net/software/libgpuarray/installation.html
Referenced Bugs:
https://bugs.freedesktop.org/show_bug.cgi?id=99553 [Bug 99553] Tracker bug for runnning OpenCL applications on Clover
https://bugs.freedesktop.org/show_bug.cgi?id=100105
Vedran Miletić vedran@miletic.net changed:
           What            |Removed                     |Added
----------------------------------------------------------------------------
           Depends on      |                            |100212
Referenced Bugs:
https://bugs.freedesktop.org/show_bug.cgi?id=100212 [Bug 100212] Implement vload_* and vstore_* to make Theano/libgpuarray working
https://bugs.freedesktop.org/show_bug.cgi?id=100105
Vedran Miletić vedran@miletic.net changed:
           What            |Removed                     |Added
----------------------------------------------------------------------------
           Depends on      |                            |94273
Referenced Bugs:
https://bugs.freedesktop.org/show_bug.cgi?id=94273 [Bug 94273] Clover on RadeonSI OpenCL segfault during testing of clBLAS
https://bugs.freedesktop.org/show_bug.cgi?id=100105
--- Comment #1 from Jan Vesely jan.vesely@rutgers.edu ---
*** Testing for AMD Radeon R7 Graphics (CARRIZO / DRM 3.18.0 / 4.11.0-ROC, LLVM 5.0.0)
Ran 6670 tests in 785.274s
FAILED (SKIP=12, errors=580, failures=12)
All errors are caused by: TypeError: This is for CUDA arrays.
I haven't investigated the failures.
There are a couple of patches needed:
https://github.com/Theano/libgpuarray/pull/534
https://github.com/Theano/libgpuarray/pull/535
http://lists.llvm.org/pipermail/libclc-dev/2017-September/002449.html
and:

diff --git a/src/cluda_opencl.h b/src/cluda_opencl.h
index 6e0095c..e93aa8b 100644
--- a/src/cluda_opencl.h
+++ b/src/cluda_opencl.h
@@ -48,9 +48,9 @@ typedef struct _ga_half {
 } ga_half;
 
 #define ga_half2float(p) vload_half(0, &((p).data))
-static inline ga_half ga_float2half(ga_float f) {
+inline ga_half ga_float2half(ga_float f) {
   ga_half r;
-  vstore_half_rtn(f, 0, &r.data);
+  vstore_half(f, 0, &r.data);
   return r;
 }
diff --git a/src/gpuarray_buffer_opencl.c b/src/gpuarray_buffer_opencl.c
index 8f12811..2041ca2 100644
--- a/src/gpuarray_buffer_opencl.c
+++ b/src/gpuarray_buffer_opencl.c
@@ -146,7 +146,7 @@ cl_ctx *cl_make_ctx(cl_context ctx, gpucontext_props *p) {
   CL_CHECKN(global_err, clGetDeviceInfo(id, CL_DEVICE_VERSION, device_version_size, device_version, NULL));
-  if (device_version[7] == '1' && device_version[9] < '2') {
+  if (device_version[7] == '1' && device_version[9] < '1') {
     error_set(global_err, GA_UNSUPPORTED_ERROR, "We only support OpenCL 1.2 and up");
     return NULL;
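For context on the second hunk: CL_DEVICE_VERSION strings have the form "OpenCL <major>.<minor> <vendor-specific information>", so device_version[7] and device_version[9] are the major and minor digits. The following host-side C sketch (a hypothetical helper, not part of libgpuarray, with a made-up example version string) illustrates what the relaxed check accepts, assuming single-digit version numbers as the original code does:

#include <stdio.h>

/* Mirrors the check patched above: with min_minor == '2' anything below
 * OpenCL 1.2 is rejected; relaxing it to '1' still rejects 1.0 but
 * accepts the OpenCL 1.1 version string that Clover advertised at the time. */
static int version_accepted(const char *device_version, char min_minor)
{
    /* "OpenCL 1.1 ..." -> [7] is the major digit, [9] the minor digit */
    return !(device_version[7] == '1' && device_version[9] < min_minor);
}

int main(void)
{
    printf("%d\n", version_accepted("OpenCL 1.1 Mesa", '2')); /* 0: rejected */
    printf("%d\n", version_accepted("OpenCL 1.1 Mesa", '1')); /* 1: accepted */
    return 0;
}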
https://bugs.freedesktop.org/show_bug.cgi?id=100105

Bug 100105 depends on bug 100212, which changed state.

Bug 100212 Summary: Implement vload_* and vstore_* to make Theano/libgpuarray working
https://bugs.freedesktop.org/show_bug.cgi?id=100212

           What            |Removed                     |Added
----------------------------------------------------------------------------
           Status          |NEW                         |RESOLVED
           Resolution      |---                         |FIXED
https://bugs.freedesktop.org/show_bug.cgi?id=100105
--- Comment #2 from Jan Vesely jan.vesely@rutgers.edu ---
Latest update:

diff --git a/src/cluda_opencl.h b/src/cluda_opencl.h
index 6e0095c..8ba2d14 100644
--- a/src/cluda_opencl.h
+++ b/src/cluda_opencl.h
@@ -48,7 +48,7 @@ typedef struct _ga_half {
 } ga_half;
 
 #define ga_half2float(p) vload_half(0, &((p).data))
-static inline ga_half ga_float2half(ga_float f) {
+inline ga_half ga_float2half(ga_float f) {
   ga_half r;
   vstore_half_rtn(f, 0, &r.data);
   return r;
diff --git a/src/gpuarray_buffer_opencl.c b/src/gpuarray_buffer_opencl.c
index 8f12811..2041ca2 100644
--- a/src/gpuarray_buffer_opencl.c
+++ b/src/gpuarray_buffer_opencl.c
@@ -146,7 +146,7 @@ cl_ctx *cl_make_ctx(cl_context ctx, gpucontext_props *p) {
   CL_CHECKN(global_err, clGetDeviceInfo(id, CL_DEVICE_VERSION, device_version_size, device_version, NULL));
-  if (device_version[7] == '1' && device_version[9] < '2') {
+  if (device_version[7] == '1' && device_version[9] < '1') {
     error_set(global_err, GA_UNSUPPORTED_ERROR, "We only support OpenCL 1.2 and up");
     return NULL;
pygpu.test()
pygpu is installed in /home/jvesely/.local/lib/python3.6/site-packages/pygpu-0.7.5+12.g6f0132c.dirty-py3.6-linux-x86_64.egg/pygpu
NumPy version 1.13.3
NumPy relaxed strides checking option: True
NumPy is installed in /usr/lib64/python3.6/site-packages/numpy
Python version 3.6.4 (default, Mar 13 2018, 18:18:20) [GCC 7.3.1 20180303 (Red Hat 7.3.1-5)]
nose version 1.3.7
*** Testing for AMD Radeon R7 Graphics (CARRIZO / DRM 3.23.0 / 4.15.14-300.fc27.x86_64, LLVM 6.0.0)
----------------------------------------------------------------------
Ran 6670 tests in 995.728s
FAILED (SKIP=12, errors=580, failures=2)
All errors are: TypeError: This is for CUDA arrays.
The two failures are:
FAIL: pygpu.tests.test_elemwise.test_elemwise_f16(<built-in function add>, 'float16', 'float16', (50,))
FAIL: pygpu.tests.test_elemwise.test_elemwise_f16(<built-in function iadd>, 'float16', 'float16', (50,))
Both fail on a half-precision rounding error. For example: 7.0390625 + 7.20703125 is expected to be 14.25, but the GPU returns 14.2421875; the fp32 result is 14.24609375.
The GPU result is rounded down (towards zero), while the CPU result is rounded up (away from zero).
It looks like our vstore_half_rtn is not working as expected, which is weird because it passes CTS.
https://bugs.freedesktop.org/show_bug.cgi?id=100105
--- Comment #3 from Jan Vesely jan.vesely@rutgers.edu ---
(In reply to Jan Vesely from comment #2)
> It looks like our vstore_half_rtn is not working as expected, which is weird because it passes CTS.
I take this back.
vstore_half_rtn rounds to negative infinity (towards 0 for positive numbers). Changing line 53 in cluda_opencl.h:

-  vstore_half_rtn(f, 0, &r.data);
+  vstore_half_rte(f, 0, &r.data);
fixes the two failures.
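For illustration: 14.24609375 sits exactly halfway between the two adjacent binary16 values 14.2421875 and 14.25 (the half-precision spacing in [8, 16) is 2^-7 = 0.0078125), so round-to-nearest-even picks 14.25 while rounding towards negative infinity keeps 14.2421875. The following standalone OpenCL C sketch (not part of libgpuarray; kernel and argument names are made up) shows the two store variants side by side:

__kernel void half_rounding_demo(__global half *out_rtn, __global half *out_rte)
{
    /* 7.0390625 + 7.20703125 == 14.24609375 exactly in fp32 */
    float f = 7.0390625f + 7.20703125f;

    /* round towards negative infinity: stores 14.2421875 (what the GPU returned) */
    vstore_half_rtn(f, 0, out_rtn);

    /* round to nearest even: stores 14.25 (what the test expects) */
    vstore_half_rte(f, 0, out_rte);
}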
Other than advertising OCL 1.2, the remaining failures are NOTOURBUG.
https://bugs.freedesktop.org/show_bug.cgi?id=100105
Jan Vesely jan.vesely@rutgers.edu changed:
           What            |Removed                     |Added
----------------------------------------------------------------------------
           See Also        |                            |https://github.com/Theano/libgpuarray/issues/491
https://bugs.freedesktop.org/show_bug.cgi?id=100105
Jan Vesely jan.vesely@rutgers.edu changed:
           What            |Removed                     |Added
----------------------------------------------------------------------------
           See Also        |                            |https://github.com/Theano/libgpuarray/issues/462
https://bugs.freedesktop.org/show_bug.cgi?id=100105
--- Comment #4 from Jan Vesely jan.vesely@rutgers.edu ---
Lowering CL requirements combined with the following pull requests:
https://github.com/Theano/libgpuarray/pull/571
https://github.com/Theano/libgpuarray/pull/570

results in:

Ran 4970 tests in 1158.909s
OK (SKIP=12)
https://bugs.freedesktop.org/show_bug.cgi?id=100105
--- Comment #5 from Jan Vesely jan.vesely@rutgers.edu ---
(In reply to Jan Vesely from comment #4)
> Lowering CL requirements combined with the following pull requests:
> https://github.com/Theano/libgpuarray/pull/571
> https://github.com/Theano/libgpuarray/pull/570
Both of the above pull requests have been merged with slight modifications. Running with CLOVER_DEVICE_VERSION_OVERRIDE=1.2 CLOVER_DEVICE_CLC_VERSION_OVERRIDE=1.2 (a full example invocation is shown after the results below)
results in:
Ran 6670 tests in 991.622s
OK (SKIP=12)
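For reference, combining the overrides with the test invocation from the original report would look something like this (the DEVICE value is the one used in the original report and may differ on other setups):

$ CLOVER_DEVICE_VERSION_OVERRIDE=1.2 CLOVER_DEVICE_CLC_VERSION_OVERRIDE=1.2 \
  DEVICE="opencl0:0" python -c "import pygpu; pygpu.test()"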
https://bugs.freedesktop.org/show_bug.cgi?id=100105
Timothy Arceri t_arceri@yahoo.com.au changed:
           What            |Removed                         |Added
----------------------------------------------------------------------------
           Component       |Drivers/Gallium/radeonsi        |Gallium/StateTracker/Clover
           Assignee        |dri-devel@lists.freedesktop.org |mesa-dev@lists.freedesktop.org
           QA Contact      |dri-devel@lists.freedesktop.org |mesa-dev@lists.freedesktop.org