It looks like clEnqueueNDRangeKernel is recompiling (or at least optimising?) the kernel, and taking a very long time to do so. Isn't compilation supposed to happen *once*, in clBuildProgram?