The DDK blob has the ability to mark only certain areas of memory as coherent for performance reasons. For simple things like kmscube I would expect that memory is basically write-only from the CPU, and almost all memory the GPU touches isn't touched by the CPU at all. I.e. coherency isn't helping, and the coherency traffic is probably expensive. Whether the complexity is worth it for "real" content I don't know; it may just be silly benchmarks that benefit.
Right, Panfrost userspace specifically assumes CPU reads of GPU memory to be expensive and treats GPU memory as write-only *except* for a few special cases (compute-like workloads, glReadPixels, some blits, etc.).
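To make that policy concrete, here is a rough sketch in C of how one might decide which buffers need coherent treatment. Every name in it (`buf_usage`, `mem_policy`, `cpu_reads_expected`, `pick_mem_policy`) is made up for illustration; this is not Panfrost's or the DDK's actual code, just the "write-only unless it's one of the special cases" rule written down.

```c
#include <stdbool.h>

/* Hypothetical buffer usages, named for illustration only. */
enum buf_usage {
    BUF_VERTEX_UPLOAD,   /* CPU writes once, GPU reads */
    BUF_TEXTURE_UPLOAD,  /* CPU writes once, GPU reads */
    BUF_SCANOUT,         /* GPU writes, display reads (zero-copy) */
    BUF_COMPUTE_RESULT,  /* compute-like workload, CPU reads results back */
    BUF_READ_PIXELS,     /* glReadPixels-style readback */
    BUF_BLIT_STAGING,    /* a blit that goes through the CPU */
};

/* Hypothetical policy: is the buffer worth marking coherent? */
enum mem_policy {
    MEM_NONCOHERENT,  /* default: CPU treats the buffer as write-only */
    MEM_COHERENT,     /* CPU is expected to read GPU-written data */
};

/* Only the few special cases are assumed to involve CPU reads. */
static bool cpu_reads_expected(enum buf_usage usage)
{
    switch (usage) {
    case BUF_COMPUTE_RESULT:
    case BUF_READ_PIXELS:
    case BUF_BLIT_STAGING:
        return true;
    default:
        return false;
    }
}

/* Mark a buffer coherent only when the CPU is expected to read it. */
static enum mem_policy pick_mem_policy(enum buf_usage usage)
{
    return cpu_reads_expected(usage) ? MEM_COHERENT : MEM_NONCOHERENT;
}
```

Under this sketch everything kmscube allocates would land in the non-coherent bucket, and only readback-style buffers would pay for coherency.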
The vast majority of the GPU memory - everything used in kmscube - will be write-only from the CPU and fed directly into the display zero-copy (or read by the GPU later as a dmabuf).