It might be related to a hardware bug, or the algorithm is flawed in a way I currently don't see. Anyway the old code we had wasn't so picky about such problems and the patch just tries to make the current code as robust as the old code was, which indeed seems to solve the problems we see.
The wrap-around detection still works (tested by setting the initial fence value to 0xfffffff0 and letting it wrap around shortly after start), so I think we can safely commit this.
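
For reference, a minimal sketch of the kind of wrap-around handling being tested, assuming the hardware exposes a 32-bit sequence number that software extends to 64 bits. The helper name fence_extend_seq and its parameters are hypothetical and not taken from the patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): extend a 32-bit hardware
 * fence sequence number to a 64-bit software counter.
 *
 * last_seq is the last 64-bit value software has seen, hw_seq is the
 * value just read back from the hardware.
 */
static uint64_t fence_extend_seq(uint64_t last_seq, uint32_t hw_seq)
{
    /* Reuse the upper 32 bits of the last known value. */
    uint64_t seq = (last_seq & 0xffffffff00000000ULL) | hw_seq;

    /* If the result went backwards, assume the 32-bit counter wrapped. */
    if (seq < last_seq)
        seq += 0x100000000ULL;

    return seq;
}
```

Starting from 0xfffffff0, a read of 0x3 would then be extended to 0x100000003. Note that this sketch treats any backwards step as a wrap, so a stale or bogus read would be misinterpreted; it is not meant to show how the patch deals with that.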
Can we start the fences off so that we wrap around after, say, 15-20 minutes? That would ensure
a) it's tested, and b) we see a failure in a lifetime.
Dave.