OK, I found that -mtune=generic is the culprit for the performance drop :). I played a little with -mtune to find the minimum setting this code needs to run fast:

-mtune=i586 = slow
-mtune=pentium = slow
-mtune=pentium-mmx = slow
-mtune=pentiumpro = fast
-mtune=i686 = fast
-mtune=pentium3 = fast
etc...

So -mtune=generic seems to tune for a lower CPU target than this code needs to run fast.
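For anyone who wants to repeat the comparison, here is a rough sketch of how the -mtune values could be swept in a loop. The file name bench.c, the -O2 level and the bench-* output names are assumptions for illustration, not what was actually used here. GCC spells the Pentium Pro value "pentiumpro". Some toolchains may reject the older values when targeting x86-64, so the loop just reports a failed build and moves on.

```shell
#!/bin/bash
# Hypothetical benchmark sweep: SRC, CC and the -O2 level are assumptions,
# not taken from the original test.
SRC=${SRC:-bench.c}
CC=${CC:-gcc}

# -mtune values to compare, from oldest to newest target.
TUNES="generic i586 pentium pentium-mmx pentiumpro i686 pentium3"

# Print the compile command for one -mtune value.
build_cmd() {
    echo "$CC -O2 -mtune=$1 -o bench-$1 $SRC"
}

# Build the program once per -mtune value and time each binary.
run_all() {
    for t in $TUNES; do
        cmd=$(build_cmd "$t")
        echo "== $cmd"
        if $cmd; then
            time "./bench-$t"
        else
            echo "build failed for -mtune=$t" >&2
        fi
    done
}

# Only build and run when called with the "run" argument.
if [ "${1:-}" = "run" ]; then
    run_all
fi
```

Saved under any name (say sweep.sh, also hypothetical), it would be started with "bash sweep.sh run". Since -mtune only changes scheduling and instruction selection preferences, not the instruction set, every binary that builds should still run on the same machine, which makes the timings directly comparable.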