C - GCC (6.1.0) using 'wrong' instructions in SSE intrinsics
Background: I develop a computationally intensive tool, written in C/C++, that has to be able to run on a variety of different x86_64 processors. To speed up the calculations, which are both float and integer, the code contains rather a lot of SSE* intrinsics, with different paths tailored to different CPU SSE capabilities. (As the CPU flags are detected at the start of the program and used to set booleans, I've assumed that branch prediction for the tailored blocks of code will work very effectively.)
For simplicity I've assumed that only SSE2 through SSE4.2 need to be considered.
In order to access the SSE4.2 intrinsics for the 4.2 paths, I need to use GCC's -msse4.2 option.
The problem: the issue I'm having is that, at least with 6.1.0, GCC goes and implements an SSE2 intrinsic, _mm_cvtsi32_si128, with pinsrd, an SSE4.1 instruction that -msse4.2 also enables.
If I limit the compilation using -msse2, it uses the SSE2 instruction, movd, i.e. the one the Intel "Intrinsics Guide" says it's supposed to use.
This is annoying on two counts.
1) The critical problem is that the program now crashes with an illegal instruction when it gets run on a pre-4.2 CPU. I don't have any control over what hardware is used, so the executable needs to be compatible with older machines, yet it needs to take advantage of the features of newer hardware where available.
2) According to the Intel Intrinsics Guide, the pinsrd instruction is quite a lot slower than the movd it replaces. (pinsrd is more general, but that generality is not needed here.)
Does anyone know how to make GCC just use the instructions the Intrinsics Guide says should be used, yet still allow access to all of SSE2 through SSE4* in the same compilation unit?
Update: I should also note that the same code is compiled under Linux, Windows and OS X using a variety of different compilers, so I'd rather avoid, or at least have the fewest, compiler-specific extensions if possible.
Update 2: (Thanks to @PeterCordes) It seems that if optimisation is enabled, GCC reverts to using movd instead of pinsrd where appropriate.
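For reference, a minimal sketch of the kind of function involved might look like the following. The helper names are hypothetical and not taken from the original program. Compiling it to assembly without optimisation, once with -msse2 and once with -msse4.2, and comparing the two outputs is one way to check which instruction a given compiler version picks for the intrinsic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: place a 32-bit integer in the low lane of an
   XMM register and zero the upper lanes. This is the SSE2 intrinsic
   the question is about. */
__m128i low_lane_from_int(int x) {
    return _mm_cvtsi32_si128(x);
}

/* Hypothetical helper: read the low 32-bit lane back out. */
int low_lane_to_int(__m128i v) {
    return _mm_cvtsi128_si32(v);
}
```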
If you give the -msse4.2 flag on GCC's command line during the compilation step, it will assume that it is free to use anything up to the SSE 4.2 instruction set for the entire translation unit. This can lead to the behavior that you described. If you need code that only uses SSE2 and below, then using -msse2 (or no flag at all if you're building for x86_64) is required.
Some options that I can think of are:
If you can break down your code at the function level, then GCC's multiversioning feature can help. It requires a relatively recent version of the compiler, but it allows you to do things like this (taken from the link above):
    #include <cassert>

    __attribute__ ((target ("default")))
    int foo ()
    {
      // The default version of foo.
      return 0;
    }

    __attribute__ ((target ("sse4.2")))
    int foo ()
    {
      // foo version for SSE4.2
      return 1;
    }

    __attribute__ ((target ("arch=atom")))
    int foo ()
    {
      // foo version for the Intel Atom processor
      return 2;
    }

    __attribute__ ((target ("arch=amdfam10")))
    int foo ()
    {
      // foo version for AMD Family 0x10 processors.
      return 3;
    }

    int main ()
    {
      int (*p)() = &foo;
      assert ((*p) () == foo ());
      return 0;
    }
In this example, GCC will automatically compile the different versions of foo() and dispatch to the appropriate one at runtime based on the CPU's capabilities.
You can break the different implementations (SSE2, SSE4.2, etc.) out into different translation units, each compiled with its own flags, then dispatch to the right implementation at runtime.
You can put all of the SIMD code into a shared library and build the shared library multiple times with different compiler flags. At runtime, you can detect the CPU's capabilities and load the appropriate version of the shared library. This is the approach taken by libraries like Intel's Math Kernel Library.
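The manual-dispatch idea from the options above can be sketched in a single file as follows. This is only an illustration under stated assumptions: the function names (hmax4, hmax4_sse2, hmax4_sse42) are hypothetical, and it relies on two GCC/Clang extensions, the per-function target attribute and __builtin_cpu_supports, so it is not portable to every compiler the question mentions. The SSE4 path is marked with the attribute instead of compiling the whole file with -msse4.2, so everything else in the file stays limited to the baseline instruction set.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics; also pulls in the SSE2 headers */
#include <stdint.h>

/* SSE2-only 32-bit signed max: select a where a > b, otherwise b. */
static __m128i max_epi32_sse2(__m128i a, __m128i b) {
    __m128i gt = _mm_cmpgt_epi32(a, b);
    return _mm_or_si128(_mm_and_si128(gt, a), _mm_andnot_si128(gt, b));
}

/* Baseline path: horizontal max of four int32 values using SSE2 only. */
static int32_t hmax4_sse2(const int32_t *v) {
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    /* Swap the 64-bit halves, then swap adjacent lanes. */
    x = max_epi32_sse2(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)));
    x = max_epi32_sse2(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(x);
}

/* SSE4 path: only this function may use SSE4.x instructions
   (_mm_max_epi32 is an SSE4.1 intrinsic). */
__attribute__((target("sse4.2")))
static int32_t hmax4_sse42(const int32_t *v) {
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_max_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2)));
    x = _mm_max_epi32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(x);
}

/* Runtime dispatch: pick the path the running CPU supports. */
int32_t hmax4(const int32_t *v) {
    if (__builtin_cpu_supports("sse4.2")) {
        return hmax4_sse42(v);
    }
    return hmax4_sse2(v);
}
```

Calling hmax4 on an array of four int32_t values returns the largest of them on either path. Where compiler extensions have to be avoided, the same structure works with the two implementations moved into separate source files built with different flags, leaving only the CPU check compiler-specific.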