Approximate Math Library
#1
Posted 16 January 2007 - 03:18 PM
I'm looking for the Intel Approximate Math Library. It used to be at www.intel.com/design/Pentium4/devtools/ but that page moved and I couldn't locate the library.
So, does anyone know where to find it, or can send me a copy? That would be much appreciated. :worthy: I'm also interested in other approximation libraries/functions. The more options between precision and performance the better...
Thanks a lot,
Nick
#2
Posted 17 January 2007 - 04:21 PM
#3
Posted 17 January 2007 - 05:22 PM
So the Intel Approximate Math Library contains approximations of transcendental functions, optimized for SSE. For some time I have used my own implementations (sin/cos) but now I'm looking for more precision with minimal performance impact...
#4
Posted 17 January 2007 - 06:17 PM
#5
Posted 18 January 2007 - 10:29 AM
#6
Posted 18 January 2007 - 05:46 PM
float exp2(float x)
{
static const float one = 1.0f;
static const float half = 0.5f;
static const float lim_two = 1.9999999f;
static const float C0[256] = {.547404266246f, .548061348142f, .548725333807f, .549398399167f, .550075922598f, .550760241024f, .551454706663f, .552154084359f, .552858883435f, .553574967496f, .554297163206f, .555026093302f, .555760495106f, .556505137841f, .557256046635f, .558017083881f, .558781729040f, .559558847539f, .560337005182f, .561128551344f, .561926919212f, .562731360892f, .563548838504f, .564371785305f, .565202403662f, .566041175842f, .566885148265f, .567741995681f, .568603271104f, .569474540778f, .570356611167f, .571248305001f, .572143490366f, .573051865152f, .573965673107f, .574892291723f, .575822287504f, .576764054387f, .577713794250f, .578675521802f, .579646563842f, .580622066161f, .581609491764f, .582606143121f, .583612970496f, .584629192183f, .585655505258f, .586690228227f, .587733473599f, .588788523879f, .589851795687f, .590925279595f, .592009141821f, .593102901511f, .594205792172f, .595321583429f, .596446634193f, .597577914584f, .598723732752f, .599879808291f, .601047527774f, .602222182051f, .603410536625f, .604607039548f, .605814947112f, .607033833104f, .608266709013f, .609506083140f, .610760886405f, .612025725384f, .613302240932f, .614585463030f, .615883610698f, .617191217751f, .618511752134f, .619845226500f, .621187546929f, .622542829136f, .623914626393f, .625290774906f, .626682429570f, .628088718711f, .629505305967f, .630928111994f, .632369622796f, .633820437496f, .635288816972f, .636766410268f, .638255823256f, .639759511520f, .641275658394f, .642806215397f, .644347352744f, .645904478891f, .647470860480f, .649053754624f, .650645737483f, .652253552238f, .653875847907f, .655506204107f, .657158529182f, .658824473497f, .660502429383f, .662188430285f, .663896943735f, .665614322456f, .667344320807f, .669091861220f, .670854384730f, .672633065213f, .674418265280f, .676229143542f, .678048816547f, .679881705149f, .681730321995f, .683593925186f, .685478115237f, .687372988183f, .689280701315f, .691208589068f, .693153490063f, .695110683979f, .697086170558f, .699075467598f, .701082480523f, .703098111035f, .705136926814f, .707188064752f, .709255034137f, .711344041865f, .713449959172f, .715567879355f, .717699596625f, .719857427125f, .722028307800f, .724216235913f, .726413096106f, .728639561732f, .730875455600f, .733132357639f, .735405868218f, .737702530438f, .740009520055f, .742332281356f, .744679668196f, .747037600146f, .749420315605f, .751820376930f, .754243487814f, .756683762441f, .759133959936f, .761605643019f, .764099729889f, .766614919527f, .769146093637f, .771693145114f, .774267141070f, .776858393243f, .779466882269f, .782096835996f, .784742415436f, .787410123430f, .790101570173f, .792815520645f, .795542431752f, .798294870351f, .801063242968f, .803849366342f, .806665638513f, .809494760355f, .812353037567f, .815226352727f, .818124218289f, .821037431411f, .823977286773f, .826945359586f, .829920082804f, .832926883189f, .835951538712f, .839006712543f, .842080344842f, .845177216775f, .848284765471f, .851423322415f, .854588979701f, .857776791512f, .860984910999f, .864215879261f, .867469536588f, .870748683758f, .874051542963f, .877378717421f, .880731452223f, .884101609574f, .887502615964f, .890925768871f, .894372845122f, .897843135408f, .901337846858f, .904862544340f, .908410015848f, .911984035138f, .915576759348f, .919196625760f, .922845709434f, .926529189028f, .930225124843f, .933945305924f, .937700927963f, .941479193749f, .945282181080f, .949117765062f, .952980179587f, .956873412255f, .960779020905f, .964718014705f, .968687508318f, .972685203315f, .976705892948f, .980760930721f, .984836290213f, .988951866922f, .993080243943f, .997245671924f, 1.00144630340f, 1.00567323430f, 1.00991602499f, 1.01419270811f, 1.01850230704f, 1.02285032212f, 1.02721871206f, 1.03161636643f, 1.03604653022f, 1.04050699395f, 1.04499850272f, 1.04951852054f, 1.05407459978f, 1.05865921545f, 1.06326856892f, 1.06791602645f, 1.07259384061f, 1.07729671165f, 1.08203878536f, 1.08680950686f, 1.09161248555f, 1.09645085758f, 1.10132609310f, 1.10623766041f, 1.11114750589f, 1.11612095277f, 1.12113714360f, 1.12616425808f, 1.13122641447f, 1.13631798693f, 1.14146805990f, 1.14663629983f};
static const float C1[256] = {.212044759469f, .210735702190f, .209418015443f, .208087475652f, .206753275519f, .205410853969f, .204053729978f, .202692211082f, .201325322173f, .199941790541f, .198551704706f, .197153917470f, .195750881257f, .194333564491f, .192909615777f, .191471793032f, .190032461265f, .188575010632f, .187120933209f, .185647204506f, .184166170344f, .182679245802f, .181173672353f, .179663467005f, .178144635485f, .176616349360f, .175084037800f, .173533838064f, .171981116491f, .170415875372f, .168836776605f, .167246022818f, .165654589240f, .164045289244f, .162431953647f, .160801620569f, .159170951952f, .157525268876f, .155871296686f, .154202124016f, .152522495456f, .150840826888f, .149144311266f, .147437665224f, .145719346014f, .143990758331f, .142250789144f, .140502354043f, .138745315445f, .136974218191f, .135195151140f, .133404847135f, .131603099825f, .129790785291f, .127969230244f, .126132298677f, .124286064322f, .122435531964f, .120567182576f, .118688092798f, .116796095743f, .114898862963f, .112985536212f, .111065131312f, .109132474142f, .107188321781f, .105227969151f, .103263383910f, .101280483950f, .99287883117e-1f, .97283061235e-1f, .95273867908e-1f, .93247492774e-1f, .91212558811e-1f, .89163726456e-1f, .87101072766e-1f, .85030989812e-1f, .82947185421e-1f, .80844315542e-1f, .78741073956e-1f, .76620455758e-1f, .74483906833e-1f, .72338082587e-1f, .70189189871e-1f, .68018435591e-1f, .65840082252e-1f, .63641802316e-1f, .61436186248e-1f, .59219384845e-1f, .56987821889e-1f, .54744275251e-1f, .52485937632e-1f, .50218520627e-1f, .47934149447e-1f, .45642766126e-1f, .43333822826e-1f, .41018216474e-1f, .38686201364e-1f, .36339829169e-1f, .33988431451e-1f, .31612040002e-1f, .29222787772e-1f, .26823040651e-1f, .24418495167e-1f, .21988609993e-1f, .19552890922e-1f, .17106039552e-1f, .14641181181e-1f, .12162020545e-1f, .9666998685e-2f, .7169663523e-2f, .4643309730e-2f, .2111596679e-2f, -.431598824e-3f, -.2989690662e-2f, -.5561577943e-2f, -.8154880822e-2f, -.10755886294e-1f, -.13367524684e-1f, -.15999743658e-1f, -.18648123903e-1f, -.21306166293e-1f, -.23981953789e-1f, -.26669330748e-1f, -.29373502333e-1f, -.32082162062e-1f, -.34814814297e-1f, -.37556805223e-1f, -.40312772251e-1f, -.43090887470e-1f, -.45884226576e-1f, -.48686220012e-1f, -.51499213346e-1f, -.54339341155e-1f, -.57189308948e-1f, -.6005430929e-1f, -.6292367512e-1f, -.6582431504e-1f, -.6872984049e-1f, -.7165524503e-1f, -.7459473297e-1f, -.7755666791e-1f, -.8052444140e-1f, -.8350502578e-1f, -.8650968268e-1f, -.8952031711e-1f, -.9255503332e-1f, -.9560425465e-1f, -.9867513068e-1f, -.10176011141f, -.10485000844f, -.10795935979f, -.11108920339f, -.11423780756f, -.11739868933f, -.12057166547f, -.12377042647f, -.12698282286f, -.13020878013f, -.13345344254f, -.13670954099f, -.13998500792f, -.14328171413f, -.14659803793f, -.14992226374f, -.15326963975f, -.15662841273f, -.16000074335f, -.16340152411f, -.16680978718f, -.17024508632f, -.17369037027f, -.17715697361f, -.18063381546f, -.18413430337f, -.18766017755f, -.19118577445f, -.19474116819f, -.19830943646f, -.20190541522f, -.20551481061f, -.20914316989f, -.21277573100f, -.21643619837f, -.22011987890f, -.22382091824f, -.22753710375f, -.23127131148f, -.23502327513f, -.23879613644f, -.24258775687f, -.24639874443f, -.25023043386f, -.25407345559f, -.25794304031f, -.26182918290f, -.26573382630f, -.26965609165f, -.27359726148f, -.27756351145f, -.28154663028f, -.28555077208f, -.28956708486f, -.29360493441f, -.29766652900f, -.30175749946f, -.30585341675f, -.30996731111f, -.31411145322f, -.31827160793f, -.32245000439f, -.32665519190f, -.33088073838f, -.33513090577f, -.33938551392f, -.34366739446f, -.34797328516f, -.35230059730f, -.35664361881f, -.36101452016f, -.36539810091f, -.36981565376f, -.37423767670f, -.37869008365f, -.38317075577f, -.38767009071f, -.39217694488f, -.39671041296f, -.40126933906f, -.40585941045f, -.41046149518f, -.41508490180f, -.41973294431f, -.42440320388f, -.42909637229f, -.43380970842f, -.43855098623f, -.44331228018f, -.44808957363f, -.45289663497f, -.45772533171f, -.46257013252f, -.46744551144f, -.47234051123f, -.47725875550f, -.48220335041f, -.48717569030f, -.49217510643f, -.49716286338f, -.50220524890f, -.50728090192f, -.51235755815f, -.51745955485f, -.52258113148f, -.52775139108f, -.53292971984f};
static const float C2[256] = {.240550974386f, .241202959893f, .241856701004f, .242514265929f, .243171103188f, .243829457574f, .244492480331f, .245155117376f, .245817854428f, .246486128053f, .247155039241f, .247825134276f, .248495241281f, .249169654431f, .249844713280f, .250523830778f, .251201163433f, .251884509323f, .252563787406f, .253249747676f, .253936606109f, .254623711603f, .255316926931f, .256009778763f, .256704095772f, .257400249962f, .258095764201f, .258796915826f, .259496734037f, .260199725451f, .260906458608f, .261615922477f, .262323223326f, .263035992751f, .263748082312f, .264465202209f, .265180012111f, .265898946252f, .266619044658f, .267343298130f, .268069619098f, .268794376437f, .269523080941f, .270253687830f, .270986837738f, .271721917527f, .272459385827f, .273197995608f, .273937798870f, .274681077710f, .275425261899f, .276171707452f, .276920486684f, .277671219387f, .278423348235f, .279179385328f, .279936813823f, .280693582599f, .281455206542f, .282218776699f, .282985153731f, .283751228711f, .284521373431f, .285291943163f, .286065008210f, .286840250780f, .287619522875f, .288398060167f, .289181428600f, .289966204157f, .290753368791f, .291539837984f, .292330618012f, .293122323170f, .293917023197f, .294714665277f, .295512769094f, .296313753856f, .297119642359f, .297923266738f, .298731122037f, .299542626941f, .300355242358f, .301166621541f, .301983850061f, .302801532724f, .303624282505f, .304447368111f, .305272224846f, .306100168085f, .306930150725f, .307763195441f, .308597186775f, .309435004672f, .310272993070f, .311114998221f, .311957035187f, .312802638726f, .313651045018f, .314498876947f, .315353314930f, .316209965286f, .317067971632f, .317925302128f, .318789261216f, .319652892561f, .320518078729f, .321387232845f, .322259028085f, .323133993420f, .324007380664f, .324888508986f, .325769102908f, .326651295684f, .327536259333f, .328423600216f, .329315922731f, .330208493193f, .331102319414f, .332000786889f, .332902364781f, .333804829061f, .334710914663f, .335618521194f, .336529395375f, .337439388370f, .338355041149f, .339271424585f, .340190082999f, .341113718319f, .342040006406f, .342966760909f, .343894760691f, .344829302014f, .345764673269f, .346702572919f, .347639507987f, .348584246701f, .349528173351f, .350476153452f, .351426290935f, .352381270258f, .353335726296f, .354291903376f, .355253394261f, .356214397492f, .357180679851f, .358149170924f, .359122122768f, .360097126490f, .361071279072f, .362049162020f, .363031076715f, .364016460615f, .365003272896f, .365991452648f, .366985245318f, .367980853922f, .368978251405f, .369979013823f, .370980890602f, .371986312192f, .372995831811f, .374008931067f, .375022025929f, .376039753227f, .377058524791f, .378078994230f, .379105644931f, .380132134704f, .381164337027f, .382197114688f, .383233855329f, .384271234863f, .385313243354f, .386360369799f, .387404990922f, .388456010655f, .389508406520f, .390566534906f, .391626171784f, .392688937114f, .393750505484f, .394817797166f, .395889415356f, .396963639187f, .398039817634f, .399118775070f, .400200421917f, .401285649507f, .402373826907f, .403465115321f, .404559881707f, .405655440162f, .406756120792f, .407859059092f, .408964796284f, .410073072953f, .411184239170f, .412300018644f, .413418085218f, .414539592392f, .415662053303f, .416788077635f, .417918262200f, .419054147477f, .420188943978f, .421326262837f, .422469476617f, .423614637508f, .424762353579f, .425914955357f, .427070660204f, .428230616859f, .429389315399f, .430552969348f, .431720668445f, .432891695257f, .434064493914f, .435242336232f, .436421114570f, .437606536221f, .438790675206f, .439980464677f, .441175310294f, .442372633912f, .443569472135f, .444770890044f, .445976559606f, .447187960487f, .448400030952f, .449615218011f, .450834377285f, .452056858160f, .453282828803f, .454511559547f, .455745060764f, .456981256106f, .458219094270f, .459462130949f, .460708242561f, .461955996149f, .463209104131f, .464464733125f, .465723802255f, .466987089818f, .468254934518f, .469527143677f, .470793869973f, .472071940011f, .473355895138f, .474637566791f, .475923104541f, .477211044044f, .478508675487f, .479805784911f};
#define EXP2_CHECK_FOR_LIMITS 1
__asm
{
movss xmm0, x
movss xmm1, xmm0
subss xmm1, half
cvtss2si ecx, xmm1
cvtsi2ss xmm2, ecx
add ecx, 127
#if EXP2_CHECK_FOR_LIMITS
test ecx, 0xFFFFFF00
jnz overflow
#endif
shl ecx, 23
movd xmm7, ecx
subss xmm0, xmm2
addss xmm0, one
minss xmm0, lim_two
movd eax, xmm0
and eax, 0x007F8000
shr eax, 13
movss xmm1, C2[eax]
mulss xmm1, xmm0
addss xmm1, C1[eax]
mulss xmm1, xmm0
addss xmm1, C0[eax]
mulss xmm1, xmm7
movss x, xmm1
#if EXP2_CHECK_FOR_LIMITS
jmp end
overflow:
comiss xmm0, one
jp not_a_number
jb underflow
mov x, 0x7F800000
jmp end
underflow:
mov x, 0x00000000
jmp end
not_a_number:
movss x, xmm0
end:
#endif
}
return x;
}
It's accurate to 2 ulps. Basically it uses the 8 upper mantissa bits to lookup the coefficients of a quadratic polynomial, determined with the minimax algorithm. It handles all inputs (denormals, infinity, NaN), but flushes denormal results to 0.0.All ideas to make it more accurate and/or faster are welcome!
#7
Posted 18 January 2007 - 10:46 PM
#8
Posted 18 January 2007 - 11:43 PM
#10
Posted 20 January 2007 - 09:48 AM
Here's my own vector implementation:
#include <emmintrin.h>
__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127
static const __declspec(align(16)) int C5[4] = {0x3AF61905, 0x3AF61905, 0x3AF61905, 0x3AF61905}; // 1.8775767e-3f
static const __declspec(align(16)) int C4[4] = {0x3C134806, 0x3C134806, 0x3C134806, 0x3C134806}; // 8.9893397e-3f
static const __declspec(align(16)) int C3[4] = {0x3D64AA23, 0x3D64AA23, 0x3D64AA23, 0x3D64AA23}; // 5.5826318e-2f
static const __declspec(align(16)) int C2[4] = {0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4}; // 2.4015361e-1f
static const __declspec(align(16)) int C1[4] = {0x3F31727B, 0x3F31727B, 0x3F31727B, 0x3F31727B}; // 6.9315308e-1f
static const __declspec(align(16)) int C0[4] = {0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF}; // 9.9999994e-1f
__asm
{
movaps xmm0, x
minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
paddd xmm2, base
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C5
mulps xmm1, xmm0
addps xmm1, C4
mulps xmm1, xmm0
addps xmm1, C3
mulps xmm1, xmm0
addps xmm1, C2
mulps xmm1, xmm0
addps xmm1, C1
mulps xmm1, xmm0
addps xmm1, C0
mulps xmm1, xmm2
movaps x, xmm1
}
return x;
}
Unlike the previous version it requires SSE2. It doesn't use any lookup tables, but instead uses a polynomial of degree five. This creates a fairly long dependency chain, wich means a high latency. But at 21 instructions this is compensated with a high throughput. In English this means it's very suited for computing lots of 2x values in a loop.It has a relative accuracy of 2-22.33. It returns infinity on overflow, 0.0 on underflow, and keeps it accracy with denormals. For NaN inputs it returns infinity.
#11
Posted 20 January 2007 - 02:29 PM
Here's a version with a weighted minimax polynomial of order four, yielding 2-18.54 relative precision:
#include <emmintrin.h>
__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127
static const __declspec(align(16)) int C4[4] = {0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69}; // 1.3534167e-2f
static const __declspec(align(16)) int C3[4] = {0x3D5509F9, 0x3D5509F9, 0x3D5509F9, 0x3D5509F9}; // 5.2011464e-2f
static const __declspec(align(16)) int C2[4] = {0x3E773CC5, 0x3E773CC5, 0x3E773CC5, 0x3E773CC5}; // 2.4144275e-1f
static const __declspec(align(16)) int C1[4] = {0x3F3168B3, 0x3F3168B3, 0x3F3168B3, 0x3F3168B3}; // 6.9300383e-1f
static const __declspec(align(16)) int C0[4] = {0x3F800016, 0x3F800016, 0x3F800016, 0x3F800016}; // 1.0000026f
__asm
{
movaps xmm0, x
minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
paddd xmm2, base
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C4
mulps xmm1, xmm0
addps xmm1, C3
mulps xmm1, xmm0
addps xmm1, C2
mulps xmm1, xmm0
addps xmm1, C1
mulps xmm1, xmm0
addps xmm1, C0
mulps xmm1, xmm2
movaps x, xmm1
}
return x;
}
Here's a version with a weighted minimax polynomial of order three, yielding 2-13.70 relative precision:
#include <emmintrin.h>
__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127
static const __declspec(align(16)) int C3[4] = {0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52}; // 7.8024521e-2f
static const __declspec(align(16)) int C2[4] = {0x3E677E26, 0x3E677E26, 0x3E677E26, 0x3E677E26}; // 2.2606716e-1f
static const __declspec(align(16)) int C1[4] = {0x3F322226, 0x3F322226, 0x3F322226, 0x3F322226}; // 6.9583356e-1f
static const __declspec(align(16)) int C0[4] = {0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19}; // 9.9992520e-1f
__asm
{
movaps xmm0, x
minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
paddd xmm2, base
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C3
mulps xmm1, xmm0
addps xmm1, C2
mulps xmm1, xmm0
addps xmm1, C1
mulps xmm1, xmm0
addps xmm1, C0
mulps xmm1, xmm2
movaps x, xmm1
}
return x;
}
Here's a version with a weighted minimax polynomial of order two, yielding 2-9.17 relative precision:
#include <emmintrin.h>
__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000}; // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF}; // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000}; // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F}; // 127
static const __declspec(align(16)) int C2[4] = {0x3EACA418, 0x3EACA418, 0x3EACA418, 0x3EACA418}; // 3.3718944e-1f
static const __declspec(align(16)) int C1[4] = {0x3F285ADA, 0x3F285ADA, 0x3F285ADA, 0x3F285ADA}; // 6.5763628e-1f
static const __declspec(align(16)) int C0[4] = {0x3F803884, 0x3F803884, 0x3F803884, 0x3F803884}; // 1.0017247f
__asm
{
movaps xmm0, x
minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
paddd xmm2, base
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C2
mulps xmm1, xmm0
addps xmm1, C1
mulps xmm1, xmm0
addps xmm1, C0
mulps xmm1, xmm2
movaps x, xmm1
}
return x;
}
Every floating-point number has been tested, using an exhaustive loop. For the record, Intel's library claims an average relative error of 0.0018, but it peaks at a terrible 0.29 relative error.
#12
Posted 20 January 2007 - 03:56 PM
But I don't think I'll touch C++ in the sooner time at all anymore... And I don't have my code lying around, else, it would be a simple copy-paste.
But I'll remember :)
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
#13
Posted 20 January 2007 - 06:09 PM
davepermen said:
#14
Posted 25 January 2007 - 04:25 PM
__m128 log2(__m128 x)
{
static const __declspec(align(16)) int exp[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000};
static const __declspec(align(16)) int one[4] = {0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000}; // 1.0f
static const __declspec(align(16)) int off[4] = {0x3FBF8000, 0x3FBF8000, 0x3FBF8000, 0x3FBF8000}; // 1.4960938f
static const __declspec(align(16)) int mant[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF};
static const __declspec(align(16)) int sign[4] = {0x80000000, 0x80000000, 0x80000000, 0x80000000};
static const __declspec(align(16)) int base[4] = {0x43800000, 0x43800000, 0x43800000, 0x43800000}; // 256.0f
static const __declspec(align(16)) int C5[4] = {0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5}; // -3.4436006e-2f
static const __declspec(align(16)) int C4[4] = {0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD}; // 3.1821337e-1f
static const __declspec(align(16)) int C3[4] = {0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9}; // -1.2315303f
static const __declspec(align(16)) int C2[4] = {0x4026537B, 0x4026537B, 0x4026537B, 0x4026537B}; // 2.5988452f
static const __declspec(align(16)) int C1[4] = {0xC054bFAD, 0xC054bFAD, 0xC054bFAD, 0xC054bFAD}; // -3.3241990f
static const __declspec(align(16)) int C0[4] = {0x4047691A, 0x4047691A, 0x4047691A, 0x4047691A}; // 3.1157899f
__asm
{
movaps xmm0, x
movaps xmm1, xmm0
andps xmm1, exp
psrld xmm1, 8
orps xmm1, one
subps xmm1, off
mulps xmm1, base
andps xmm0, mant
orps xmm0, one
movaps xmm2, C5
mulps xmm2, xmm0
addps xmm2, C4
mulps xmm2, xmm0
addps xmm2, C3
mulps xmm2, xmm0
addps xmm2, C2
mulps xmm2, xmm0
addps xmm2, C1
mulps xmm2, xmm0
addps xmm2, C0
subps xmm0, one
mulps xmm0, xmm2
addps xmm1, xmm0
movaps x, xmm1
}
return x;
}
Unfortunately logarithms don't converge very fast. It's also not that easy to get good behaviour for the limits. :sad:
#15
Posted 25 January 2007 - 04:41 PM
#16
Posted 25 January 2007 - 08:33 PM
Kenneth Gorking said:
Anyway, I'm considering making it an article for the 'Daily Code Gem', including benchmark results. I can't promise when I have the time for it though...
#17
Posted 26 January 2007 - 12:23 AM
Nick said:
Yes. You and you're work are the only reason to regret this... But sorry, thats not enough... :)
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....
#18
Posted 26 January 2007 - 03:40 AM
I'll quote from "AMaths.pdf", page 5:
Quote
Compared to the packed AM Library functions, the scalar ones are slower per operand but
faster per call. Therefore, when just one result is needed, it is faster to call the scalar version.
Otherwise (for two or more results), the packed version should be used.
Regards,
Ciao ciao : )
(readin' this? you ought to get out more)
#19
Posted 26 January 2007 - 06:27 AM
davepermen said:
A native C# compiler with inline assembly would convince me in a second though...
#20
Posted 26 January 2007 - 06:47 AM
Nautilus said:
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users












