0
102 Jan 16, 2007 at 15:18

Hi all,

I’m looking for the Intel Approximate Math Library. It used to be at www.intel.com/design/Pentium4/devtools/ but that page moved and I couldn’t locate the library.

So, does anyone know where to find it, or can send me a copy? That would be much appreciated. :worthy: I’m also interested in other approximation libraries/functions. The more options between precision and performance the better…

Thanks a lot,

Nick

#### 24 Replies

0
101 Jan 17, 2007 at 16:21

I have never heard of this library, but AMD has a library of their. I’m sure you know this already, but I’ll post it anyway just in case: http://developer.amd.com/acml.jsp

0
102 Jan 17, 2007 at 17:22

AMD’s library doesn’t really have what I’m looking for. I seek fast implementations of transcendental functions (exp, log, sin, etc). Thanks anyway, it was worth checking out.

So the Intel Approximate Math Library contains approximations of transcendental functions, optimized for SSE. For some time I have used my own implementations (sin/cos) but now I’m looking for more precision with minimal performance impact…

0
101 Jan 17, 2007 at 18:17

The CRT has a few functions implemented using SSE2 (see here), but they are most likely built for precision. The CRT source does however come with VS2005, so you could dig up those files and enhance them for speed :)

0
102 Jan 18, 2007 at 10:29

Interesting! It seems pretty hard to derive a faster (single-precision) version from it, but it’s always good to have a reference.

0
102 Jan 18, 2007 at 17:46

Since the Intel library is still nowhere to be found, I started writing my own approximations. Here’s a 2\^x implementation (first attempt):

float exp2(float x)
{
static const float one = 1.0f;
static const float half = 0.5f;
static const float lim_two = 1.9999999f;

static const float C0[256] = {.547404266246f, .548061348142f, .548725333807f, .549398399167f, .550075922598f, .550760241024f, .551454706663f, .552154084359f, .552858883435f, .553574967496f, .554297163206f, .555026093302f, .555760495106f, .556505137841f, .557256046635f, .558017083881f, .558781729040f, .559558847539f, .560337005182f, .561128551344f, .561926919212f, .562731360892f, .563548838504f, .564371785305f, .565202403662f, .566041175842f, .566885148265f, .567741995681f, .568603271104f, .569474540778f, .570356611167f, .571248305001f, .572143490366f, .573051865152f, .573965673107f, .574892291723f, .575822287504f, .576764054387f, .577713794250f, .578675521802f, .579646563842f, .580622066161f, .581609491764f, .582606143121f, .583612970496f, .584629192183f, .585655505258f, .586690228227f, .587733473599f, .588788523879f, .589851795687f, .590925279595f, .592009141821f, .593102901511f, .594205792172f, .595321583429f, .596446634193f, .597577914584f, .598723732752f, .599879808291f, .601047527774f, .602222182051f, .603410536625f, .604607039548f, .605814947112f, .607033833104f, .608266709013f, .609506083140f, .610760886405f, .612025725384f, .613302240932f, .614585463030f, .615883610698f, .617191217751f, .618511752134f, .619845226500f, .621187546929f, .622542829136f, .623914626393f, .625290774906f, .626682429570f, .628088718711f, .629505305967f, .630928111994f, .632369622796f, .633820437496f, .635288816972f, .636766410268f, .638255823256f, .639759511520f, .641275658394f, .642806215397f, .644347352744f, .645904478891f, .647470860480f, .649053754624f, .650645737483f, .652253552238f, .653875847907f, .655506204107f, .657158529182f, .658824473497f, .660502429383f, .662188430285f, .663896943735f, .665614322456f, .667344320807f, .669091861220f, .670854384730f, .672633065213f, .674418265280f, .676229143542f, .678048816547f, .679881705149f, .681730321995f, .683593925186f, .685478115237f, .687372988183f, .689280701315f, .691208589068f, .693153490063f, .695110683979f, .697086170558f, .699075467598f, .701082480523f, .703098111035f, .705136926814f, .707188064752f, .709255034137f, .711344041865f, .713449959172f, .715567879355f, .717699596625f, .719857427125f, .722028307800f, .724216235913f, .726413096106f, .728639561732f, .730875455600f, .733132357639f, .735405868218f, .737702530438f, .740009520055f, .742332281356f, .744679668196f, .747037600146f, .749420315605f, .751820376930f, .754243487814f, .756683762441f, .759133959936f, .761605643019f, .764099729889f, .766614919527f, .769146093637f, .771693145114f, .774267141070f, .776858393243f, .779466882269f, .782096835996f, .784742415436f, .787410123430f, .790101570173f, .792815520645f, .795542431752f, .798294870351f, .801063242968f, .803849366342f, .806665638513f, .809494760355f, .812353037567f, .815226352727f, .818124218289f, .821037431411f, .823977286773f, .826945359586f, .829920082804f, .832926883189f, .835951538712f, .839006712543f, .842080344842f, .845177216775f, .848284765471f, .851423322415f, .854588979701f, .857776791512f, .860984910999f, .864215879261f, .867469536588f, .870748683758f, .874051542963f, .877378717421f, .880731452223f, .884101609574f, .887502615964f, .890925768871f, .894372845122f, .897843135408f, .901337846858f, .904862544340f, .908410015848f, .911984035138f, .915576759348f, .919196625760f, .922845709434f, .926529189028f, .930225124843f, .933945305924f, .937700927963f, .941479193749f, .945282181080f, .949117765062f, .952980179587f, .956873412255f, .960779020905f, .964718014705f, .968687508318f, .972685203315f, .976705892948f, .980760930721f, .984836290213f, .988951866922f, .993080243943f, .997245671924f, 1.00144630340f, 1.00567323430f, 1.00991602499f, 1.01419270811f, 1.01850230704f, 1.02285032212f, 1.02721871206f, 1.03161636643f, 1.03604653022f, 1.04050699395f, 1.04499850272f, 1.04951852054f, 1.05407459978f, 1.05865921545f, 1.06326856892f, 1.06791602645f, 1.07259384061f, 1.07729671165f, 1.08203878536f, 1.08680950686f, 1.09161248555f, 1.09645085758f, 1.10132609310f, 1.10623766041f, 1.11114750589f, 1.11612095277f, 1.12113714360f, 1.12616425808f, 1.13122641447f, 1.13631798693f, 1.14146805990f, 1.14663629983f};
static const float C1[256] = {.212044759469f, .210735702190f, .209418015443f, .208087475652f, .206753275519f, .205410853969f, .204053729978f, .202692211082f, .201325322173f, .199941790541f, .198551704706f, .197153917470f, .195750881257f, .194333564491f, .192909615777f, .191471793032f, .190032461265f, .188575010632f, .187120933209f, .185647204506f, .184166170344f, .182679245802f, .181173672353f, .179663467005f, .178144635485f, .176616349360f, .175084037800f, .173533838064f, .171981116491f, .170415875372f, .168836776605f, .167246022818f, .165654589240f, .164045289244f, .162431953647f, .160801620569f, .159170951952f, .157525268876f, .155871296686f, .154202124016f, .152522495456f, .150840826888f, .149144311266f, .147437665224f, .145719346014f, .143990758331f, .142250789144f, .140502354043f, .138745315445f, .136974218191f, .135195151140f, .133404847135f, .131603099825f, .129790785291f, .127969230244f, .126132298677f, .124286064322f, .122435531964f, .120567182576f, .118688092798f, .116796095743f, .114898862963f, .112985536212f, .111065131312f, .109132474142f, .107188321781f, .105227969151f, .103263383910f, .101280483950f, .99287883117e-1f, .97283061235e-1f, .95273867908e-1f, .93247492774e-1f, .91212558811e-1f, .89163726456e-1f, .87101072766e-1f, .85030989812e-1f, .82947185421e-1f, .80844315542e-1f, .78741073956e-1f, .76620455758e-1f, .74483906833e-1f, .72338082587e-1f, .70189189871e-1f, .68018435591e-1f, .65840082252e-1f, .63641802316e-1f, .61436186248e-1f, .59219384845e-1f, .56987821889e-1f, .54744275251e-1f, .52485937632e-1f, .50218520627e-1f, .47934149447e-1f, .45642766126e-1f, .43333822826e-1f, .41018216474e-1f, .38686201364e-1f, .36339829169e-1f, .33988431451e-1f, .31612040002e-1f, .29222787772e-1f, .26823040651e-1f, .24418495167e-1f, .21988609993e-1f, .19552890922e-1f, .17106039552e-1f, .14641181181e-1f, .12162020545e-1f, .9666998685e-2f, .7169663523e-2f, .4643309730e-2f, .2111596679e-2f, -.431598824e-3f, -.2989690662e-2f, -.5561577943e-2f, -.8154880822e-2f, -.10755886294e-1f, -.13367524684e-1f, -.15999743658e-1f, -.18648123903e-1f, -.21306166293e-1f, -.23981953789e-1f, -.26669330748e-1f, -.29373502333e-1f, -.32082162062e-1f, -.34814814297e-1f, -.37556805223e-1f, -.40312772251e-1f, -.43090887470e-1f, -.45884226576e-1f, -.48686220012e-1f, -.51499213346e-1f, -.54339341155e-1f, -.57189308948e-1f, -.6005430929e-1f, -.6292367512e-1f, -.6582431504e-1f, -.6872984049e-1f, -.7165524503e-1f, -.7459473297e-1f, -.7755666791e-1f, -.8052444140e-1f, -.8350502578e-1f, -.8650968268e-1f, -.8952031711e-1f, -.9255503332e-1f, -.9560425465e-1f, -.9867513068e-1f, -.10176011141f, -.10485000844f, -.10795935979f, -.11108920339f, -.11423780756f, -.11739868933f, -.12057166547f, -.12377042647f, -.12698282286f, -.13020878013f, -.13345344254f, -.13670954099f, -.13998500792f, -.14328171413f, -.14659803793f, -.14992226374f, -.15326963975f, -.15662841273f, -.16000074335f, -.16340152411f, -.16680978718f, -.17024508632f, -.17369037027f, -.17715697361f, -.18063381546f, -.18413430337f, -.18766017755f, -.19118577445f, -.19474116819f, -.19830943646f, -.20190541522f, -.20551481061f, -.20914316989f, -.21277573100f, -.21643619837f, -.22011987890f, -.22382091824f, -.22753710375f, -.23127131148f, -.23502327513f, -.23879613644f, -.24258775687f, -.24639874443f, -.25023043386f, -.25407345559f, -.25794304031f, -.26182918290f, -.26573382630f, -.26965609165f, -.27359726148f, -.27756351145f, -.28154663028f, -.28555077208f, -.28956708486f, -.29360493441f, -.29766652900f, -.30175749946f, -.30585341675f, -.30996731111f, -.31411145322f, -.31827160793f, -.32245000439f, -.32665519190f, -.33088073838f, -.33513090577f, -.33938551392f, -.34366739446f, -.34797328516f, -.35230059730f, -.35664361881f, -.36101452016f, -.36539810091f, -.36981565376f, -.37423767670f, -.37869008365f, -.38317075577f, -.38767009071f, -.39217694488f, -.39671041296f, -.40126933906f, -.40585941045f, -.41046149518f, -.41508490180f, -.41973294431f, -.42440320388f, -.42909637229f, -.43380970842f, -.43855098623f, -.44331228018f, -.44808957363f, -.45289663497f, -.45772533171f, -.46257013252f, -.46744551144f, -.47234051123f, -.47725875550f, -.48220335041f, -.48717569030f, -.49217510643f, -.49716286338f, -.50220524890f, -.50728090192f, -.51235755815f, -.51745955485f, -.52258113148f, -.52775139108f, -.53292971984f};
static const float C2[256] = {.240550974386f, .241202959893f, .241856701004f, .242514265929f, .243171103188f, .243829457574f, .244492480331f, .245155117376f, .245817854428f, .246486128053f, .247155039241f, .247825134276f, .248495241281f, .249169654431f, .249844713280f, .250523830778f, .251201163433f, .251884509323f, .252563787406f, .253249747676f, .253936606109f, .254623711603f, .255316926931f, .256009778763f, .256704095772f, .257400249962f, .258095764201f, .258796915826f, .259496734037f, .260199725451f, .260906458608f, .261615922477f, .262323223326f, .263035992751f, .263748082312f, .264465202209f, .265180012111f, .265898946252f, .266619044658f, .267343298130f, .268069619098f, .268794376437f, .269523080941f, .270253687830f, .270986837738f, .271721917527f, .272459385827f, .273197995608f, .273937798870f, .274681077710f, .275425261899f, .276171707452f, .276920486684f, .277671219387f, .278423348235f, .279179385328f, .279936813823f, .280693582599f, .281455206542f, .282218776699f, .282985153731f, .283751228711f, .284521373431f, .285291943163f, .286065008210f, .286840250780f, .287619522875f, .288398060167f, .289181428600f, .289966204157f, .290753368791f, .291539837984f, .292330618012f, .293122323170f, .293917023197f, .294714665277f, .295512769094f, .296313753856f, .297119642359f, .297923266738f, .298731122037f, .299542626941f, .300355242358f, .301166621541f, .301983850061f, .302801532724f, .303624282505f, .304447368111f, .305272224846f, .306100168085f, .306930150725f, .307763195441f, .308597186775f, .309435004672f, .310272993070f, .311114998221f, .311957035187f, .312802638726f, .313651045018f, .314498876947f, .315353314930f, .316209965286f, .317067971632f, .317925302128f, .318789261216f, .319652892561f, .320518078729f, .321387232845f, .322259028085f, .323133993420f, .324007380664f, .324888508986f, .325769102908f, .326651295684f, .327536259333f, .328423600216f, .329315922731f, .330208493193f, .331102319414f, .332000786889f, .332902364781f, .333804829061f, .334710914663f, .335618521194f, .336529395375f, .337439388370f, .338355041149f, .339271424585f, .340190082999f, .341113718319f, .342040006406f, .342966760909f, .343894760691f, .344829302014f, .345764673269f, .346702572919f, .347639507987f, .348584246701f, .349528173351f, .350476153452f, .351426290935f, .352381270258f, .353335726296f, .354291903376f, .355253394261f, .356214397492f, .357180679851f, .358149170924f, .359122122768f, .360097126490f, .361071279072f, .362049162020f, .363031076715f, .364016460615f, .365003272896f, .365991452648f, .366985245318f, .367980853922f, .368978251405f, .369979013823f, .370980890602f, .371986312192f, .372995831811f, .374008931067f, .375022025929f, .376039753227f, .377058524791f, .378078994230f, .379105644931f, .380132134704f, .381164337027f, .382197114688f, .383233855329f, .384271234863f, .385313243354f, .386360369799f, .387404990922f, .388456010655f, .389508406520f, .390566534906f, .391626171784f, .392688937114f, .393750505484f, .394817797166f, .395889415356f, .396963639187f, .398039817634f, .399118775070f, .400200421917f, .401285649507f, .402373826907f, .403465115321f, .404559881707f, .405655440162f, .406756120792f, .407859059092f, .408964796284f, .410073072953f, .411184239170f, .412300018644f, .413418085218f, .414539592392f, .415662053303f, .416788077635f, .417918262200f, .419054147477f, .420188943978f, .421326262837f, .422469476617f, .423614637508f, .424762353579f, .425914955357f, .427070660204f, .428230616859f, .429389315399f, .430552969348f, .431720668445f, .432891695257f, .434064493914f, .435242336232f, .436421114570f, .437606536221f, .438790675206f, .439980464677f, .441175310294f, .442372633912f, .443569472135f, .444770890044f, .445976559606f, .447187960487f, .448400030952f, .449615218011f, .450834377285f, .452056858160f, .453282828803f, .454511559547f, .455745060764f, .456981256106f, .458219094270f, .459462130949f, .460708242561f, .461955996149f, .463209104131f, .464464733125f, .465723802255f, .466987089818f, .468254934518f, .469527143677f, .470793869973f, .472071940011f, .473355895138f, .474637566791f, .475923104541f, .477211044044f, .478508675487f, .479805784911f};

#define EXP2_CHECK_FOR_LIMITS 1

__asm
{
movss xmm0, x

movss xmm1, xmm0
subss xmm1, half
cvtss2si ecx, xmm1
cvtsi2ss xmm2, ecx

#if EXP2_CHECK_FOR_LIMITS
test ecx, 0xFFFFFF00
jnz overflow
#endif

shl ecx, 23
movd xmm7, ecx

subss xmm0, xmm2
minss xmm0, lim_two
movd eax, xmm0

and eax, 0x007F8000
shr eax, 13
movss xmm1, C2[eax]
mulss xmm1, xmm0
mulss xmm1, xmm0

mulss xmm1, xmm7

movss x, xmm1

#if EXP2_CHECK_FOR_LIMITS
jmp end
overflow:
comiss xmm0, one
jp not_a_number
jb underflow
mov x, 0x7F800000
jmp end
underflow:
mov x, 0x00000000
jmp end
not_a_number:
movss x, xmm0
end:
#endif
}

return x;
}


It’s accurate to 2 ulps. Basically it uses the 8 upper mantissa bits to lookup the coefficients of a quadratic polynomial, determined with the minimax algorithm. It handles all inputs (denormals, infinity, NaN), but flushes denormal results to 0.0.

All ideas to make it more accurate and/or faster are welcome!

0
165 Jan 18, 2007 at 22:46

What are ulps? (Couldn’t find the term with google…)

0
157 Jan 18, 2007 at 23:43

Nick: Is this what you’re looking for? ;)

btw, ulps stands for Units in the Last Place.

0
102 Jan 19, 2007 at 00:11

@Dia

Is this what you’re looking for? :)

Thanks! :w00t:

0
102 Jan 20, 2007 at 09:48

The accuracy/performance balance of the Intel library failed to impress me. At least for the exp2 implementation…

Here’s my own vector implementation:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

static const __declspec(align(16)) int C5[4] = {0x3AF61905, 0x3AF61905, 0x3AF61905, 0x3AF61905};   // 1.8775767e-3f
static const __declspec(align(16)) int C4[4] = {0x3C134806, 0x3C134806, 0x3C134806, 0x3C134806};   // 8.9893397e-3f
static const __declspec(align(16)) int C3[4] = {0x3D64AA23, 0x3D64AA23, 0x3D64AA23, 0x3D64AA23};   // 5.5826318e-2f
static const __declspec(align(16)) int C1[4] = {0x3F31727B, 0x3F31727B, 0x3F31727B, 0x3F31727B};   // 6.9315308e-1f
static const __declspec(align(16)) int C0[4] = {0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF};   // 9.9999994e-1f

__asm
{
movaps xmm0, x

minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C5
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm2

movaps x, xmm1
}

return x;
}


Unlike the previous version it requires SSE2. It doesn’t use any lookup tables, but instead uses a polynomial of degree five. This creates a fairly long dependency chain, wich means a high latency. But at 21 instructions this is compensated with a high throughput. In English this means it’s very suited for computing lots of 2x values in a loop.

It has a relative accuracy of 2-22.33. It returns infinity on overflow, 0.0 on underflow, and keeps it accracy with denormals. For NaN inputs it returns infinity.

0
102 Jan 20, 2007 at 14:29

For those interested:

Here’s a version with a weighted minimax polynomial of order four, yielding 2-18.54 relative precision:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

static const __declspec(align(16)) int C4[4] = {0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69};   // 1.3534167e-2f
static const __declspec(align(16)) int C3[4] = {0x3D5509F9, 0x3D5509F9, 0x3D5509F9, 0x3D5509F9};   // 5.2011464e-2f
static const __declspec(align(16)) int C2[4] = {0x3E773CC5, 0x3E773CC5, 0x3E773CC5, 0x3E773CC5};   // 2.4144275e-1f
static const __declspec(align(16)) int C1[4] = {0x3F3168B3, 0x3F3168B3, 0x3F3168B3, 0x3F3168B3};   // 6.9300383e-1f
static const __declspec(align(16)) int C0[4] = {0x3F800016, 0x3F800016, 0x3F800016, 0x3F800016};   // 1.0000026f

__asm
{
movaps xmm0, x

minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C4
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm2

movaps x, xmm1
}

return x;
}


Here’s a version with a weighted minimax polynomial of order three, yielding 2-13.70 relative precision:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

static const __declspec(align(16)) int C3[4] = {0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52};   // 7.8024521e-2f
static const __declspec(align(16)) int C2[4] = {0x3E677E26, 0x3E677E26, 0x3E677E26, 0x3E677E26};   // 2.2606716e-1f
static const __declspec(align(16)) int C1[4] = {0x3F322226, 0x3F322226, 0x3F322226, 0x3F322226};   // 6.9583356e-1f
static const __declspec(align(16)) int C0[4] = {0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19};   // 9.9992520e-1f

__asm
{
movaps xmm0, x

minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C3
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm2

movaps x, xmm1
}

return x;
}


Here’s a version with a weighted minimax polynomial of order two, yielding 2-9.17 relative precision:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

static const __declspec(align(16)) int C2[4] = {0x3EACA418, 0x3EACA418, 0x3EACA418, 0x3EACA418};   // 3.3718944e-1f
static const __declspec(align(16)) int C0[4] = {0x3F803884, 0x3F803884, 0x3F803884, 0x3F803884};   // 1.0017247f

__asm
{
movaps xmm0, x

minps xmm0, lim1
maxps xmm0, lim2
movaps xmm1, xmm0
subps xmm1, half
cvtps2dq xmm2, xmm1
cvtdq2ps xmm1, xmm2
pslld xmm2, 23
subps xmm0, xmm1
movaps xmm1, C2
mulps xmm1, xmm0
mulps xmm1, xmm0
mulps xmm1, xmm2

movaps x, xmm1
}

return x;
}


Every floating-point number has been tested, using an exhaustive loop. For the record, Intel’s library claims an average relative error of 0.0018, but it peaks at a terrible 0.29 relative error.

0
101 Jan 20, 2007 at 15:56

Hehe, looks cool. If i would work currently in C++, I’d sure plug this into my code directly and look how it affects the image, depending on the quality of the function :)

But I don’t think I’ll touch C++ in the sooner time at all anymore… And I don’t have my code lying around, else, it would be a simple copy-paste.

But I’ll remember :)

0
102 Jan 20, 2007 at 18:09

@davepermen

But I don’t think I’ll touch C++ in the sooner time at all anymore…

Totally fallen in love with C#? ;)

0
102 Jan 25, 2007 at 16:25

A log2 implementation (first attempt):

__m128 log2(__m128 x)
{
static const __declspec(align(16)) int exp[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000};
static const __declspec(align(16)) int one[4] = {0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000};   // 1.0f
static const __declspec(align(16)) int off[4] = {0x3FBF8000, 0x3FBF8000, 0x3FBF8000, 0x3FBF8000};   // 1.4960938f
static const __declspec(align(16)) int mant[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF};
static const __declspec(align(16)) int sign[4] = {0x80000000, 0x80000000, 0x80000000, 0x80000000};
static const __declspec(align(16)) int base[4] = {0x43800000, 0x43800000, 0x43800000, 0x43800000};   // 256.0f

static const __declspec(align(16)) int C5[4] = {0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5};   // -3.4436006e-2f
static const __declspec(align(16)) int C4[4] = {0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD};   //  3.1821337e-1f
static const __declspec(align(16)) int C3[4] = {0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9};   // -1.2315303f
static const __declspec(align(16)) int C2[4] = {0x4026537B, 0x4026537B, 0x4026537B, 0x4026537B};   //  2.5988452f
static const __declspec(align(16)) int C0[4] = {0x4047691A, 0x4047691A, 0x4047691A, 0x4047691A};   //  3.1157899f

__asm
{
movaps xmm0, x

movaps xmm1, xmm0
andps xmm1, exp
psrld xmm1, 8
orps xmm1, one
subps xmm1, off
mulps xmm1, base
andps xmm0, mant
orps xmm0, one
movaps xmm2, C5
mulps xmm2, xmm0
mulps xmm2, xmm0
mulps xmm2, xmm0
mulps xmm2, xmm0
mulps xmm2, xmm0
subps xmm0, one
mulps xmm0, xmm2

movaps x, xmm1
}

return x;
}


Unfortunately logarithms don’t converge very fast. It’s also not that easy to get good behaviour for the limits. :sad:

0
101 Jan 25, 2007 at 16:41

Any benchmarks to go with that? Not that I’m doubting it will be faster, but it’s always nice to know how much faster it is. :)

0
102 Jan 25, 2007 at 20:33

@Kenneth Gorking

Any benchmarks to go with that? Not that I’m doubting it will be faster, but it’s always nice to know how much faster it is. :)

I’m not so sure it’s faster. The dependency chains are quite bad, while the Intel library appears to use properly scheduled instructions.

Anyway, I’m considering making it an article for the ‘Daily Code Gem’, including benchmark results. I can’t promise when I have the time for it though…

0
101 Jan 26, 2007 at 00:23

@Nick

Totally fallen in love with C#? ;)

Yes. You and you’re work are the only reason to regret this… But sorry, thats not enough… :)

0
103 Jan 26, 2007 at 03:40

There’s one thing I don’t understand.
I’ll quote from “AMaths.pdf”, page 5:

Packed vs. Scalar AM Library Functions
Compared to the packed AM Library functions, the scalar ones are slower per operand but
faster per call. Therefore, when just one result is needed, it is faster to call the scalar version.
Otherwise (for two or more results), the packed version should be used.

What does it mean, exactly?

Regards,
Ciao ciao : )

0
102 Jan 26, 2007 at 06:27

@davepermen

Yes. You and you’re work are the only reason to regret this… But sorry, thats not enough… :)

Yeah, C# is a very nice language. I’m actually considering writing my high-level code in it, and leave the performance critical parts in C++/assembly. I’m only afraid it could become messier than just sticking with C++. For the things I work on, high-level code can become low-level code the next day. :lol:

A native C# compiler with inline assembly would convince me in a second though…

0
102 Jan 26, 2007 at 06:47

@Nautilus

What does it mean, exactly?

Take for example am_sin_ss and am_sin_ps. The first computes one sine (a scalar), while the second computes four sines (a vector). The scalar version takes 48 clock cycles, and the vector version 66 clock cycles. So the scalar version is faster, but it only computes one sine! If you need only one sine (for example for constructing a rotation matrix), then the scalar version is optimal. But if you need to compute sines for a whole array of angles (for example for a Fourier series), then the vector version is more efficient. Computing four sines separately with the scalar version is slower than computing four in parallel with the vector version.

0
103 Jan 26, 2007 at 13:42

Thank you very much, Nick :)

0
101 Dec 16, 2007 at 01:13

Hi Nick,

I’m a novice at SSE programming – I’m teaching myself SSE by using a mix of SSE and FP to implement a linear algebra algorithm. I’m using FP because the algorithm involves the logarithm function. I’ve determined that this is the main bottleneck (according to gprof, I spend 86% of the time in the logarithm function), so I’m looking for ways to implement the logarithm function in pure SSE.

I’m reluctant to reuse your log2 code verbatim; I think that by redesigning the algorithm with my particular application in mind (for example, my required interval of convergence, precision, etc.), I can better optimize the overall performance of my application.

Do you have any pointers on how to design a logarithm approximator?

From what I can see in your code, you’re extracting the exponent field and approximating the logarithm of the mantissa using a 5th-order polynomial.

How did you arrive at the coefficients? I would either use 5 terms of the Taylor series expansion for log, or pick 5 points (perhaps evenly-spaced) on the interval of interest and obtain a 5th-order polynomial that passes through those points (and perhaps visibly verify that the error between those points is acceptable).

Anyway, I’m not very experienced in this domain, which is why I’m hoping you could provide some pointers.

Thanks,

Catalin Patulea

0
101 Jan 20, 2011 at 14:05

I am reactivating that old post.

I was wondering if anyone could explain to me how the values C0,C1 etc. are determined.

I want to use the same system that is is presented here for 2\^x for the logistic function 1/(1+e\^-x). i used the code shown by nick by adding a multiplication by 1/ln2 and add the addps xmm1,one; rcpps xmm1, xmm1

it works but if we could skip these it would gain in speed and in precision.

0
102 Jan 20, 2011 at 14:28

@XavierDC

I was wondering if anyone could explain to me how the values C0,C1 etc. are determined.

The coefficients of the polynomial are determined using the minimax algorithm. I believe I used Maple’s implementation of it.

0
101 Jan 21, 2011 at 00:05

@Nick

The coefficients of the polynomial are determined using the minimax algorithm. I believe I used Maple’s implementation of it.

Thanks Nick.

I’ve sent you a private message in this forum. If you can get back to me by email i would appreciate. thank you.