Approximate Math Library

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 16, 2007 at 15:18

Hi all,

I’m looking for the Intel Approximate Math Library. It used to be at www.intel.com/design/Pentium4/devtools/ but that page moved and I couldn’t locate the library.

So, does anyone know where to find it, or can send me a copy? That would be much appreciated. :worthy: I’m also interested in other approximation libraries/functions. The more options between precision and performance the better…

Thanks a lot,

Nick

24 Replies

Please log in or register to post a reply.

46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jan 17, 2007 at 16:21

I have never heard of this library, but AMD has a library of their. I’m sure you know this already, but I’ll post it anyway just in case: http://developer.amd.com/acml.jsp

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 17, 2007 at 17:22

AMD’s library doesn’t really have what I’m looking for. I seek fast implementations of transcendental functions (exp, log, sin, etc). Thanks anyway, it was worth checking out.

So the Intel Approximate Math Library contains approximations of transcendental functions, optimized for SSE. For some time I have used my own implementations (sin/cos) but now I’m looking for more precision with minimal performance impact…

46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jan 17, 2007 at 18:17

The CRT has a few functions implemented using SSE2 (see here), but they are most likely built for precision. The CRT source does however come with VS2005, so you could dig up those files and enhance them for speed :)

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 18, 2007 at 10:29

Interesting! It seems pretty hard to derive a faster (single-precision) version from it, but it’s always good to have a reference.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 18, 2007 at 17:46

Since the Intel library is still nowhere to be found, I started writing my own approximations. Here’s a 2\^x implementation (first attempt):

float exp2(float x)
{
    static const float one = 1.0f;
    static const float half = 0.5f;
    static const float lim_two = 1.9999999f;

    static const float C0[256] = {.547404266246f, .548061348142f, .548725333807f, .549398399167f, .550075922598f, .550760241024f, .551454706663f, .552154084359f, .552858883435f, .553574967496f, .554297163206f, .555026093302f, .555760495106f, .556505137841f, .557256046635f, .558017083881f, .558781729040f, .559558847539f, .560337005182f, .561128551344f, .561926919212f, .562731360892f, .563548838504f, .564371785305f, .565202403662f, .566041175842f, .566885148265f, .567741995681f, .568603271104f, .569474540778f, .570356611167f, .571248305001f, .572143490366f, .573051865152f, .573965673107f, .574892291723f, .575822287504f, .576764054387f, .577713794250f, .578675521802f, .579646563842f, .580622066161f, .581609491764f, .582606143121f, .583612970496f, .584629192183f, .585655505258f, .586690228227f, .587733473599f, .588788523879f, .589851795687f, .590925279595f, .592009141821f, .593102901511f, .594205792172f, .595321583429f, .596446634193f, .597577914584f, .598723732752f, .599879808291f, .601047527774f, .602222182051f, .603410536625f, .604607039548f, .605814947112f, .607033833104f, .608266709013f, .609506083140f, .610760886405f, .612025725384f, .613302240932f, .614585463030f, .615883610698f, .617191217751f, .618511752134f, .619845226500f, .621187546929f, .622542829136f, .623914626393f, .625290774906f, .626682429570f, .628088718711f, .629505305967f, .630928111994f, .632369622796f, .633820437496f, .635288816972f, .636766410268f, .638255823256f, .639759511520f, .641275658394f, .642806215397f, .644347352744f, .645904478891f, .647470860480f, .649053754624f, .650645737483f, .652253552238f, .653875847907f, .655506204107f, .657158529182f, .658824473497f, .660502429383f, .662188430285f, .663896943735f, .665614322456f, .667344320807f, .669091861220f, .670854384730f, .672633065213f, .674418265280f, .676229143542f, .678048816547f, .679881705149f, .681730321995f, .683593925186f, .685478115237f, .687372988183f, .689280701315f, .691208589068f, .693153490063f, .695110683979f, .697086170558f, .699075467598f, .701082480523f, .703098111035f, .705136926814f, .707188064752f, .709255034137f, .711344041865f, .713449959172f, .715567879355f, .717699596625f, .719857427125f, .722028307800f, .724216235913f, .726413096106f, .728639561732f, .730875455600f, .733132357639f, .735405868218f, .737702530438f, .740009520055f, .742332281356f, .744679668196f, .747037600146f, .749420315605f, .751820376930f, .754243487814f, .756683762441f, .759133959936f, .761605643019f, .764099729889f, .766614919527f, .769146093637f, .771693145114f, .774267141070f, .776858393243f, .779466882269f, .782096835996f, .784742415436f, .787410123430f, .790101570173f, .792815520645f, .795542431752f, .798294870351f, .801063242968f, .803849366342f, .806665638513f, .809494760355f, .812353037567f, .815226352727f, .818124218289f, .821037431411f, .823977286773f, .826945359586f, .829920082804f, .832926883189f, .835951538712f, .839006712543f, .842080344842f, .845177216775f, .848284765471f, .851423322415f, .854588979701f, .857776791512f, .860984910999f, .864215879261f, .867469536588f, .870748683758f, .874051542963f, .877378717421f, .880731452223f, .884101609574f, .887502615964f, .890925768871f, .894372845122f, .897843135408f, .901337846858f, .904862544340f, .908410015848f, .911984035138f, .915576759348f, .919196625760f, .922845709434f, .926529189028f, .930225124843f, .933945305924f, .937700927963f, .941479193749f, .945282181080f, .949117765062f, .952980179587f, .956873412255f, .960779020905f, .964718014705f, .968687508318f, .972685203315f, .976705892948f, .980760930721f, .984836290213f, .988951866922f, .993080243943f, .997245671924f, 1.00144630340f, 1.00567323430f, 1.00991602499f, 1.01419270811f, 1.01850230704f, 1.02285032212f, 1.02721871206f, 1.03161636643f, 1.03604653022f, 1.04050699395f, 1.04499850272f, 1.04951852054f, 1.05407459978f, 1.05865921545f, 1.06326856892f, 1.06791602645f, 1.07259384061f, 1.07729671165f, 1.08203878536f, 1.08680950686f, 1.09161248555f, 1.09645085758f, 1.10132609310f, 1.10623766041f, 1.11114750589f, 1.11612095277f, 1.12113714360f, 1.12616425808f, 1.13122641447f, 1.13631798693f, 1.14146805990f, 1.14663629983f};
    static const float C1[256] = {.212044759469f, .210735702190f, .209418015443f, .208087475652f, .206753275519f, .205410853969f, .204053729978f, .202692211082f, .201325322173f, .199941790541f, .198551704706f, .197153917470f, .195750881257f, .194333564491f, .192909615777f, .191471793032f, .190032461265f, .188575010632f, .187120933209f, .185647204506f, .184166170344f, .182679245802f, .181173672353f, .179663467005f, .178144635485f, .176616349360f, .175084037800f, .173533838064f, .171981116491f, .170415875372f, .168836776605f, .167246022818f, .165654589240f, .164045289244f, .162431953647f, .160801620569f, .159170951952f, .157525268876f, .155871296686f, .154202124016f, .152522495456f, .150840826888f, .149144311266f, .147437665224f, .145719346014f, .143990758331f, .142250789144f, .140502354043f, .138745315445f, .136974218191f, .135195151140f, .133404847135f, .131603099825f, .129790785291f, .127969230244f, .126132298677f, .124286064322f, .122435531964f, .120567182576f, .118688092798f, .116796095743f, .114898862963f, .112985536212f, .111065131312f, .109132474142f, .107188321781f, .105227969151f, .103263383910f, .101280483950f, .99287883117e-1f, .97283061235e-1f, .95273867908e-1f, .93247492774e-1f, .91212558811e-1f, .89163726456e-1f, .87101072766e-1f, .85030989812e-1f, .82947185421e-1f, .80844315542e-1f, .78741073956e-1f, .76620455758e-1f, .74483906833e-1f, .72338082587e-1f, .70189189871e-1f, .68018435591e-1f, .65840082252e-1f, .63641802316e-1f, .61436186248e-1f, .59219384845e-1f, .56987821889e-1f, .54744275251e-1f, .52485937632e-1f, .50218520627e-1f, .47934149447e-1f, .45642766126e-1f, .43333822826e-1f, .41018216474e-1f, .38686201364e-1f, .36339829169e-1f, .33988431451e-1f, .31612040002e-1f, .29222787772e-1f, .26823040651e-1f, .24418495167e-1f, .21988609993e-1f, .19552890922e-1f, .17106039552e-1f, .14641181181e-1f, .12162020545e-1f, .9666998685e-2f, .7169663523e-2f, .4643309730e-2f, .2111596679e-2f, -.431598824e-3f, -.2989690662e-2f, -.5561577943e-2f, -.8154880822e-2f, -.10755886294e-1f, -.13367524684e-1f, -.15999743658e-1f, -.18648123903e-1f, -.21306166293e-1f, -.23981953789e-1f, -.26669330748e-1f, -.29373502333e-1f, -.32082162062e-1f, -.34814814297e-1f, -.37556805223e-1f, -.40312772251e-1f, -.43090887470e-1f, -.45884226576e-1f, -.48686220012e-1f, -.51499213346e-1f, -.54339341155e-1f, -.57189308948e-1f, -.6005430929e-1f, -.6292367512e-1f, -.6582431504e-1f, -.6872984049e-1f, -.7165524503e-1f, -.7459473297e-1f, -.7755666791e-1f, -.8052444140e-1f, -.8350502578e-1f, -.8650968268e-1f, -.8952031711e-1f, -.9255503332e-1f, -.9560425465e-1f, -.9867513068e-1f, -.10176011141f, -.10485000844f, -.10795935979f, -.11108920339f, -.11423780756f, -.11739868933f, -.12057166547f, -.12377042647f, -.12698282286f, -.13020878013f, -.13345344254f, -.13670954099f, -.13998500792f, -.14328171413f, -.14659803793f, -.14992226374f, -.15326963975f, -.15662841273f, -.16000074335f, -.16340152411f, -.16680978718f, -.17024508632f, -.17369037027f, -.17715697361f, -.18063381546f, -.18413430337f, -.18766017755f, -.19118577445f, -.19474116819f, -.19830943646f, -.20190541522f, -.20551481061f, -.20914316989f, -.21277573100f, -.21643619837f, -.22011987890f, -.22382091824f, -.22753710375f, -.23127131148f, -.23502327513f, -.23879613644f, -.24258775687f, -.24639874443f, -.25023043386f, -.25407345559f, -.25794304031f, -.26182918290f, -.26573382630f, -.26965609165f, -.27359726148f, -.27756351145f, -.28154663028f, -.28555077208f, -.28956708486f, -.29360493441f, -.29766652900f, -.30175749946f, -.30585341675f, -.30996731111f, -.31411145322f, -.31827160793f, -.32245000439f, -.32665519190f, -.33088073838f, -.33513090577f, -.33938551392f, -.34366739446f, -.34797328516f, -.35230059730f, -.35664361881f, -.36101452016f, -.36539810091f, -.36981565376f, -.37423767670f, -.37869008365f, -.38317075577f, -.38767009071f, -.39217694488f, -.39671041296f, -.40126933906f, -.40585941045f, -.41046149518f, -.41508490180f, -.41973294431f, -.42440320388f, -.42909637229f, -.43380970842f, -.43855098623f, -.44331228018f, -.44808957363f, -.45289663497f, -.45772533171f, -.46257013252f, -.46744551144f, -.47234051123f, -.47725875550f, -.48220335041f, -.48717569030f, -.49217510643f, -.49716286338f, -.50220524890f, -.50728090192f, -.51235755815f, -.51745955485f, -.52258113148f, -.52775139108f, -.53292971984f};
    static const float C2[256] = {.240550974386f, .241202959893f, .241856701004f, .242514265929f, .243171103188f, .243829457574f, .244492480331f, .245155117376f, .245817854428f, .246486128053f, .247155039241f, .247825134276f, .248495241281f, .249169654431f, .249844713280f, .250523830778f, .251201163433f, .251884509323f, .252563787406f, .253249747676f, .253936606109f, .254623711603f, .255316926931f, .256009778763f, .256704095772f, .257400249962f, .258095764201f, .258796915826f, .259496734037f, .260199725451f, .260906458608f, .261615922477f, .262323223326f, .263035992751f, .263748082312f, .264465202209f, .265180012111f, .265898946252f, .266619044658f, .267343298130f, .268069619098f, .268794376437f, .269523080941f, .270253687830f, .270986837738f, .271721917527f, .272459385827f, .273197995608f, .273937798870f, .274681077710f, .275425261899f, .276171707452f, .276920486684f, .277671219387f, .278423348235f, .279179385328f, .279936813823f, .280693582599f, .281455206542f, .282218776699f, .282985153731f, .283751228711f, .284521373431f, .285291943163f, .286065008210f, .286840250780f, .287619522875f, .288398060167f, .289181428600f, .289966204157f, .290753368791f, .291539837984f, .292330618012f, .293122323170f, .293917023197f, .294714665277f, .295512769094f, .296313753856f, .297119642359f, .297923266738f, .298731122037f, .299542626941f, .300355242358f, .301166621541f, .301983850061f, .302801532724f, .303624282505f, .304447368111f, .305272224846f, .306100168085f, .306930150725f, .307763195441f, .308597186775f, .309435004672f, .310272993070f, .311114998221f, .311957035187f, .312802638726f, .313651045018f, .314498876947f, .315353314930f, .316209965286f, .317067971632f, .317925302128f, .318789261216f, .319652892561f, .320518078729f, .321387232845f, .322259028085f, .323133993420f, .324007380664f, .324888508986f, .325769102908f, .326651295684f, .327536259333f, .328423600216f, .329315922731f, .330208493193f, .331102319414f, .332000786889f, .332902364781f, .333804829061f, .334710914663f, .335618521194f, .336529395375f, .337439388370f, .338355041149f, .339271424585f, .340190082999f, .341113718319f, .342040006406f, .342966760909f, .343894760691f, .344829302014f, .345764673269f, .346702572919f, .347639507987f, .348584246701f, .349528173351f, .350476153452f, .351426290935f, .352381270258f, .353335726296f, .354291903376f, .355253394261f, .356214397492f, .357180679851f, .358149170924f, .359122122768f, .360097126490f, .361071279072f, .362049162020f, .363031076715f, .364016460615f, .365003272896f, .365991452648f, .366985245318f, .367980853922f, .368978251405f, .369979013823f, .370980890602f, .371986312192f, .372995831811f, .374008931067f, .375022025929f, .376039753227f, .377058524791f, .378078994230f, .379105644931f, .380132134704f, .381164337027f, .382197114688f, .383233855329f, .384271234863f, .385313243354f, .386360369799f, .387404990922f, .388456010655f, .389508406520f, .390566534906f, .391626171784f, .392688937114f, .393750505484f, .394817797166f, .395889415356f, .396963639187f, .398039817634f, .399118775070f, .400200421917f, .401285649507f, .402373826907f, .403465115321f, .404559881707f, .405655440162f, .406756120792f, .407859059092f, .408964796284f, .410073072953f, .411184239170f, .412300018644f, .413418085218f, .414539592392f, .415662053303f, .416788077635f, .417918262200f, .419054147477f, .420188943978f, .421326262837f, .422469476617f, .423614637508f, .424762353579f, .425914955357f, .427070660204f, .428230616859f, .429389315399f, .430552969348f, .431720668445f, .432891695257f, .434064493914f, .435242336232f, .436421114570f, .437606536221f, .438790675206f, .439980464677f, .441175310294f, .442372633912f, .443569472135f, .444770890044f, .445976559606f, .447187960487f, .448400030952f, .449615218011f, .450834377285f, .452056858160f, .453282828803f, .454511559547f, .455745060764f, .456981256106f, .458219094270f, .459462130949f, .460708242561f, .461955996149f, .463209104131f, .464464733125f, .465723802255f, .466987089818f, .468254934518f, .469527143677f, .470793869973f, .472071940011f, .473355895138f, .474637566791f, .475923104541f, .477211044044f, .478508675487f, .479805784911f};

    #define EXP2_CHECK_FOR_LIMITS 1

    __asm
    {
        movss xmm0, x

        movss xmm1, xmm0
        subss xmm1, half
        cvtss2si ecx, xmm1
        cvtsi2ss xmm2, ecx
        add ecx, 127

    #if EXP2_CHECK_FOR_LIMITS
        test ecx, 0xFFFFFF00
        jnz overflow
    #endif

        shl ecx, 23
        movd xmm7, ecx

        subss xmm0, xmm2
        addss xmm0, one
        minss xmm0, lim_two
        movd eax, xmm0

        and eax, 0x007F8000
        shr eax, 13
        movss xmm1, C2[eax]
        mulss xmm1, xmm0
        addss xmm1, C1[eax]
        mulss xmm1, xmm0
        addss xmm1, C0[eax]
        
        mulss xmm1, xmm7
        
        movss x, xmm1

    #if EXP2_CHECK_FOR_LIMITS
        jmp end
    overflow:
        comiss xmm0, one
        jp not_a_number
        jb underflow
        mov x, 0x7F800000
        jmp end
    underflow:
        mov x, 0x00000000
        jmp end
    not_a_number:
        movss x, xmm0
    end:
    #endif
    }

    return x;
}

It’s accurate to 2 ulps. Basically it uses the 8 upper mantissa bits to lookup the coefficients of a quadratic polynomial, determined with the minimax algorithm. It handles all inputs (denormals, infinity, NaN), but flushes denormal results to 0.0.

All ideas to make it more accurate and/or faster are welcome!

A8433b04cb41dd57113740b779f61acb
0
Reedbeta 167 Jan 18, 2007 at 22:46

What are ulps? (Couldn’t find the term with google…)

Fdbdc4176840d77fe6a8deca457595ab
0
dk 158 Jan 18, 2007 at 23:43

Nick: Is this what you’re looking for? ;)

btw, ulps stands for Units in the Last Place.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 19, 2007 at 00:11

@Dia

Is this what you’re looking for? :)

Thanks! :w00t:

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 20, 2007 at 09:48

The accuracy/performance balance of the Intel library failed to impress me. At least for the exp2 implementation…

Here’s my own vector implementation:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

    static const __declspec(align(16)) int C5[4] = {0x3AF61905, 0x3AF61905, 0x3AF61905, 0x3AF61905};   // 1.8775767e-3f
    static const __declspec(align(16)) int C4[4] = {0x3C134806, 0x3C134806, 0x3C134806, 0x3C134806};   // 8.9893397e-3f
    static const __declspec(align(16)) int C3[4] = {0x3D64AA23, 0x3D64AA23, 0x3D64AA23, 0x3D64AA23};   // 5.5826318e-2f
    static const __declspec(align(16)) int C2[4] = {0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4};   // 2.4015361e-1f
    static const __declspec(align(16)) int C1[4] = {0x3F31727B, 0x3F31727B, 0x3F31727B, 0x3F31727B};   // 6.9315308e-1f
    static const __declspec(align(16)) int C0[4] = {0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF};   // 9.9999994e-1f

    __asm
    {
        movaps xmm0, x

        minps xmm0, lim1
        maxps xmm0, lim2
        movaps xmm1, xmm0
        subps xmm1, half
        cvtps2dq xmm2, xmm1
        cvtdq2ps xmm1, xmm2
        paddd xmm2, base
        pslld xmm2, 23
        subps xmm0, xmm1
        movaps xmm1, C5
        mulps xmm1, xmm0
        addps xmm1, C4
        mulps xmm1, xmm0
        addps xmm1, C3
        mulps xmm1, xmm0
        addps xmm1, C2
        mulps xmm1, xmm0
        addps xmm1, C1
        mulps xmm1, xmm0
        addps xmm1, C0
        mulps xmm1, xmm2
        
        movaps x, xmm1
    }

    return x;
}

Unlike the previous version it requires SSE2. It doesn’t use any lookup tables, but instead uses a polynomial of degree five. This creates a fairly long dependency chain, wich means a high latency. But at 21 instructions this is compensated with a high throughput. In English this means it’s very suited for computing lots of 2x values in a loop.

It has a relative accuracy of 2-22.33. It returns infinity on overflow, 0.0 on underflow, and keeps it accracy with denormals. For NaN inputs it returns infinity.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 20, 2007 at 14:29

For those interested:

Here’s a version with a weighted minimax polynomial of order four, yielding 2-18.54 relative precision:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

    static const __declspec(align(16)) int C4[4] = {0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69};   // 1.3534167e-2f
    static const __declspec(align(16)) int C3[4] = {0x3D5509F9, 0x3D5509F9, 0x3D5509F9, 0x3D5509F9};   // 5.2011464e-2f
    static const __declspec(align(16)) int C2[4] = {0x3E773CC5, 0x3E773CC5, 0x3E773CC5, 0x3E773CC5};   // 2.4144275e-1f
    static const __declspec(align(16)) int C1[4] = {0x3F3168B3, 0x3F3168B3, 0x3F3168B3, 0x3F3168B3};   // 6.9300383e-1f
    static const __declspec(align(16)) int C0[4] = {0x3F800016, 0x3F800016, 0x3F800016, 0x3F800016};   // 1.0000026f

    __asm
    {
        movaps xmm0, x

        minps xmm0, lim1
        maxps xmm0, lim2
        movaps xmm1, xmm0
        subps xmm1, half
        cvtps2dq xmm2, xmm1
        cvtdq2ps xmm1, xmm2
        paddd xmm2, base
        pslld xmm2, 23
        subps xmm0, xmm1
        movaps xmm1, C4
        mulps xmm1, xmm0
        addps xmm1, C3
        mulps xmm1, xmm0
        addps xmm1, C2
        mulps xmm1, xmm0
        addps xmm1, C1
        mulps xmm1, xmm0
        addps xmm1, C0
        mulps xmm1, xmm2
        
        movaps x, xmm1
    }

    return x;
}

Here’s a version with a weighted minimax polynomial of order three, yielding 2-13.70 relative precision:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

    static const __declspec(align(16)) int C3[4] = {0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52};   // 7.8024521e-2f
    static const __declspec(align(16)) int C2[4] = {0x3E677E26, 0x3E677E26, 0x3E677E26, 0x3E677E26};   // 2.2606716e-1f
    static const __declspec(align(16)) int C1[4] = {0x3F322226, 0x3F322226, 0x3F322226, 0x3F322226};   // 6.9583356e-1f
    static const __declspec(align(16)) int C0[4] = {0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19};   // 9.9992520e-1f

    __asm
    {
        movaps xmm0, x

        minps xmm0, lim1
        maxps xmm0, lim2
        movaps xmm1, xmm0
        subps xmm1, half
        cvtps2dq xmm2, xmm1
        cvtdq2ps xmm1, xmm2
        paddd xmm2, base
        pslld xmm2, 23
        subps xmm0, xmm1
        movaps xmm1, C3
        mulps xmm1, xmm0
        addps xmm1, C2
        mulps xmm1, xmm0
        addps xmm1, C1
        mulps xmm1, xmm0
        addps xmm1, C0
        mulps xmm1, xmm2
        
        movaps x, xmm1
    }

    return x;
}

Here’s a version with a weighted minimax polynomial of order two, yielding 2-9.17 relative precision:

#include <emmintrin.h>

__m128 exp2(__m128 x)
{
    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f
    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f
    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f
    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127

    static const __declspec(align(16)) int C2[4] = {0x3EACA418, 0x3EACA418, 0x3EACA418, 0x3EACA418};   // 3.3718944e-1f
    static const __declspec(align(16)) int C1[4] = {0x3F285ADA, 0x3F285ADA, 0x3F285ADA, 0x3F285ADA};   // 6.5763628e-1f
    static const __declspec(align(16)) int C0[4] = {0x3F803884, 0x3F803884, 0x3F803884, 0x3F803884};   // 1.0017247f

    __asm
    {
        movaps xmm0, x

        minps xmm0, lim1
        maxps xmm0, lim2
        movaps xmm1, xmm0
        subps xmm1, half
        cvtps2dq xmm2, xmm1
        cvtdq2ps xmm1, xmm2
        paddd xmm2, base
        pslld xmm2, 23
        subps xmm0, xmm1
        movaps xmm1, C2
        mulps xmm1, xmm0
        addps xmm1, C1
        mulps xmm1, xmm0
        addps xmm1, C0
        mulps xmm1, xmm2
        
        movaps x, xmm1
    }

    return x;
}

Every floating-point number has been tested, using an exhaustive loop. For the record, Intel’s library claims an average relative error of 0.0018, but it peaks at a terrible 0.29 relative error.

6ad5f8c742f1e8ec61000e2b0900fc76
0
davepermen 101 Jan 20, 2007 at 15:56

Hehe, looks cool. If i would work currently in C++, I’d sure plug this into my code directly and look how it affects the image, depending on the quality of the function :)

But I don’t think I’ll touch C++ in the sooner time at all anymore… And I don’t have my code lying around, else, it would be a simple copy-paste.

But I’ll remember :)

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 20, 2007 at 18:09

@davepermen

But I don’t think I’ll touch C++ in the sooner time at all anymore…

Totally fallen in love with C#? ;)

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 25, 2007 at 16:25

A log2 implementation (first attempt):

__m128 log2(__m128 x)
{
    static const __declspec(align(16)) int exp[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000};
    static const __declspec(align(16)) int one[4] = {0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000};   // 1.0f
    static const __declspec(align(16)) int off[4] = {0x3FBF8000, 0x3FBF8000, 0x3FBF8000, 0x3FBF8000};   // 1.4960938f
    static const __declspec(align(16)) int mant[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF};
    static const __declspec(align(16)) int sign[4] = {0x80000000, 0x80000000, 0x80000000, 0x80000000};
    static const __declspec(align(16)) int base[4] = {0x43800000, 0x43800000, 0x43800000, 0x43800000};   // 256.0f

    static const __declspec(align(16)) int C5[4] = {0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5};   // -3.4436006e-2f
    static const __declspec(align(16)) int C4[4] = {0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD};   //  3.1821337e-1f
    static const __declspec(align(16)) int C3[4] = {0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9};   // -1.2315303f
    static const __declspec(align(16)) int C2[4] = {0x4026537B, 0x4026537B, 0x4026537B, 0x4026537B};   //  2.5988452f
    static const __declspec(align(16)) int C1[4] = {0xC054bFAD, 0xC054bFAD, 0xC054bFAD, 0xC054bFAD};   // -3.3241990f
    static const __declspec(align(16)) int C0[4] = {0x4047691A, 0x4047691A, 0x4047691A, 0x4047691A};   //  3.1157899f

    __asm
    {
        movaps xmm0, x

        movaps xmm1, xmm0
        andps xmm1, exp
        psrld xmm1, 8
        orps xmm1, one
        subps xmm1, off
        mulps xmm1, base
        andps xmm0, mant
        orps xmm0, one
        movaps xmm2, C5
        mulps xmm2, xmm0
        addps xmm2, C4
        mulps xmm2, xmm0
        addps xmm2, C3
        mulps xmm2, xmm0
        addps xmm2, C2
        mulps xmm2, xmm0
        addps xmm2, C1
        mulps xmm2, xmm0
        addps xmm2, C0
        subps xmm0, one
        mulps xmm0, xmm2
        addps xmm1, xmm0
        
        movaps x, xmm1
    }

    return x;
}

Unfortunately logarithms don’t converge very fast. It’s also not that easy to get good behaviour for the limits. :sad:

46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jan 25, 2007 at 16:41

Any benchmarks to go with that? Not that I’m doubting it will be faster, but it’s always nice to know how much faster it is. :)

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 25, 2007 at 20:33

@Kenneth Gorking

Any benchmarks to go with that? Not that I’m doubting it will be faster, but it’s always nice to know how much faster it is. :)

I’m not so sure it’s faster. The dependency chains are quite bad, while the Intel library appears to use properly scheduled instructions.

Anyway, I’m considering making it an article for the ‘Daily Code Gem’, including benchmark results. I can’t promise when I have the time for it though…

6ad5f8c742f1e8ec61000e2b0900fc76
0
davepermen 101 Jan 26, 2007 at 00:23

@Nick

Totally fallen in love with C#? ;)

Yes. You and you’re work are the only reason to regret this… But sorry, thats not enough… :)

D619d95cddb1edb227f51ef539d15cdc
0
Nautilus 103 Jan 26, 2007 at 03:40

There’s one thing I don’t understand.
I’ll quote from “AMaths.pdf”, page 5:

Packed vs. Scalar AM Library Functions
Compared to the packed AM Library functions, the scalar ones are slower per operand but
faster per call. Therefore, when just one result is needed, it is faster to call the scalar version.
Otherwise (for two or more results), the packed version should be used.

What does it mean, exactly?

Regards,
Ciao ciao : )

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 26, 2007 at 06:27

@davepermen

Yes. You and you’re work are the only reason to regret this… But sorry, thats not enough… :)

Yeah, C# is a very nice language. I’m actually considering writing my high-level code in it, and leave the performance critical parts in C++/assembly. I’m only afraid it could become messier than just sticking with C++. For the things I work on, high-level code can become low-level code the next day. :lol:

A native C# compiler with inline assembly would convince me in a second though…

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 26, 2007 at 06:47

@Nautilus

What does it mean, exactly?

Take for example am_sin_ss and am_sin_ps. The first computes one sine (a scalar), while the second computes four sines (a vector). The scalar version takes 48 clock cycles, and the vector version 66 clock cycles. So the scalar version is faster, but it only computes one sine! If you need only one sine (for example for constructing a rotation matrix), then the scalar version is optimal. But if you need to compute sines for a whole array of angles (for example for a Fourier series), then the vector version is more efficient. Computing four sines separately with the scalar version is slower than computing four in parallel with the vector version.

D619d95cddb1edb227f51ef539d15cdc
0
Nautilus 103 Jan 26, 2007 at 13:42

Thank you very much, Nick :)

396faee21f28eba5a8b500fb7d62ba79
0
eigma 101 Dec 16, 2007 at 01:13

Hi Nick,

I’m a novice at SSE programming – I’m teaching myself SSE by using a mix of SSE and FP to implement a linear algebra algorithm. I’m using FP because the algorithm involves the logarithm function. I’ve determined that this is the main bottleneck (according to gprof, I spend 86% of the time in the logarithm function), so I’m looking for ways to implement the logarithm function in pure SSE.

I’m reluctant to reuse your log2 code verbatim; I think that by redesigning the algorithm with my particular application in mind (for example, my required interval of convergence, precision, etc.), I can better optimize the overall performance of my application.

Do you have any pointers on how to design a logarithm approximator?

From what I can see in your code, you’re extracting the exponent field and approximating the logarithm of the mantissa using a 5th-order polynomial.

How did you arrive at the coefficients? I would either use 5 terms of the Taylor series expansion for log, or pick 5 points (perhaps evenly-spaced) on the interval of interest and obtain a 5th-order polynomial that passes through those points (and perhaps visibly verify that the error between those points is acceptable).

Anyway, I’m not very experienced in this domain, which is why I’m hoping you could provide some pointers.

Thanks,

Catalin Patulea

Dd1afef0875400de13b717d123eec3b5
0
XavierDC 101 Jan 20, 2011 at 14:05

I am reactivating that old post.

I was wondering if anyone could explain to me how the values C0,C1 etc. are determined.

I want to use the same system that is is presented here for 2\^x for the logistic function 1/(1+e\^-x). i used the code shown by nick by adding a multiplication by 1/ln2 and add the addps xmm1,one; rcpps xmm1, xmm1

it works but if we could skip these it would gain in speed and in precision.

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jan 20, 2011 at 14:28

@XavierDC

I was wondering if anyone could explain to me how the values C0,C1 etc. are determined.

The coefficients of the polynomial are determined using the minimax algorithm. I believe I used Maple’s implementation of it.

Dd1afef0875400de13b717d123eec3b5
0
XavierDC 101 Jan 21, 2011 at 00:05

@Nick

The coefficients of the polynomial are determined using the minimax algorithm. I believe I used Maple’s implementation of it.

Thanks Nick.

I’ve sent you a private message in this forum. If you can get back to me by email i would appreciate. thank you.