Jump to content


Approximate Math Library


24 replies to this topic

#1 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 16 January 2007 - 03:18 PM

Hi all,

I'm looking for the Intel Approximate Math Library. It used to be at www.intel.com/design/Pentium4/devtools/ but that page moved and I couldn't locate the library.

So, does anyone know where to find it, or can send me a copy? That would be much appreciated. :worthy: I'm also interested in other approximation libraries/functions. The more options between precision and performance the better...

Thanks a lot,

Nick

#2 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 17 January 2007 - 04:21 PM

I have never heard of this library, but AMD has a library of their. I'm sure you know this already, but I'll post it anyway just in case: http://developer.amd.com/acml.jsp
"Stupid bug! You go squish now!!" - Homer Simpson

#3 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 17 January 2007 - 05:22 PM

AMD's library doesn't really have what I'm looking for. I seek fast implementations of transcendental functions (exp, log, sin, etc). Thanks anyway, it was worth checking out.

So the Intel Approximate Math Library contains approximations of transcendental functions, optimized for SSE. For some time I have used my own implementations (sin/cos) but now I'm looking for more precision with minimal performance impact...

#4 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 17 January 2007 - 06:17 PM

The CRT has a few functions implemented using SSE2 (see here), but they are most likely built for precision. The CRT source does however come with VS2005, so you could dig up those files and enhance them for speed :)
"Stupid bug! You go squish now!!" - Homer Simpson

#5 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 18 January 2007 - 10:29 AM

Interesting! It seems pretty hard to derive a faster (single-precision) version from it, but it's always good to have a reference.

#6 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 18 January 2007 - 05:46 PM

Since the Intel library is still nowhere to be found, I started writing my own approximations. Here's a 2^x implementation (first attempt):

float exp2(float x)

{

    static const float one = 1.0f;

    static const float half = 0.5f;

    static const float lim_two = 1.9999999f;


    static const float C0[256] = {.547404266246f, .548061348142f, .548725333807f, .549398399167f, .550075922598f, .550760241024f, .551454706663f, .552154084359f, .552858883435f, .553574967496f, .554297163206f, .555026093302f, .555760495106f, .556505137841f, .557256046635f, .558017083881f, .558781729040f, .559558847539f, .560337005182f, .561128551344f, .561926919212f, .562731360892f, .563548838504f, .564371785305f, .565202403662f, .566041175842f, .566885148265f, .567741995681f, .568603271104f, .569474540778f, .570356611167f, .571248305001f, .572143490366f, .573051865152f, .573965673107f, .574892291723f, .575822287504f, .576764054387f, .577713794250f, .578675521802f, .579646563842f, .580622066161f, .581609491764f, .582606143121f, .583612970496f, .584629192183f, .585655505258f, .586690228227f, .587733473599f, .588788523879f, .589851795687f, .590925279595f, .592009141821f, .593102901511f, .594205792172f, .595321583429f, .596446634193f, .597577914584f, .598723732752f, .599879808291f, .601047527774f, .602222182051f, .603410536625f, .604607039548f, .605814947112f, .607033833104f, .608266709013f, .609506083140f, .610760886405f, .612025725384f, .613302240932f, .614585463030f, .615883610698f, .617191217751f, .618511752134f, .619845226500f, .621187546929f, .622542829136f, .623914626393f, .625290774906f, .626682429570f, .628088718711f, .629505305967f, .630928111994f, .632369622796f, .633820437496f, .635288816972f, .636766410268f, .638255823256f, .639759511520f, .641275658394f, .642806215397f, .644347352744f, .645904478891f, .647470860480f, .649053754624f, .650645737483f, .652253552238f, .653875847907f, .655506204107f, .657158529182f, .658824473497f, .660502429383f, .662188430285f, .663896943735f, .665614322456f, .667344320807f, .669091861220f, .670854384730f, .672633065213f, .674418265280f, .676229143542f, .678048816547f, .679881705149f, .681730321995f, .683593925186f, .685478115237f, .687372988183f, .689280701315f, .691208589068f, .693153490063f, .695110683979f, .697086170558f, .699075467598f, .701082480523f, .703098111035f, .705136926814f, .707188064752f, .709255034137f, .711344041865f, .713449959172f, .715567879355f, .717699596625f, .719857427125f, .722028307800f, .724216235913f, .726413096106f, .728639561732f, .730875455600f, .733132357639f, .735405868218f, .737702530438f, .740009520055f, .742332281356f, .744679668196f, .747037600146f, .749420315605f, .751820376930f, .754243487814f, .756683762441f, .759133959936f, .761605643019f, .764099729889f, .766614919527f, .769146093637f, .771693145114f, .774267141070f, .776858393243f, .779466882269f, .782096835996f, .784742415436f, .787410123430f, .790101570173f, .792815520645f, .795542431752f, .798294870351f, .801063242968f, .803849366342f, .806665638513f, .809494760355f, .812353037567f, .815226352727f, .818124218289f, .821037431411f, .823977286773f, .826945359586f, .829920082804f, .832926883189f, .835951538712f, .839006712543f, .842080344842f, .845177216775f, .848284765471f, .851423322415f, .854588979701f, .857776791512f, .860984910999f, .864215879261f, .867469536588f, .870748683758f, .874051542963f, .877378717421f, .880731452223f, .884101609574f, .887502615964f, .890925768871f, .894372845122f, .897843135408f, .901337846858f, .904862544340f, .908410015848f, .911984035138f, .915576759348f, .919196625760f, .922845709434f, .926529189028f, .930225124843f, .933945305924f, .937700927963f, .941479193749f, .945282181080f, .949117765062f, .952980179587f, .956873412255f, .960779020905f, .964718014705f, .968687508318f, .972685203315f, .976705892948f, .980760930721f, .984836290213f, .988951866922f, .993080243943f, .997245671924f, 1.00144630340f, 1.00567323430f, 1.00991602499f, 1.01419270811f, 1.01850230704f, 1.02285032212f, 1.02721871206f, 1.03161636643f, 1.03604653022f, 1.04050699395f, 1.04499850272f, 1.04951852054f, 1.05407459978f, 1.05865921545f, 1.06326856892f, 1.06791602645f, 1.07259384061f, 1.07729671165f, 1.08203878536f, 1.08680950686f, 1.09161248555f, 1.09645085758f, 1.10132609310f, 1.10623766041f, 1.11114750589f, 1.11612095277f, 1.12113714360f, 1.12616425808f, 1.13122641447f, 1.13631798693f, 1.14146805990f, 1.14663629983f};

    static const float C1[256] = {.212044759469f, .210735702190f, .209418015443f, .208087475652f, .206753275519f, .205410853969f, .204053729978f, .202692211082f, .201325322173f, .199941790541f, .198551704706f, .197153917470f, .195750881257f, .194333564491f, .192909615777f, .191471793032f, .190032461265f, .188575010632f, .187120933209f, .185647204506f, .184166170344f, .182679245802f, .181173672353f, .179663467005f, .178144635485f, .176616349360f, .175084037800f, .173533838064f, .171981116491f, .170415875372f, .168836776605f, .167246022818f, .165654589240f, .164045289244f, .162431953647f, .160801620569f, .159170951952f, .157525268876f, .155871296686f, .154202124016f, .152522495456f, .150840826888f, .149144311266f, .147437665224f, .145719346014f, .143990758331f, .142250789144f, .140502354043f, .138745315445f, .136974218191f, .135195151140f, .133404847135f, .131603099825f, .129790785291f, .127969230244f, .126132298677f, .124286064322f, .122435531964f, .120567182576f, .118688092798f, .116796095743f, .114898862963f, .112985536212f, .111065131312f, .109132474142f, .107188321781f, .105227969151f, .103263383910f, .101280483950f, .99287883117e-1f, .97283061235e-1f, .95273867908e-1f, .93247492774e-1f, .91212558811e-1f, .89163726456e-1f, .87101072766e-1f, .85030989812e-1f, .82947185421e-1f, .80844315542e-1f, .78741073956e-1f, .76620455758e-1f, .74483906833e-1f, .72338082587e-1f, .70189189871e-1f, .68018435591e-1f, .65840082252e-1f, .63641802316e-1f, .61436186248e-1f, .59219384845e-1f, .56987821889e-1f, .54744275251e-1f, .52485937632e-1f, .50218520627e-1f, .47934149447e-1f, .45642766126e-1f, .43333822826e-1f, .41018216474e-1f, .38686201364e-1f, .36339829169e-1f, .33988431451e-1f, .31612040002e-1f, .29222787772e-1f, .26823040651e-1f, .24418495167e-1f, .21988609993e-1f, .19552890922e-1f, .17106039552e-1f, .14641181181e-1f, .12162020545e-1f, .9666998685e-2f, .7169663523e-2f, .4643309730e-2f, .2111596679e-2f, -.431598824e-3f, -.2989690662e-2f, -.5561577943e-2f, -.8154880822e-2f, -.10755886294e-1f, -.13367524684e-1f, -.15999743658e-1f, -.18648123903e-1f, -.21306166293e-1f, -.23981953789e-1f, -.26669330748e-1f, -.29373502333e-1f, -.32082162062e-1f, -.34814814297e-1f, -.37556805223e-1f, -.40312772251e-1f, -.43090887470e-1f, -.45884226576e-1f, -.48686220012e-1f, -.51499213346e-1f, -.54339341155e-1f, -.57189308948e-1f, -.6005430929e-1f, -.6292367512e-1f, -.6582431504e-1f, -.6872984049e-1f, -.7165524503e-1f, -.7459473297e-1f, -.7755666791e-1f, -.8052444140e-1f, -.8350502578e-1f, -.8650968268e-1f, -.8952031711e-1f, -.9255503332e-1f, -.9560425465e-1f, -.9867513068e-1f, -.10176011141f, -.10485000844f, -.10795935979f, -.11108920339f, -.11423780756f, -.11739868933f, -.12057166547f, -.12377042647f, -.12698282286f, -.13020878013f, -.13345344254f, -.13670954099f, -.13998500792f, -.14328171413f, -.14659803793f, -.14992226374f, -.15326963975f, -.15662841273f, -.16000074335f, -.16340152411f, -.16680978718f, -.17024508632f, -.17369037027f, -.17715697361f, -.18063381546f, -.18413430337f, -.18766017755f, -.19118577445f, -.19474116819f, -.19830943646f, -.20190541522f, -.20551481061f, -.20914316989f, -.21277573100f, -.21643619837f, -.22011987890f, -.22382091824f, -.22753710375f, -.23127131148f, -.23502327513f, -.23879613644f, -.24258775687f, -.24639874443f, -.25023043386f, -.25407345559f, -.25794304031f, -.26182918290f, -.26573382630f, -.26965609165f, -.27359726148f, -.27756351145f, -.28154663028f, -.28555077208f, -.28956708486f, -.29360493441f, -.29766652900f, -.30175749946f, -.30585341675f, -.30996731111f, -.31411145322f, -.31827160793f, -.32245000439f, -.32665519190f, -.33088073838f, -.33513090577f, -.33938551392f, -.34366739446f, -.34797328516f, -.35230059730f, -.35664361881f, -.36101452016f, -.36539810091f, -.36981565376f, -.37423767670f, -.37869008365f, -.38317075577f, -.38767009071f, -.39217694488f, -.39671041296f, -.40126933906f, -.40585941045f, -.41046149518f, -.41508490180f, -.41973294431f, -.42440320388f, -.42909637229f, -.43380970842f, -.43855098623f, -.44331228018f, -.44808957363f, -.45289663497f, -.45772533171f, -.46257013252f, -.46744551144f, -.47234051123f, -.47725875550f, -.48220335041f, -.48717569030f, -.49217510643f, -.49716286338f, -.50220524890f, -.50728090192f, -.51235755815f, -.51745955485f, -.52258113148f, -.52775139108f, -.53292971984f};

    static const float C2[256] = {.240550974386f, .241202959893f, .241856701004f, .242514265929f, .243171103188f, .243829457574f, .244492480331f, .245155117376f, .245817854428f, .246486128053f, .247155039241f, .247825134276f, .248495241281f, .249169654431f, .249844713280f, .250523830778f, .251201163433f, .251884509323f, .252563787406f, .253249747676f, .253936606109f, .254623711603f, .255316926931f, .256009778763f, .256704095772f, .257400249962f, .258095764201f, .258796915826f, .259496734037f, .260199725451f, .260906458608f, .261615922477f, .262323223326f, .263035992751f, .263748082312f, .264465202209f, .265180012111f, .265898946252f, .266619044658f, .267343298130f, .268069619098f, .268794376437f, .269523080941f, .270253687830f, .270986837738f, .271721917527f, .272459385827f, .273197995608f, .273937798870f, .274681077710f, .275425261899f, .276171707452f, .276920486684f, .277671219387f, .278423348235f, .279179385328f, .279936813823f, .280693582599f, .281455206542f, .282218776699f, .282985153731f, .283751228711f, .284521373431f, .285291943163f, .286065008210f, .286840250780f, .287619522875f, .288398060167f, .289181428600f, .289966204157f, .290753368791f, .291539837984f, .292330618012f, .293122323170f, .293917023197f, .294714665277f, .295512769094f, .296313753856f, .297119642359f, .297923266738f, .298731122037f, .299542626941f, .300355242358f, .301166621541f, .301983850061f, .302801532724f, .303624282505f, .304447368111f, .305272224846f, .306100168085f, .306930150725f, .307763195441f, .308597186775f, .309435004672f, .310272993070f, .311114998221f, .311957035187f, .312802638726f, .313651045018f, .314498876947f, .315353314930f, .316209965286f, .317067971632f, .317925302128f, .318789261216f, .319652892561f, .320518078729f, .321387232845f, .322259028085f, .323133993420f, .324007380664f, .324888508986f, .325769102908f, .326651295684f, .327536259333f, .328423600216f, .329315922731f, .330208493193f, .331102319414f, .332000786889f, .332902364781f, .333804829061f, .334710914663f, .335618521194f, .336529395375f, .337439388370f, .338355041149f, .339271424585f, .340190082999f, .341113718319f, .342040006406f, .342966760909f, .343894760691f, .344829302014f, .345764673269f, .346702572919f, .347639507987f, .348584246701f, .349528173351f, .350476153452f, .351426290935f, .352381270258f, .353335726296f, .354291903376f, .355253394261f, .356214397492f, .357180679851f, .358149170924f, .359122122768f, .360097126490f, .361071279072f, .362049162020f, .363031076715f, .364016460615f, .365003272896f, .365991452648f, .366985245318f, .367980853922f, .368978251405f, .369979013823f, .370980890602f, .371986312192f, .372995831811f, .374008931067f, .375022025929f, .376039753227f, .377058524791f, .378078994230f, .379105644931f, .380132134704f, .381164337027f, .382197114688f, .383233855329f, .384271234863f, .385313243354f, .386360369799f, .387404990922f, .388456010655f, .389508406520f, .390566534906f, .391626171784f, .392688937114f, .393750505484f, .394817797166f, .395889415356f, .396963639187f, .398039817634f, .399118775070f, .400200421917f, .401285649507f, .402373826907f, .403465115321f, .404559881707f, .405655440162f, .406756120792f, .407859059092f, .408964796284f, .410073072953f, .411184239170f, .412300018644f, .413418085218f, .414539592392f, .415662053303f, .416788077635f, .417918262200f, .419054147477f, .420188943978f, .421326262837f, .422469476617f, .423614637508f, .424762353579f, .425914955357f, .427070660204f, .428230616859f, .429389315399f, .430552969348f, .431720668445f, .432891695257f, .434064493914f, .435242336232f, .436421114570f, .437606536221f, .438790675206f, .439980464677f, .441175310294f, .442372633912f, .443569472135f, .444770890044f, .445976559606f, .447187960487f, .448400030952f, .449615218011f, .450834377285f, .452056858160f, .453282828803f, .454511559547f, .455745060764f, .456981256106f, .458219094270f, .459462130949f, .460708242561f, .461955996149f, .463209104131f, .464464733125f, .465723802255f, .466987089818f, .468254934518f, .469527143677f, .470793869973f, .472071940011f, .473355895138f, .474637566791f, .475923104541f, .477211044044f, .478508675487f, .479805784911f};


    #define EXP2_CHECK_FOR_LIMITS 1


    __asm

    {

        movss xmm0, x


        movss xmm1, xmm0

        subss xmm1, half

        cvtss2si ecx, xmm1

        cvtsi2ss xmm2, ecx

        add ecx, 127


    #if EXP2_CHECK_FOR_LIMITS

        test ecx, 0xFFFFFF00

        jnz overflow

    #endif


        shl ecx, 23

        movd xmm7, ecx


        subss xmm0, xmm2

        addss xmm0, one

        minss xmm0, lim_two

        movd eax, xmm0


        and eax, 0x007F8000

        shr eax, 13

        movss xmm1, C2[eax]

        mulss xmm1, xmm0

        addss xmm1, C1[eax]

        mulss xmm1, xmm0

        addss xmm1, C0[eax]

        

        mulss xmm1, xmm7

        

        movss x, xmm1


    #if EXP2_CHECK_FOR_LIMITS

        jmp end

    overflow:

        comiss xmm0, one

        jp not_a_number

        jb underflow

        mov x, 0x7F800000

        jmp end

    underflow:

        mov x, 0x00000000

        jmp end

    not_a_number:

        movss x, xmm0

    end:

    #endif

    }


    return x;

}

It's accurate to 2 ulps. Basically it uses the 8 upper mantissa bits to lookup the coefficients of a quadratic polynomial, determined with the minimax algorithm. It handles all inputs (denormals, infinity, NaN), but flushes denormal results to 0.0.

All ideas to make it more accurate and/or faster are welcome!

#7 Reedbeta

    DevMaster Staff

  • Administrators
  • 5307 posts
  • LocationBellevue, WA

Posted 18 January 2007 - 10:46 PM

What are ulps? (Couldn't find the term with google...)
reedbeta.com - developer blog, OpenGL demos, and other projects

#8 Dia

    DevMaster Staff

  • Administrators
  • 1120 posts

Posted 18 January 2007 - 11:43 PM

Nick: Is this what you're looking for? ;)

btw, ulps stands for Units in the Last Place.

#9 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 19 January 2007 - 12:11 AM

Dia said:

Is this what you're looking for? :)
Thanks! :w00t:

#10 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 20 January 2007 - 09:48 AM

The accuracy/performance balance of the Intel library failed to impress me. At least for the exp2 implementation...

Here's my own vector implementation:

#include <emmintrin.h>


__m128 exp2(__m128 x)

{

    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f

    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f

    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f

    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127


    static const __declspec(align(16)) int C5[4] = {0x3AF61905, 0x3AF61905, 0x3AF61905, 0x3AF61905};   // 1.8775767e-3f

    static const __declspec(align(16)) int C4[4] = {0x3C134806, 0x3C134806, 0x3C134806, 0x3C134806};   // 8.9893397e-3f

    static const __declspec(align(16)) int C3[4] = {0x3D64AA23, 0x3D64AA23, 0x3D64AA23, 0x3D64AA23};   // 5.5826318e-2f

    static const __declspec(align(16)) int C2[4] = {0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4, 0x3E75EAD4};   // 2.4015361e-1f

    static const __declspec(align(16)) int C1[4] = {0x3F31727B, 0x3F31727B, 0x3F31727B, 0x3F31727B};   // 6.9315308e-1f

    static const __declspec(align(16)) int C0[4] = {0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF, 0x3F7FFFFF};   // 9.9999994e-1f


    __asm

    {

        movaps xmm0, x


        minps xmm0, lim1

        maxps xmm0, lim2

        movaps xmm1, xmm0

        subps xmm1, half

        cvtps2dq xmm2, xmm1

        cvtdq2ps xmm1, xmm2

        paddd xmm2, base

        pslld xmm2, 23

        subps xmm0, xmm1

        movaps xmm1, C5

        mulps xmm1, xmm0

        addps xmm1, C4

        mulps xmm1, xmm0

        addps xmm1, C3

        mulps xmm1, xmm0

        addps xmm1, C2

        mulps xmm1, xmm0

        addps xmm1, C1

        mulps xmm1, xmm0

        addps xmm1, C0

        mulps xmm1, xmm2

        

        movaps x, xmm1

    }


    return x;

}

Unlike the previous version it requires SSE2. It doesn't use any lookup tables, but instead uses a polynomial of degree five. This creates a fairly long dependency chain, wich means a high latency. But at 21 instructions this is compensated with a high throughput. In English this means it's very suited for computing lots of 2x values in a loop.

It has a relative accuracy of 2-22.33. It returns infinity on overflow, 0.0 on underflow, and keeps it accracy with denormals. For NaN inputs it returns infinity.

#11 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 20 January 2007 - 02:29 PM

For those interested:

Here's a version with a weighted minimax polynomial of order four, yielding 2-18.54 relative precision:

#include <emmintrin.h>


__m128 exp2(__m128 x)

{

    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f

    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f

    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f

    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127


    static const __declspec(align(16)) int C4[4] = {0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69, 0x3C5DBE69};   // 1.3534167e-2f

    static const __declspec(align(16)) int C3[4] = {0x3D5509F9, 0x3D5509F9, 0x3D5509F9, 0x3D5509F9};   // 5.2011464e-2f

    static const __declspec(align(16)) int C2[4] = {0x3E773CC5, 0x3E773CC5, 0x3E773CC5, 0x3E773CC5};   // 2.4144275e-1f

    static const __declspec(align(16)) int C1[4] = {0x3F3168B3, 0x3F3168B3, 0x3F3168B3, 0x3F3168B3};   // 6.9300383e-1f

    static const __declspec(align(16)) int C0[4] = {0x3F800016, 0x3F800016, 0x3F800016, 0x3F800016};   // 1.0000026f


    __asm

    {

        movaps xmm0, x


        minps xmm0, lim1

        maxps xmm0, lim2

        movaps xmm1, xmm0

        subps xmm1, half

        cvtps2dq xmm2, xmm1

        cvtdq2ps xmm1, xmm2

        paddd xmm2, base

        pslld xmm2, 23

        subps xmm0, xmm1

        movaps xmm1, C4

        mulps xmm1, xmm0

        addps xmm1, C3

        mulps xmm1, xmm0

        addps xmm1, C2

        mulps xmm1, xmm0

        addps xmm1, C1

        mulps xmm1, xmm0

        addps xmm1, C0

        mulps xmm1, xmm2

        

        movaps x, xmm1

    }


    return x;

}

Here's a version with a weighted minimax polynomial of order three, yielding 2-13.70 relative precision:

#include <emmintrin.h>


__m128 exp2(__m128 x)

{

    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f

    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f

    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f

    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127


    static const __declspec(align(16)) int C3[4] = {0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52, 0x3D9FCB52};   // 7.8024521e-2f

    static const __declspec(align(16)) int C2[4] = {0x3E677E26, 0x3E677E26, 0x3E677E26, 0x3E677E26};   // 2.2606716e-1f

    static const __declspec(align(16)) int C1[4] = {0x3F322226, 0x3F322226, 0x3F322226, 0x3F322226};   // 6.9583356e-1f

    static const __declspec(align(16)) int C0[4] = {0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19, 0x3F7FFB19};   // 9.9992520e-1f


    __asm

    {

        movaps xmm0, x


        minps xmm0, lim1

        maxps xmm0, lim2

        movaps xmm1, xmm0

        subps xmm1, half

        cvtps2dq xmm2, xmm1

        cvtdq2ps xmm1, xmm2

        paddd xmm2, base

        pslld xmm2, 23

        subps xmm0, xmm1

        movaps xmm1, C3

        mulps xmm1, xmm0

        addps xmm1, C2

        mulps xmm1, xmm0

        addps xmm1, C1

        mulps xmm1, xmm0

        addps xmm1, C0

        mulps xmm1, xmm2

        

        movaps x, xmm1

    }


    return x;

}

Here's a version with a weighted minimax polynomial of order two, yielding 2-9.17 relative precision:

#include <emmintrin.h>


__m128 exp2(__m128 x)

{

    static const __declspec(align(16)) int lim1[4] = {0x43010000, 0x43010000, 0x43010000, 0x43010000};   // 129.00000e+0f

    static const __declspec(align(16)) int lim2[4] = {0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF, 0xC2FDFFFF};   // -126.99999e+0f

    static const __declspec(align(16)) int half[4] = {0x3F000000, 0x3F000000, 0x3F000000, 0x3F000000};   // 0.5e+0f

    static const __declspec(align(16)) int base[4] = {0x0000007F, 0x0000007F, 0x0000007F, 0x0000007F};   // 127


    static const __declspec(align(16)) int C2[4] = {0x3EACA418, 0x3EACA418, 0x3EACA418, 0x3EACA418};   // 3.3718944e-1f

    static const __declspec(align(16)) int C1[4] = {0x3F285ADA, 0x3F285ADA, 0x3F285ADA, 0x3F285ADA};   // 6.5763628e-1f

    static const __declspec(align(16)) int C0[4] = {0x3F803884, 0x3F803884, 0x3F803884, 0x3F803884};   // 1.0017247f


    __asm

    {

        movaps xmm0, x


        minps xmm0, lim1

        maxps xmm0, lim2

        movaps xmm1, xmm0

        subps xmm1, half

        cvtps2dq xmm2, xmm1

        cvtdq2ps xmm1, xmm2

        paddd xmm2, base

        pslld xmm2, 23

        subps xmm0, xmm1

        movaps xmm1, C2

        mulps xmm1, xmm0

        addps xmm1, C1

        mulps xmm1, xmm0

        addps xmm1, C0

        mulps xmm1, xmm2

        

        movaps x, xmm1

    }


    return x;

}

Every floating-point number has been tested, using an exhaustive loop. For the record, Intel's library claims an average relative error of 0.0018, but it peaks at a terrible 0.29 relative error.

#12 davepermen

    Senior Member

  • Members
  • PipPipPipPip
  • 1306 posts

Posted 20 January 2007 - 03:56 PM

Hehe, looks cool. If i would work currently in C++, I'd sure plug this into my code directly and look how it affects the image, depending on the quality of the function :)

But I don't think I'll touch C++ in the sooner time at all anymore... And I don't have my code lying around, else, it would be a simple copy-paste.

But I'll remember :)
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

#13 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 20 January 2007 - 06:09 PM

davepermen said:

But I don't think I'll touch C++ in the sooner time at all anymore...
Totally fallen in love with C#? ;)

#14 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 25 January 2007 - 04:25 PM

A log2 implementation (first attempt):

__m128 log2(__m128 x)

{

    static const __declspec(align(16)) int exp[4] = {0x7F800000, 0x7F800000, 0x7F800000, 0x7F800000};

    static const __declspec(align(16)) int one[4] = {0x3F800000, 0x3F800000, 0x3F800000, 0x3F800000};   // 1.0f

    static const __declspec(align(16)) int off[4] = {0x3FBF8000, 0x3FBF8000, 0x3FBF8000, 0x3FBF8000};   // 1.4960938f

    static const __declspec(align(16)) int mant[4] = {0x007FFFFF, 0x007FFFFF, 0x007FFFFF, 0x007FFFFF};

    static const __declspec(align(16)) int sign[4] = {0x80000000, 0x80000000, 0x80000000, 0x80000000};

    static const __declspec(align(16)) int base[4] = {0x43800000, 0x43800000, 0x43800000, 0x43800000};   // 256.0f


    static const __declspec(align(16)) int C5[4] = {0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5, 0xBD0D0CC5};   // -3.4436006e-2f

    static const __declspec(align(16)) int C4[4] = {0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD, 0x3EA2ECDD};   //  3.1821337e-1f

    static const __declspec(align(16)) int C3[4] = {0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9, 0xBF9dA2C9};   // -1.2315303f

    static const __declspec(align(16)) int C2[4] = {0x4026537B, 0x4026537B, 0x4026537B, 0x4026537B};   //  2.5988452f

    static const __declspec(align(16)) int C1[4] = {0xC054bFAD, 0xC054bFAD, 0xC054bFAD, 0xC054bFAD};   // -3.3241990f

    static const __declspec(align(16)) int C0[4] = {0x4047691A, 0x4047691A, 0x4047691A, 0x4047691A};   //  3.1157899f


    __asm

    {

        movaps xmm0, x


        movaps xmm1, xmm0

        andps xmm1, exp

        psrld xmm1, 8

        orps xmm1, one

        subps xmm1, off

        mulps xmm1, base

        andps xmm0, mant

        orps xmm0, one

        movaps xmm2, C5

        mulps xmm2, xmm0

        addps xmm2, C4

        mulps xmm2, xmm0

        addps xmm2, C3

        mulps xmm2, xmm0

        addps xmm2, C2

        mulps xmm2, xmm0

        addps xmm2, C1

        mulps xmm2, xmm0

        addps xmm2, C0

        subps xmm0, one

        mulps xmm0, xmm2

        addps xmm1, xmm0

        

        movaps x, xmm1

    }


    return x;

}

Unfortunately logarithms don't converge very fast. It's also not that easy to get good behaviour for the limits. :sad:

#15 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 25 January 2007 - 04:41 PM

Any benchmarks to go with that? Not that I'm doubting it will be faster, but it's always nice to know how much faster it is. :)
"Stupid bug! You go squish now!!" - Homer Simpson

#16 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 25 January 2007 - 08:33 PM

Kenneth Gorking said:

Any benchmarks to go with that? Not that I'm doubting it will be faster, but it's always nice to know how much faster it is. :)
I'm not so sure it's faster. The dependency chains are quite bad, while the Intel library appears to use properly scheduled instructions.

Anyway, I'm considering making it an article for the 'Daily Code Gem', including benchmark results. I can't promise when I have the time for it though...

#17 davepermen

    Senior Member

  • Members
  • PipPipPipPip
  • 1306 posts

Posted 26 January 2007 - 12:23 AM

Nick said:

Totally fallen in love with C#? ;)

Yes. You and you're work are the only reason to regret this... But sorry, thats not enough... :)
davepermen.net
-Loving a Person is having the wish to see this Person happy, no matter what that means to yourself.
-No matter what it means to myself....

#18 Nautilus

    Senior Member

  • Members
  • PipPipPipPip
  • 351 posts

Posted 26 January 2007 - 03:40 AM

There's one thing I don't understand.
I'll quote from "AMaths.pdf", page 5:

Quote

Packed vs. Scalar AM Library Functions
Compared to the packed AM Library functions, the scalar ones are slower per operand but
faster per call. Therefore, when just one result is needed, it is faster to call the scalar version.
Otherwise (for two or more results), the packed version should be used.
What does it mean, exactly?

Regards,
Ciao ciao : )
-Nautilus

(readin' this? you ought to get out more)


#19 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 26 January 2007 - 06:27 AM

davepermen said:

Yes. You and you're work are the only reason to regret this... But sorry, thats not enough... :)
Yeah, C# is a very nice language. I'm actually considering writing my high-level code in it, and leave the performance critical parts in C++/assembly. I'm only afraid it could become messier than just sticking with C++. For the things I work on, high-level code can become low-level code the next day. :lol:

A native C# compiler with inline assembly would convince me in a second though...

#20 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 26 January 2007 - 06:47 AM

Nautilus said:

What does it mean, exactly?
Take for example am_sin_ss and am_sin_ps. The first computes one sine (a scalar), while the second computes four sines (a vector). The scalar version takes 48 clock cycles, and the vector version 66 clock cycles. So the scalar version is faster, but it only computes one sine! If you need only one sine (for example for constructing a rotation matrix), then the scalar version is optimal. But if you need to compute sines for a whole array of angles (for example for a Fourier series), then the vector version is more efficient. Computing four sines separately with the scalar version is slower than computing four in parallel with the vector version.





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users