I've been looking for references on texture swizzling, but found very little about how it is done.
Does anyone know of any good sources (books, online, whatever) on this?
I'm writing a simple software renderer and wanted to try swizzled textures to potentially increase the performance of the bilinear filtering through texel locality.
/Mattias
Texture Swizzling
Started by Mattias Gustavsson, Jun 21 2007 07:54 AM
5 replies to this topic
#1
Posted 21 June 2007 - 07:54 AM
#2
Posted 21 June 2007 - 04:44 PM
I don't know much of the specifics, but it's my understanding that it uses a space-filling curve as mapping from addresses to (x,y) texel locations. This article shows some examples of such curves, though I don't know if any of those is the specific one used for swizzled textures.
reedbeta.com - developer blog, OpenGL demos, and other projects
#3
Posted 21 June 2007 - 08:51 PM
What's the CPU architecture you're working on? Modern CPU's have really advanced caches with automatic prefetching optimized for accessing large 'matrices' in complex ways. In my experience swizzling sometimes helps a bit and sometimes hurts a bit. Add to this that it greatly complicates texture management, and you really want to think twice before using it. Michael Abrash came to the same conclusion. Good mipmapping is a lot more crucial.
Anyway, the implementation of texture swizzling is pretty simple. Just swap some of the U and V addressing bits.
For simplicity, assume we have a 16x16 texture with 8-bit color. For regular addressing, we compute the address as U + 16 * V. This gives us the following layout of the addressing bits (MSB first so 'read' from right to left):
V3V2V1V0U3U2U1U0
Note that incrementing V by one takes us 16 bytes further. To prevent this, we can rearrange the bits like this (for example):
V3V2U3U2V1V0U1U0
Now a small change in V isn't so drastic. Note that computing the address like this requires some masking and shifting operations though (which is the reason why swizzling is sometimes slower). An even better swizzling layout is this:
V3U3V2U2V1U1V0U0
Every bit is interleaved, and small changes in either U or V correspond to the smallest possible changes in the address. This swizzling is even harder to compute thougt.
If you can't really visualize the impact of swapping U and V bits, draw a 16x16 matrix on paper and give every element an address according to one of the above layouts. Then connect the numbers. :happy: The first layout groups the texels in blocks of 4x4 x 4x4. The second creates groups of 2x2 x 2x2 x 2x2 x 2x2. :ninja:
Anyway, the implementation of texture swizzling is pretty simple. Just swap some of the U and V addressing bits.
For simplicity, assume we have a 16x16 texture with 8-bit color. For regular addressing, we compute the address as U + 16 * V. This gives us the following layout of the addressing bits (MSB first so 'read' from right to left):
V3V2V1V0U3U2U1U0
Note that incrementing V by one takes us 16 bytes further. To prevent this, we can rearrange the bits like this (for example):
V3V2U3U2V1V0U1U0
Now a small change in V isn't so drastic. Note that computing the address like this requires some masking and shifting operations though (which is the reason why swizzling is sometimes slower). An even better swizzling layout is this:
V3U3V2U2V1U1V0U0
Every bit is interleaved, and small changes in either U or V correspond to the smallest possible changes in the address. This swizzling is even harder to compute thougt.
If you can't really visualize the impact of swapping U and V bits, draw a 16x16 matrix on paper and give every element an address according to one of the above layouts. Then connect the numbers. :happy: The first layout groups the texels in blocks of 4x4 x 4x4. The second creates groups of 2x2 x 2x2 x 2x2 x 2x2. :ninja:
#4
Posted 22 June 2007 - 12:10 AM
What nick talks about is also known as Morton-Order, and it creates the z-order space filling curve. A neat advantage of this order is, that you can implement affine texture mapping with nearly zero overhead.
The usual per pixel task for texture mapping is to extract the texture address and then increment or decrement the uv coordinates. Doing the same in morton order can be done somewhat like this:
(I use 4.4 formated fixed point which is way to imprecise to be practical, but it's ok to show the idea behind it)
That's how the UV coordinates look before converting them:
( w = whole bits, f = fraction bits)
u := wwwwffff
v := wwwwffff
Now interleave zero bits into the uv coordinates:
u := w0w0w0wffff
v := w0w0w0wffff
Convert your deltas for U and V for as well, but interleave with ones:
dudx = w1w1w1wffff
dvdx = w1w1w1wffff
And here comes the trick: The deltas can just be added onto the coordinates if you mask out
the zero bit gaps after addition. This makes the addition work directly in morton order.
Some code:
The netto cost per pixel are two AND operations. That's near to nothing if you consider that your cache coherency will skyrock. Oh, and you'll loose some bits due to the interleaving with zeros, but that shouldn't be a problem if you put everything in mmx registers or limit the texture size and subtexel precision a bit.
The only expensive part is to interleave the whole bits with ones and zeros. Unfortunately there is no clever way to do this fast unless your cpu offers a special instruction for such stuff, but you only have to do this once per polygon or once per perspective correction, so it's not that bad really.
Hope that sheds some light on how easy it is to implement texture swizzeling effectively.
Nils
The usual per pixel task for texture mapping is to extract the texture address and then increment or decrement the uv coordinates. Doing the same in morton order can be done somewhat like this:
(I use 4.4 formated fixed point which is way to imprecise to be practical, but it's ok to show the idea behind it)
That's how the UV coordinates look before converting them:
( w = whole bits, f = fraction bits)
u := wwwwffff
v := wwwwffff
Now interleave zero bits into the uv coordinates:
u := w0w0w0wffff
v := w0w0w0wffff
Convert your deltas for U and V for as well, but interleave with ones:
dudx = w1w1w1wffff
dvdx = w1w1w1wffff
And here comes the trick: The deltas can just be added onto the coordinates if you mask out
the zero bit gaps after addition. This makes the addition work directly in morton order.
Some code:
for (int x = left; x < right; x++)
{
// get texture address in morton order:
int addr = (u>>4)|((v>>4)<<1)
putpixel (x,y,texture[addr], yadda yadda)
// increment uv coordinates per pixel.
u = u + dudx;
v = v + dvdx;
// mask out the gap-bits
u &= 010101011111b;
v &= 010101011111b;
}
The netto cost per pixel are two AND operations. That's near to nothing if you consider that your cache coherency will skyrock. Oh, and you'll loose some bits due to the interleaving with zeros, but that shouldn't be a problem if you put everything in mmx registers or limit the texture size and subtexel precision a bit.
The only expensive part is to interleave the whole bits with ones and zeros. Unfortunately there is no clever way to do this fast unless your cpu offers a special instruction for such stuff, but you only have to do this once per polygon or once per perspective correction, so it's not that bad really.
Hope that sheds some light on how easy it is to implement texture swizzeling effectively.
Nils
My music: http://myspace.com/planetarchh <-- my music
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
My stuff: torus.untergrund.net <-- some diy electronic stuff and more.
#5
Posted 22 June 2007 - 05:14 AM
Nils Pipenbrinck said:
A neat advantage of this order is, that you can implement affine texture mapping with nearly zero overhead.
Quote
The netto cost per pixel are two AND operations. That's near to nothing if you consider that your cache coherency will skyrock. ... The only expensive part is to interleave the whole bits with ones and zeros.
Quote
...once per perspective correction, so it's not that bad really.
Still a nice bit-hacking trick though...
#6
Posted 28 September 2007 - 12:57 AM
The funny things is that while perfect interleaving with Morton numbers seems ideally the most efficient, I am not sure that pratically it is the best in term of performance.(I ignore the cost of encoding the numbers here for software implementation)
As an example, let's say that your texture is drawn most of the time with a rotation close to -10, +10 deg, you prefer to have a swizzling scheme that make a favor for "mostly horizontal textures" thus having less misses when you walk closely to the horizontal in the texture.
Another example is that if you draw your image perfectly horizontally, the natural encoding of a bitmap is the most efficient. This prooves that a swizzling scheme really depends on the "average" orientation of textures inside a scene.(in nearest to simplify the talk)
Moreover, the size of a cache line is something to consider also.
(ex. 64 bytes/cache line on ARM cpu)
What could be interesting is to write a small piece of code that simulate the following :
- Time to fill a line of cache.
- Number of cache miss.
- Various cache size.
- Different swizzling scheme.
- Rendering at different angles.
And find what would be the most efficient for a given "architecture".
I would bet that people at NVidia or ATI have done that already,
as swizzling itself in hardware is most likely to cost zero (swap bit = route the wire differently)
By doing that, even in software you can probably find a swizzling scheme that
is quite efficient (= minimize the encoding cost also, maximize cache usage)
It could be funny to have your computer run the simulation for days and have 42 as answer
As an example, let's say that your texture is drawn most of the time with a rotation close to -10, +10 deg, you prefer to have a swizzling scheme that make a favor for "mostly horizontal textures" thus having less misses when you walk closely to the horizontal in the texture.
Another example is that if you draw your image perfectly horizontally, the natural encoding of a bitmap is the most efficient. This prooves that a swizzling scheme really depends on the "average" orientation of textures inside a scene.(in nearest to simplify the talk)
Moreover, the size of a cache line is something to consider also.
(ex. 64 bytes/cache line on ARM cpu)
What could be interesting is to write a small piece of code that simulate the following :
- Time to fill a line of cache.
- Number of cache miss.
- Various cache size.
- Different swizzling scheme.
- Rendering at different angles.
And find what would be the most efficient for a given "architecture".
I would bet that people at NVidia or ATI have done that already,
as swizzling itself in hardware is most likely to cost zero (swap bit = route the wire differently)
By doing that, even in software you can probably find a swizzling scheme that
is quite efficient (= minimize the encoding cost also, maximize cache usage)
It could be funny to have your computer run the simulation for days and have 42 as answer
1 user(s) are reading this topic
0 members, 1 guests, 0 anonymous users












