## Optimization tricks: Fast sqrt(x^2 + y^2)

### April 10, 2011

[Before I get to the boring (hah!) numerics stuff, some news: In order to test KDE Games on a touchscreen, and because I’ve wanted a tablet for a long time now, and because I want a 3G device, I’ve now ordered a WeTab 2. More news on this coming up soon when I have it in my hands.]

I’ve been optimizing my favorite jigsaw puzzle game Palapeli lately. Today, I looked at the bevelmap generation algorithm which renders the nice three-dimensional shadows atop the jigsaw pieces. There are a lot of norm computations in there, i.e. `sqrt(x^2+y^2)`. All bells are ringing: there must be a way to optimize this.

Now nearly everyone who deals with numerics and programming probably knows about the fast inverse square root, which approximates `1/sqrt(x)` very efficiently and sufficiently precisely (about 1% maximum error). Using the reciprocal of this fast inverse square root does not really give an improvement, though. After all, the formula `1/InvSqrt(x*x+y*y)` contains at least four floating-point operations plus everything that happens in InvSqrt.

This approach is especially inefficient because, in my special case, x and y are integers, and the result is also needed as an integer. Looking around the interwebs, I’ve found this lovely page on square root calculation, which also has a special algorithm for norms.

It all starts with two approximations to `sqrt(x^2+y^2)`. The first one is `max(abs(x), abs(y))`, the second one is `1/sqrt(2) * (abs(x) + abs(y))`. We need both because they are good for different values of x and y. To understand this, consider the unit circles of these approximations, that is: the set of x/y points where the approximation is 1. (You will probably want to make a sketch as you follow the discussion. I would have done one if it wasn’t so late already.) The unit circle of the first approximation is an upright square, while the second approximation’s unit circle is diamond-shaped (i.e. rotated by 45° compared to the first one).

An approximation is good when its unit circle comes near or touches the actual unit circle. For the first approximation, this happens when either x or y vanishes. The second approximation, on the other hand, is good when x and y are approximately equal. Now one can show that, under these conditions, the better approximation is always the bigger one. This means that we can define a third approximation by taking the maximum of both approximations. The unit circle of this approximation is a regular octagon whose inscribed circle is the actual unit circle.

Since this octagon lies outside the actual unit circle, the approximated norm is always a bit too small, except at the eight points where octagon and unit circle touch; there, approximation and exact result are identical. However, one can find a factor `z` (the ratio of the octagon’s circumradius to its inradius) such that the octagon, scaled by `1/z`, lies inside the circle. Scaling the approximation by `(1 + z)/2` thus distributes the approximation error evenly between the negative and the positive side. Taking all these considerations into account, I’ve arrived at the following implementation:
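For reference, the prefactor can be derived by looking at an octagon vertex, where both simple approximations agree (a sketch, consistent with the constant `a` in the code):

```latex
% Octagon vertex (w.l.o.g. x \ge y \ge 0): \max(x,y) = \tfrac{x+y}{\sqrt{2}} = 1
% gives x = 1,\; y = \sqrt{2} - 1, so the true norm at the vertex is
z = \sqrt{1 + (\sqrt{2} - 1)^2} = \sqrt{4 - 2\sqrt{2}} \approx 1.0824
% and the error-balancing prefactor becomes
a = \frac{1 + z}{2} = \frac{1 + \sqrt{4 - 2\sqrt{2}}}{2} \approx 1.0412
```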

```cpp
template<typename T> struct fastnorm_helper { typedef T fp; };
template<> struct fastnorm_helper<int> { typedef qreal fp; };

//Approximates sqrt(x^2 + y^2) with a maximum error of 4%.
//Speedup on my machine is 30% for unoptimized and up to 50% for optimized builds.
template<typename T> inline T fastnorm(T x, T y)
{
    //need absolute values for distance measurement
    x = qAbs<T>(x); y = qAbs<T>(y);
    //There are two simple approximations to sqrt(x^2 + y^2):
    //    max(x, y)           -> good for x >> y or x << y (square-shaped unit circle)
    //    1/sqrt(2) * (x + y) -> good for x = y (approx.)  (diamond-shaped unit circle)
    //The better approximation is always the bigger one. By taking the maximum of
    //both, we get an octagon-shaped lower approximation of the euclidean norm. The
    //approximation can be refined by a prefactor (called a below).
    typedef typename fastnorm_helper<T>::fp fp;
    static const fp a = (1 + qSqrt(4 - 2 * qSqrt(2))) / 2;
    static const fp b = qSqrt(0.5);
    const T metricOne = qMax<T>(x, y);
    const T metricTwo = b * (x + y);
    return a * qMax(metricOne, metricTwo);
    //idea for this function from http://www.azillionmonkeys.com/qed/sqroot.html
    //(though the equations there have some errors at the time of this writing)
}
```

[License: I grant any entity the right to use the above code snippet for any purpose, without any conditions, unless such conditions are required by law. (This is the best approximation to placing the code in the public domain, which is not possible in my jurisdiction.)]

I have tested this function successfully on random input with T = double, float, int32_t. For integer types, this function needs only two floating-point multiplications, and thus runs much much faster than the trivial implementation and the InvSqrt-based solution. So please take this as my contribution towards Green IT, and reduce the time your computer spends calculating norms.

P.S. The extension of the above function to three-dimensional coordinates is left as an exercise for the reader. The only hurdle is to calculate the new prefactor `z`.

**Update:** As correctly pointed out by the commenters, if all you want is to compare the norm to some radius, e.g. `sqrt(x*x + y*y) > r`, the easiest solution is to avoid the square root and test for `x*x + y*y > r*r` instead.

April 11, 2011 at 03:32

Often you can square the other side of the (in)equality and avoid the square root altogether. x*x is cheap.

April 11, 2011 at 04:40

I use this:

```cpp
static int fastnorm(int x, int y)
{
    x = x < 0 ? -x : x;
    y = y < 0 ? -y : y;
    if (x > y) {
        int d = y - (x >> 2);
        if (d < 0) d = 0;
        return x + (d >> 1);
    } else {
        int d = x - (y >> 2);
        if (d < 0) d = 0;
        return y + (d >> 1);
    }
}
```

April 11, 2011 at 04:41

Argh, stupid HTML tags … use http://paste.kde.org/10468/

April 11, 2011 at 06:17

There is always the CC0 license.

April 11, 2011 at 07:51

If you have many values to calculate that are near each other, I’m pretty sure you will be able to find a way to calculate values from nearby values very fast.

April 11, 2011 at 07:51

You should save the square roots of constants as constants too. Your statics aren’t good enough, since the value cannot be computed at compile time. That means your code has two mutex locks under GCC…

You should hardcode both a and b.

April 11, 2011 at 09:40

I actually thought that gcc would be able to precalculate the values in this case, but you are right: it fails, and thus emits the two mutexes. I’ll dare to paste the disassembled code below (I added a main that returns fastnorm(2.0, 3.0), and replaced all the Qt functions with their stdlib counterparts). Note the calls to __cxa_guard… blah blah. Those are essentially mutexes.

```
   0x0000000000400710 : sub $0x8,%rsp
=> 0x0000000000400714 : cmpb $0x0,0x2004bd(%rip)   # 0x600bd8
   0x000000000040071b : je 0x400740
   0x000000000040071d : movsd 0x16b(%rip),%xmm0    # 0x400890
   0x0000000000400725 : mulsd 0x2004b3(%rip),%xmm0 # 0x600be0
   0x000000000040072d : add $0x8,%rsp
   0x0000000000400731 : cvttsd2si %xmm0,%eax
   0x0000000000400735 : retq
   0x0000000000400736 : nopw %cs:0x0(%rax,%rax,1)
   0x0000000000400740 : mov $0x600bd8,%edi
   0x0000000000400745 : callq 0x4005c0
   0x000000000040074a : test %eax,%eax
   0x000000000040074c : je 0x40071d
   0x000000000040074e : movabs $0x3ff0a8bd3ded9c4a,%rax
   0x0000000000400758 : mov $0x600bd8,%edi
   0x000000000040075d : mov %rax,0x20047c(%rip)    # 0x600be0
   0x0000000000400764 : callq 0x400610
   0x0000000000400769 : jmp 0x40071d
```

April 11, 2011 at 09:42

Of course, those calls are in brackets and thus eaten by the blog, but it is the two callq’s you see at 0x400745 and 0x400764.

Cool idea, though, I will definitely have to remember it.

April 11, 2011 at 21:53

Do we really need the floating-point multiplications?

I’d suggest replacing `b * (x + y)` by something like `(46341 * (x + y)) >> 16` and `a * qMax(metricOne, metricTwo)` by something like `(68236 * qMax(metricOne, metricTwo)) >> 16`.

If this overflows, try using qlonglong (might still be faster than floating point) or changing the computations to reduce the chance of overflow, e.g.:

```cpp
T metricMax = qMax(metricOne, metricTwo);
metricMax += (metricMax * 2700) >> 16;
return metricMax;
```

April 11, 2011 at 21:57

Oh, of course it doesn’t make sense to multiply by an even number just to shift it off (unless you’re on a CPU like the Motorola 68000 where shifts by 16 are special) so:

```cpp
return (17059 * qMax(metricOne, metricTwo)) >> 14;
```

or:

```cpp
T metricMax = qMax(metricOne, metricTwo);
metricMax += (metricMax * 675) >> 14;
return metricMax;
```

will reduce the chance of overflows for free, too.

April 12, 2011 at 13:37

r, not r·r. 😉

April 12, 2011 at 17:37

Now that’s a neat trick! Simple once you know it… 🙂 How does it show loading-time wise?

Originally my plan was to replace the filter kernel with a feedback filter like the “fast-blur” algorithm at some point. But you know how it is once something works. 😦

On a side note, there is also still potential for memory optimization – the run-length encoding(*) of the “zero runs” in the bevel map is only used to fast-forward the iteration index in applyBevelMap(); however, the zeros are still actually stored. They could be skipped when generating the map. The downside is that the required bevelmap size becomes unpredictable. And considering shadow and piece pixmaps, the total gain is not that overwhelming.

(*) see piecevisuals.h:37

January 20, 2014 at 12:04

Have you ever tried this?

If `a >= b >= 0`, then `sqrt(a*a + b*b) ≈ a + (1 + 1/4) * (b*b / (2*a + b))`.