Optimization tricks: Fast sqrt(x^2 + y^2)

April 10, 2011

[Before I get to the boring (hah!) numerics stuff, some news: In order to test KDE Games on a touchscreen, and because I’ve wanted a tablet for a long time now, and because I want a 3G device, I’ve now ordered a WeTab 2. More news on this coming up soon when I have it in my hands.]

I’ve been optimizing my favorite jigsaw puzzle game Palapeli lately. Today, I looked at the bevelmap generation algorithm which renders the nice three-dimensional shadows atop the jigsaw pieces. There are a lot of norm computations in there, i.e. sqrt(x^2+y^2). All bells are ringing: There must be a way to optimize this.

Now nearly everyone who deals with numerics and programming probably knows about the fast inverse square root, which approximates 1/sqrt(x) very efficiently and with sufficient precision (about 1% maximum error). Taking the reciprocal of this fast inverse square root does not really give an improvement, though: the formula 1/InvSqrt(x*x+y*y) contains at least four floating-point operations plus everything that happens inside InvSqrt.

This approach is especially inefficient because, in my special case, x and y are integers, and the result is also needed as an integer. Looking around the interwebs, I’ve found this lovely page on square root calculation, which also has a special algorithm for norms.

It all starts with two approximations to sqrt(x^2+y^2). The first one is max(abs(x), abs(y)), the second one is 1/sqrt(2)*(abs(x) + abs(y)). We need both because they are good for different values of x and y. To understand this, consider the unit circles of these approximations, that is: the set of x/y points where the approximation is 1. (You will probably want to make a sketch as you follow the discussion. I would have done one if it wasn’t so late already.) The unit circle of the first approximation is an upright square, while the second approximation’s unit circle is diamond-shaped (i.e. rotated by 45° compared to the first one).

An approximation is good when its unit circle comes near or touches the actual unit circle. For the first approximation, this happens when either x or y vanishes. The second approximation, on the other hand, is good when x and y are approximately equal. Now one can show that the better approximation is always the bigger of the two. This means that we can define a third approximation by taking the maximum of both approximations. The unit circle of this combined approximation is a regular octagon whose inscribed circle is the actual unit circle.

The approximation is thus always too small: the octagon lies outside the real unit circle, so except for the eight touching points of octagon and real unit circle, where approximation and exact result are identical, the approximated norm falls short of the exact one. However, one can find a factor z > 1 such that the octagon, scaled by 1/z, lies inside the circle. Scaling the approximation by (1+z)/2 thus distributes the approximation error to both the negative and the positive side. Taking all these considerations into account, I’ve arrived at the following implementation:

template<typename T> struct fastnorm_helper { typedef T fp; };
template<> struct fastnorm_helper<int> { typedef qreal fp; };

//Approximates sqrt(x^2 + y^2) with a maximum error of about 4%.
//Speedup on my machine is 30% for unoptimized and up to 50% for optimized builds.
template<typename T> inline T fastnorm(T x, T y)
{
    //need absolute values for distance measurement
    x = qAbs<T>(x); y = qAbs<T>(y);
    //There are two simple approximations to sqrt(x^2 + y^2):
    //  max(x, y) -> good for x >> y or x << y          (square-shaped unit circle)
    //  1/sqrt(2) * (x + y) -> good for x = y (approx.) (diamond-shaped unit circle)
    //The better approximation is always the bigger one. By taking the maximum
    //of both, we get an approximation whose unit circle is an octagon
    //circumscribing the real one. The approximation is then refined by a
    //prefactor (called a below).
    typedef typename fastnorm_helper<T>::fp fp;
    static const fp a = (1 + qSqrt(4 - 2 * qSqrt(2))) / 2;
    static const fp b = qSqrt(0.5);
    const T metricOne = qMax<T>(x, y);
    const T metricTwo = b * (x + y);
    return a * qMax(metricOne, metricTwo);
    //idea for this function from http://www.azillionmonkeys.com/qed/sqroot.html
    //(though the equations there have some errors at the time of this writing)
}

[License: I grant any entity the right to use the above code snippet for any purpose, without any conditions, unless such conditions are required by law. (This is the best approximation to placing the code in the public domain, which is not possible in my jurisdiction.)]

I have tested this function successfully on random input with T = double, float, int32_t. For integer types, this function needs only two floating-point multiplications, and thus runs much much faster than the trivial implementation and the InvSqrt-based solution. So please take this as my contribution towards Green IT, and reduce the time your computer spends calculating norms.

P.S. The extension of the above function to three-dimensional coordinates is left as an exercise for the reader. The only hurdle is to calculate the new prefactor z.

Update: As correctly pointed out by the commenters, if all you want is to compare the norm to some radius, e.g. sqrt(x*x + y*y) > r*r, the easiest solution is to avoid the square root and test for x*x + y*y > r*r instead.

13 Responses to “Optimization tricks: Fast sqrt(x^2 + y^2)”

1. Andreas Says:

Often you can square the other side of the (in)equation and avoid the square root altogether. x*x is cheap.

2. christoph Says:

I use this:

static int fastnorm(int x, int y)
{
    x = x < 0 ? -x : x;
    y = y < 0 ? -y : y;
    if (x > y) {
        int d = y - (x >> 2);
        if (d < 0)
            return x;
        else
            return x + (d >> 1);
    } else {
        int d = x - (y >> 2);
        if (d < 0)
            return y;
        else
            return y + (d >> 1);
    }
}

3. christoph Says:

Argh, stupid HTML tags … use http://paste.kde.org/10468/

4. ChALkeR Says:

[License: I grant any entity the right to use the above code snippet for any purpose, without any conditions, unless such conditions are required by law. (This is the best approximation to placing the code in the public domain, which is not possible in my jurisdiction.)]

There always is the CC0 license.

5. Inge Wallin Says:

If you have many values to calculate that are near each other, I’m pretty sure you will be able to find a way to calculate values from nearby values very fast.

6. Thiago Macieira Says:

You should save the square roots of constants as constants too. Your statics aren’t good enough, since the value cannot be computed at compile time. That means your code has two mutex locks under GCC…

You should hardcode both a and b.

• Esben Mose Hansen Says:

I actually thought that gcc would be able to precalculate the values in this case, but you are right: it fails, and thus you get the two mutexes. I’ll dare to paste the disassembled code below. (I added a main that returns fastnorm(2.0, 3.0), and replaced all the Qt functions with their stdlib counterparts.) Note the calls to __cxa_guard.. blah blah. Those are essentially mutexes.

0x0000000000400710 : sub $0x8,%rsp
=> 0x0000000000400714 : cmpb $0x0,0x2004bd(%rip) # 0x600bd8
0x000000000040071b : je 0x400740
0x000000000040071d : movsd 0x16b(%rip),%xmm0 # 0x400890
0x0000000000400725 : mulsd 0x2004b3(%rip),%xmm0 # 0x600be0
0x000000000040072d : add $0x8,%rsp
0x0000000000400731 : cvttsd2si %xmm0,%eax
0x0000000000400735 : retq
0x0000000000400736 : nopw %cs:0x0(%rax,%rax,1)
0x0000000000400740 : mov $0x600bd8,%edi
0x0000000000400745 : callq 0x4005c0
0x000000000040074a : test %eax,%eax
0x000000000040074c : je 0x40071d
0x000000000040074e : movabs $0x3ff0a8bd3ded9c4a,%rax
0x0000000000400758 : mov $0x600bd8,%edi
0x000000000040075d : mov %rax,0x20047c(%rip) # 0x600be0
0x0000000000400764 : callq 0x400610
0x0000000000400769 : jmp 0x40071d

• Esben Mose Hansen Says:

Of course, those calls are in brackets and thus eaten by the blog, but it is the two callq’s you see at 0x400745 and 0x400764.

Cool idea, though, I will definitely have to remember it.

7. Kevin Kofler Says:

Do we really need the floating-point multiplications?

I’d suggest replacing b * (x + y) by something like (46341 * (x + y)) >> 16 and a * qMax(metricOne, metricTwo) by something like (68236 * qMax(metricOne, metricTwo)) >> 16.

If this overflows, try using qlonglong (might still be faster than floating point) or changing the computations to reduce the chance of overflow, e.g.:
T metricMax = qMax(metricOne, metricTwo);
metricMax += (metricMax * 2700) >> 16;
return metricMax;

8. Kevin Kofler Says:

Oh, of course it doesn’t make sense to multiply by an even number just to shift it off (unless you’re on a CPU like the Motorola 68000 where shifts by 16 are special) so:
return (17059 * qMax(metricOne, metricTwo)) >> 14;
or:
T metricMax = qMax(metricOne, metricTwo);
metricMax += (metricMax * 675) >> 14;
return metricMax;
will reduce the chance of overflows for free, too.

9. The User Says:

sqrt(x*x + y*y) > r*r

r, not r·r. 😉

10. Johannes Löhnert Says:

Now that’s a neat trick! Simple once you know it… 🙂 How does it show loading-time wise?

Originally my plan was to replace the filter kernel with a feedback filter like the “fast-blur” algorithm at some point. But you know how it is once something works. 😦

On a side note, there is also still potential for memory optimization – the run-length encoding(*) of the “zero runs” in the bevel map is only used to fast-forward the iteration index in applyBevelMap(); the zeros themselves are still stored. They could be skipped when generating the map. The downside is that the required bevelmap size becomes unpredictable. And considering shadow and piece pixmaps, the total gain is not that overwhelming.

(*) see piecevisuals.h:37

11. Laurent Magdelaine Says:

Have you ever tried this?
if a >= b >= 0
then sqrt( a*a + b*b ) ~= a + (1 + 1/4) * ( b*b / (2*a + b) )