Let \(a,b\) be two floating-point numbers. It can be shown that \((a+b)-\RN(a+b)\) is itself a floating-point number.
This may fail under other rounding modes, and it assumes no overflow occurs.
Let \(\beta\le3\), and assume correctly rounded arithmetic with round-to-nearest. Let \(a,b\) be floating-point numbers with \(|a|\ge|b|\). Then consider the following algorithm:
s = RN(a + b)
z = RN(s - a)
t = RN(b - z)
This is the Fast2Sum algorithm: it has the property that \(s+t=a+b\) exactly.
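A minimal Python sketch of this algorithm (doubles are binary64, so \(\beta=2\) and round-to-nearest applies by default; the function name `fast2sum` is my own choice):

```python
from fractions import Fraction

def fast2sum(a, b):
    """Fast2Sum: assumes |a| >= |b|, radix <= 3, no overflow."""
    s = a + b   # s = RN(a + b)
    z = s - a   # recovers the part of b absorbed into s (exact)
    t = b - z   # t = the rounding error of the sum (exact)
    return s, t

# 1.0 is lost when added to 1e16, but t recovers it exactly:
s, t = fast2sum(1e16, 1.0)
# check s + t == a + b with exact rational arithmetic
exact = Fraction(s) + Fraction(t) == Fraction(1e16) + Fraction(1.0)
```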
Now in reality, we don't always know whether \(|a|\ge|b|\), and branching to check is expensive. The following algorithm, known as 2Sum, gives the correct result in all cases, in any base, as long as there is no overflow.
s = RN(a + b)
a' = RN(s - b)
b' = RN(s - a')
Da = RN(a - a')
Db = RN(b - b')
t = RN(Da + Db)
Think of Da as \(\delta_{a}\), the part of \(a\) not captured by \(a'\). It has also been shown that this algorithm is optimal in its number of floating-point operations (six).
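The branch-free algorithm above can be sketched in Python as follows (the names `da`, `db` stand for \(\delta_a,\delta_b\); `two_sum` is my own name for it):

```python
from fractions import Fraction

def two_sum(a, b):
    """2Sum: works for any ordering of |a|, |b|; assumes no overflow."""
    s = a + b     # s  = RN(a + b)
    ap = s - b    # a' = RN(s - b)
    bp = s - ap   # b' = RN(s - a')
    da = a - ap   # Da = RN(a - a'), error in a'
    db = b - bp   # Db = RN(b - b'), error in b'
    t = da + db   # t  = RN(Da + Db)
    return s, t

# note the "wrong" order (|a| < |b|) is handled without a branch:
s, t = two_sum(1.0, 1e16)
exact = Fraction(s) + Fraction(t) == Fraction(1.0) + Fraction(1e16)
```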
So what is the point of all this? Well, if \(a+b\) cannot be represented as a floating-point number, then \(t\) gives us exactly the rounding error of the sum, \(t=(a+b)-\RN(a+b)\).
We can do something similar for the product of two floating-point numbers \(a,b\), provided that \(e_{a}+e_{b}\ge e_{\min}+p-1\). Let \(\pi=\RN(ab)\) and \(\rho=\RN(ab-\pi)\), where \(\rho\) is computed with a single FMA instruction, \(\rho=\operatorname{fma}(a,b,-\pi)\). Then \(\pi+\rho=ab\) exactly. In fact, under this exponent condition \(ab-\pi\) is itself a floating-point number, so the algorithm works with every rounding function. It is called the Fast2MultFMA algorithm, and as before it assumes no overflow.
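A sketch in Python: since `math.fma` only exists in Python 3.13+, this version emulates the single correctly rounded FMA step with exact rational arithmetic (that emulation is an assumption of the sketch, not how a real implementation would compute \(\rho\)):

```python
from fractions import Fraction

def fast2mult(a, b):
    """Fast2MultFMA sketch, assuming e_a + e_b >= e_min + p - 1
    and no overflow. rho should be fma(a, b, -pi); here the exact
    product/difference is formed with Fraction and rounded once by
    float(), which rounds to nearest, mimicking the FMA."""
    pi = a * b  # pi = RN(ab)
    rho = float(Fraction(a) * Fraction(b) - Fraction(pi))
    return pi, rho

# (1 + 2^-30)^2 = 1 + 2^-29 + 2^-60 is not representable in binary64;
# pi gets 1 + 2^-29 and rho recovers the 2^-60 tail exactly
a = 1.0 + 2.0**-30
pi, rho = fast2mult(a, a)
```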