Accurate Computation of the Sum of Two Numbers

Posted by Beetle B. on Thu 21 March 2019

Let \(a,b\) be two floating point numbers. Let \(s\) be \(\RN(a+b)\). Regardless of which number it picks in a tie, it can be shown that \((a+b)-s\) is a floating point number provided \(s\) does not overflow.

I did not prove this all the way, but for the specific case of both being positive normal numbers, with \(a>b\), the idea is that we have a potential significand of \(M_{a}+M_{b}\beta^{\delta}\), where \(\delta=e_{b}-e_{a}\). Now if \(M_{a}+M_{b}\beta^{\delta}\) does not cause us to go to the next exponent, then we have two cases:

  1. \(\delta<-p\). Then the rounded result is just \(a\), and the difference is trivially \(b\). (Note, I’m playing fast an loose here - I didn’t audit if it should be \(-p\) or \(-p+1\), etc).
  2. Otherwise, we’ll have a long infinitely precise significand. Then \(a+b-s\) is just the truncated portion of the significand with a different exponent. The thing to note is that the infinitely precise significand has less than \(2p\) characters. Thus the significand of the difference should “fit” in p characters.

Now I suspect even if we go to the next exponent in the sum, the logic would be the same. Need to also handle subnormals and values of opposite signs.

Note that this need not be the case for other rounding modes.

Fast2Sum

I covered this algorithm elsewhere. Here are some interesting tidbits from the proof.

Let \(a,b\) be two floating point numbers, with \(a>0\) and \(e_{a}\ge e_{b}\). Then intuitively it is clear that when you add the two numbers, the exponent does not exceed \(e_{a}+1\). Formally:

\begin{equation*} \ |a+b|\le2\left(\beta^{p}-1\right)\beta^{e_{a}-p-1}\le\left(\beta^{p}-1\right)\beta^{e_{a}-p+2} \end{equation*}

The 2Sum Algorithm

This has been covered elsewhere. Note that although it has double the operations, it need not be slower. The reason is that Fast2Sum requires a comparison, and could result in a wrong branch prediction which is expensive.

In fact, the lack of comparisons is one of the benefits of this algorithm. On some architectures, this can be parallelized. A comparison can obstruct parallelism.

It has been shown that no RN-addition algorithm that doesn’t involve branching can exist with fewer than 5 operations, if it is to give the same result as 2Sum and be correctly rounded. This was proven by exhaustion.

If we do not use Rounding to Nearest

There do exist algorithms that work for any rounding function.