We need to calculate \(o(b/a)\), where \(a,b\) are binary floating point numbers and \(o\) is one of RN, RD, RU, RZ.
We have a useful result:
Let \(\beta\) be prime, and let \(a,b\) be floating point numbers of precision \(p\). Let \(q=b/a\). Then either
- \(q\) cannot be represented exactly with any finite number of digits, or
- \(q\) can be represented exactly with precision \(p\) (ignoring exponent limits).
Proof:
Assume, for contradiction, that \(q\) is neither (i.e. it can be represented with a finite number of digits, but needs more than \(p\) of them). Then there exist integers \(Q\) and \(e_{q}\) such that \(q=Q\beta^{e_{q}-p+1}\), with \(Q>\beta^{p}\) and \(Q\) not a multiple of \(\beta\).
Let \(A,B\) be the integral significands of \(a,b\), and \(e_{a},e_{b}\) the exponents. Then \(b=aq\) gives:
\[
B\beta^{e_{b}-p+1}=A\beta^{e_{a}-p+1}\cdot Q\beta^{e_{q}-p+1},
\qquad\text{i.e.}\qquad
B=AQ\beta^{e_{q}+e_{a}-e_{b}-p+1}.
\]
Now if \(e=e_{q}+e_{a}-e_{b}-p+1\) is positive or zero, then \(B=AQ\beta^{e}\ge Q>\beta^{p}\), which contradicts \(B<\beta^{p}\) (recall \(B\) is a precision-\(p\) integral significand). If \(e<0\), then \(B\beta^{e'}=AQ\) where \(e'=-e=p-1+e_{b}-e_{a}-e_{q}>0\). Now \(\beta\) does not divide \(Q\), so \(\beta^{e'}\) divides \(A\) (primality is needed here). Thus \(B\) is a multiple of \(Q\), and so \(B\ge Q>\beta^{p}\).
Both ways, we contradict \(B<\beta^{p}\).
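A toy check (my own example, not from the book), with \(\beta=2\), \(p=3\) (significands \(1.00_2\) through \(1.11_2\)):
\[
\begin{aligned}
&b = 1.10_2\times 2^1 = 3,\quad a = 1.10_2 = \tfrac{3}{2}: && b/a = 2 = 1.00_2\times 2^1 \text{ (exact, fits in } p \text{ digits)};\\
&b = 1.11_2 = \tfrac{7}{4},\quad a = 1.10_2 = \tfrac{3}{2}: && b/a = \tfrac{7}{6} = 1.0010101\ldots_2 \text{ (never terminates)}.
\end{aligned}
\]
What never happens is the middle ground, e.g. a quotient of \(9/8=1.001_2\) (terminating, but needing 4 digits): one can check that \(9A/8\) never lands on a 3-bit integral significand times a power of 2 for \(A\in\{4,\dots,7\}\).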
Now a consequence of this is that in base 2 the quotient is never exactly halfway between two consecutive floating point numbers: such a midpoint is representable with \(p+1\) digits but not with \(p\), which is precisely the case the theorem rules out.
A consequence of all this in binary is that, since there are only finitely many possible quotients, there are exclusion zones around the midpoints of floating point numbers (and around the floating point numbers themselves, unless the quotient hits one exactly) where a quotient cannot appear. Below are two such lemmas:
Let \(\beta=2\), and let \(a,b\) be floating point numbers such that \(1\le a,b<2\). If \(c\) is either a floating point number or the exact midpoint between two consecutive ones, then either \(b/a=c\) or \(|b/a-c|>2^{-2p-1}\).
No proof. (I didn’t copy the book’s exact bound; the constant above is my reconstruction, sketched below.)
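Sketch of where such a bound comes from (my reconstruction, not the book’s proof): every such \(c\) in \((1/2,2)\) is an integer multiple of \(2^{-p-1}\), say \(c=k\cdot 2^{-p-1}\), and writing \(a=A\cdot 2^{-p+1}\), \(b=B\cdot 2^{-p+1}\) gives
\[
b-ac = 2^{-2p}\left(2^{p+1}B-Ak\right),
\]
an integer multiple of \(2^{-2p}\). If \(b/a\ne c\), that integer is nonzero, so \(|b-ac|\ge 2^{-2p}\), and since \(a<2\),
\[
|b/a-c|=\frac{|b-ac|}{a}>2^{-2p-1}.
\]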
Another one is:
Let \(\beta=2\), and let \(a,b\) be floating point numbers such that \(1\le a,b<2\). If \(c\) is the exact midpoint between two consecutive floating point numbers, then \(b/a\ne c\) outright (the quotient is never a midpoint), so the bound holds unconditionally: \(|b/a-c|>2^{-2p-1}\).
The book proves it, but I didn’t bother copying the proof. (Again, the constant is my reconstruction.)
When computing \(b/a\), these exclusion zones let us conclude that rounding a sufficiently accurate approximation gives the correctly rounded value (i.e. they tell us when not to waste time on more iterations). This is often used when you are computing in a wider format and plan to convert to the narrower format.
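For example (my illustration, leaning on exactly this kind of exclusion-zone argument): for binary32 targets, one binary64 division plus a narrowing conversion suffices, because \(53\ge 2\cdot 24+2\), so the binary64 quotient can never sit close enough to a binary32 midpoint for the second rounding to go wrong.

```c
/* Sketch, not from the book: correctly rounded binary32 division via one
   binary64 division. Operands convert to double exactly; the exclusion
   zones make the double rounding (to double, then to float) innocuous. */
static float div_float_via_double(float b, float a) {
    return (float)((double)b / (double)a);
}
```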
When you are not in a wider format, the following is useful:
Assume \(\beta=2\), and let \(a,b\) be floating point numbers such that \(1\le a,b<2\). If:
- \(q\) is a faithful approximation to \(b/a\) (does this need to be within 1 ulp?),
- \(y\) approximates \(1/a\) with a relative error less than \(2^{-p}\),
- we compute \(r=o(b-aq)\) and \(q'=o(q+ry)\), where \(o\) is one of RN, RD, RU, RZ,

Then \(q'\) is exactly \(o(b/a)\).
Note that these are FMA operations.
The proof is in the book. I didn’t go through it in full detail.
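As a minimal sketch in C (assuming round-to-nearest and that \(q,y\) already meet the theorem’s hypotheses):

```c
#include <math.h>

/* One FMA-based correction step: r = o(b - a*q), then q' = o(q + r*y).
   q must be a faithful approximation to b/a, and y must approximate 1/a
   with relative error below 2^-p, per the theorem above. */
static double correct_quotient(double a, double b, double q, double y) {
    double r = fma(-q, a, b);  /* residual b - a*q, one FMA */
    return fma(r, y, q);       /* corrected quotient, one FMA */
}
```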
The main use of the theorem is to calculate \(b/a\) once you have \(1/a\) computed very accurately. So we must make sure we can calculate \(1/a\) very accurately (or to be precise, to calculate \(\RN(1/a)\) exactly).
Let \(\beta=2\), and let \(y\) be an approximation to \(1/a\) with an error less than \(\ulp(1/a)\). Compute \(r=\RN(1-ay)\) and \(y'=\RN(y+ry)\). Then \(y'=\RN(1/a)\) exactly, as long as the integral significand of \(a\) is not all 1s (i.e. not \(2^{p}-1\)).
This is not proven in the book.
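The same step as a C sketch (assuming round-to-nearest, \(|y-1/a|<\ulp(1/a)\), and that the all-1s significand case is handled elsewhere):

```c
#include <math.h>

/* Reciprocal clean-up: refine y ~ 1/a into exactly RN(1/a). */
static double refine_reciprocal(double a, double y) {
    double r = fma(-a, y, 1.0); /* r = RN(1 - a*y) */
    return fma(r, y, y);        /* y' = RN(y + r*y) = RN(1/a) */
}
```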
So, in the big picture, how do we compute \(b/a\)?
First, to get the initial \(y\approx 1/a\), we use the parallel iteration and then switch to the non-parallel but self-correcting iteration, doing everything in round-to-nearest mode. One must do a very careful error analysis to show that we end up within one ulp of \(1/a\). Then apply the theorem above (assuming the significand of \(a\) is not all 1s, which is handled as a special case).
Then multiply \(\RN(1/a)\) by \(b\) to get an approximation to the quotient, and refine it with the last iteration from the previous section. We again need a careful analysis to ensure we are within \(\ulp(b/a)\); this is the \(q\) in the theorem above.
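To make the shape concrete, here is an end-to-end sketch in C (binary64, round-to-nearest). The 48/17 seed, the iteration count, and the use of only the self-correcting iteration (no parallel phase) are my simplifications; none of the careful error analyses mentioned above are reproduced here.

```c
#include <math.h>

/* Division sketch for a, b in [1,2): reciprocal by Newton-Raphson,
   clean-up to RN(1/a), then the final FMA correction. */
static double divide_sketch(double b, double a) {
    /* classic linear seed for 1/a on [1,2): max relative error about 1/17 */
    double y = 48.0 / 17.0 - (32.0 / 17.0) * a;
    /* self-correcting iteration y <- y*(2 - a*y); the error roughly
       squares each step, so four steps suffice mathematically at p = 53 */
    for (int i = 0; i < 4; i++)
        y = y * fma(-a, y, 2.0);
    /* clean-up: y becomes RN(1/a) (all-1s significand handled separately) */
    double r = fma(-a, y, 1.0);
    y = fma(r, y, y);
    /* initial quotient, then the correction step from the theorem above */
    double q = y * b;
    r = fma(-q, a, b);
    return fma(r, y, q);
}
```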
Now, this whole section assumed \(a,b\in[1,2)\). Can all other values be reduced to this range? What if \(a\) and \(b\) are several orders of magnitude apart? Note that with these assumptions the maximum value of \(b/a\) is under 2 and the minimum is over 0.5.
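One plausible reduction (my sketch, not from the book, for finite nonzero operands): scaling by powers of two is exact, so \(b/a=(b_s/a_s)\cdot 2^{e_b-e_a}\) with \(a_s,b_s\in[1,2)\), and the magnitude mismatch moves entirely into the exponent. The final scaling is exact too, unless the result under- or overflows.

```c
#include <math.h>

/* Reduce arbitrary finite nonzero operands to the [1,2) core case. */
static double divide_scaled(double b, double a) {
    int ea, eb;
    double as = 2.0 * frexp(a, &ea);  /* as in [1,2), a = as * 2^(ea-1) */
    double bs = 2.0 * frexp(b, &eb);  /* bs in [1,2), b = bs * 2^(eb-1) */
    double q = bs / as;               /* stand-in for the [1,2) core routine */
    return ldexp(q, eb - ea);         /* exact unless the exponent leaves range */
}
```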
Aside
I have an open question I should work out before I move to another section. Let \(a,b\) be floating point numbers. Is the largest floating point number less than \(\frac{a}{2}\ulp\left(\frac{b}{a}\right)\) equal to:
\[
\frac{a-\ulp(a)}{2}\ulp\left(\frac{b}{a}\right)?
\]
In other words, is there a sort of linearity with ulps? Is the largest floating point number less than \(ab\) equal to \((a-\ulp(a))b\)? When does linearity break down?
I’m guessing it may not be a general linearity rule (I did not explore it). For this specific example, I showed that it holds when \(a,b\in[1,2)\); I don’t want to claim it holds in general. Note that \(\frac{a}{2}\ulp\left(\frac{b}{a}\right)\) is itself a floating point number (assuming the exponent stays within range).
Why did I even ask this? It was used in a proof. Staring at it again, it does seem very interesting that subtracting the ulp led to the next lower floating point number, and once again I’m wondering whether there is a general rule. I suspect it worked out easily in this case because all the scale factors are powers of \(\beta\).
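A quick empirical poke (my own check in binary64, not a proof; `nextafter` stands in for the predecessor and for ulp, and the test pairs are arbitrary): \(\frac{1}{2}\ulp(b/a)\) is a power of two, and multiplying by a power of two is exact, so the predecessor should commute with the scaling for \(a\) strictly inside \((1,2)\).

```c
#include <math.h>
#include <stdio.h>

/* Check: is the predecessor of a*K equal to (a - ulp(a))*K, where
   K = (1/2) ulp(b/a)? Restricted to a strictly inside (1,2). */
int main(void) {
    double pairs[][2] = {{1.5, 1.2}, {1.25, 1.875}, {1.0000001, 1.9999999}};
    for (int i = 0; i < 3; i++) {
        double a = pairs[i][0], b = pairs[i][1];
        double q = b / a;
        double K = 0.5 * (nextafter(q, INFINITY) - q); /* (1/2) ulp(b/a) */
        double ulp_a = nextafter(a, INFINITY) - a;     /* ulp(a) */
        double lhs = nextafter(a * K, 0.0);            /* predecessor of a*K */
        double rhs = (a - ulp_a) * K;                  /* "linear" prediction */
        printf("pair %d: %s\n", i, lhs == rhs ? "match" : "MISMATCH");
    }
    return 0;
}
```

At a binade boundary like \(a=1\) the predecessor gap halves, so the naive \(a-\ulp(a)\) prediction lands one step too low (at least with this `nextafter`-style ulp), which looks like one place the linearity can break.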