## Ranges

The normal range is the set of real numbers \(x\) with \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\), and the subnormal range is where \(|x|<\beta^{e_{\textit{min}}}\).
Note that these aren’t sets of floating point values; they are sets of real numbers.

## Relative Error

For a given rounding function \(o(x)\), the relative error is

\[\epsilon(x)=\frac{|x-o(x)|}{|x|}\]

for nonzero \(x\), and 0 if \(x\) is zero.
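To make this concrete, here is a minimal Python sketch, assuming IEEE binary64 (\(\beta=2\), \(p=53\)) and round to nearest. The helper name `rel_error` is mine, not from the text; `float()` on a `Fraction` rounds correctly to the nearest double, and `Fraction(float)` recovers that double exactly.

```python
from fractions import Fraction

def rel_error(x: Fraction) -> Fraction:
    """Relative error |x - o(x)| / |x| when o is round-to-nearest binary64."""
    if x == 0:
        return Fraction(0)
    rounded = Fraction(float(x))  # float() rounds a Fraction to the nearest double
    return abs(x - rounded) / abs(x)

print(rel_error(Fraction(3, 4)))        # 3/4 is exactly representable -> 0
print(rel_error(Fraction(1, 10)) > 0)   # 1/10 is not -> True
```

Working in `Fraction` keeps every intermediate quantity exact, so the error we measure is the true rounding error, not an approximation of it.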
Now if \(x\) is in the normal range, then the bounds on the relative error are:
- \(\epsilon<\frac{1}{2}\beta^{1-p}\) for RN
- \(\epsilon<\beta^{1-p}\) for RD, RU and RZ
Proof:
\(x\) is nonzero, as it is in the normal range. If \(x\) is a floating point number, \(\epsilon(x)=0\). Otherwise, \(|x|\in\left(\beta^{e},\beta^{e+1}\right)\) for some \(e\). The distance between two consecutive floating point numbers in this set is \(\beta^{e-p+1}\). Thus, we know that \(|x-\textit{RN}(x)|\le\frac{1}{2}\beta^{e-p+1}\) and \(|x-o(x)|\le\beta^{e-p+1}\) for directed rounding. To get the relative error, divide by \(|x|\), noting that the smallest possible value of \(|x|\) is greater than \(\beta^{e}\); this yields \(\epsilon(x)<\frac{1}{2}\beta^{1-p}\) for RN and \(\epsilon(x)<\beta^{1-p}\) for the directed modes.
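A quick numerical spot-check of the RN bound, again assuming binary64 so that \(\frac{1}{2}\beta^{1-p}=2^{-53}\) (the constant name `HALF_ULP` and the sampling scheme are mine):

```python
import random
from fractions import Fraction

HALF_ULP = Fraction(1, 2**53)  # (1/2)·β^(1-p) for binary64 (β = 2, p = 53)

random.seed(1234)
for _ in range(10_000):
    # Random positive rationals; these land comfortably in the normal range.
    x = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    err = abs(x - Fraction(float(x))) / x  # float() rounds to nearest
    assert err < HALF_ULP
print("RN bound holds on all samples")
```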
The unit roundoff \(\textbf{u}\) for a base \(\beta\), precision \(p\) floating point system is:

\[\textbf{u}=\frac{1}{2}\beta^{1-p}\]

for RN, and:

\[\textbf{u}=\beta^{1-p}\]

for the directed rounding modes.
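In Python, the binary64 values of these constants can be read off `sys.float_info`; note that `sys.float_info.epsilon` is the machine epsilon \(\beta^{1-p}=2^{-52}\), i.e. the directed-rounding \(\textbf{u}\), and the RN \(\textbf{u}\) is half of it (variable names are mine):

```python
import sys

beta = sys.float_info.radix          # 2 for binary64
p = sys.float_info.mant_dig          # 53
u_directed = float(beta) ** (1 - p)  # β^(1-p), for RD/RU/RZ
u_rn = u_directed / 2                # (1/2)·β^(1-p), for RN

print(u_directed == sys.float_info.epsilon)  # machine epsilon is β^(1-p) -> True
print(u_rn == 2.0 ** -53)                    # -> True
```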
Let \(x\) be in the normal range. Then the relative error \(\epsilon(x)\) when rounding \(x\) to the nearest satisfies:

\[\epsilon(x)\le\frac{\textbf{u}}{1+\textbf{u}}\]
To prove this, we split into two cases:
Case I, when \(|x|\ge\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\), and Case II when \(\beta^{e}\le|x|<\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\)
Before diving into the proof, let’s discuss what these two cases represent. We know that a floating point number is of the form \(M\beta^{e-p+1}\), where \(M\) is a \(p\)-digit integer in base \(\beta\). For the number to be normal, the leading digit of \(M\) must be nonzero. The smallest normal number for a given \(e\) is therefore the one with \(M=100\dots0=\beta^{p-1}\), giving \(\beta^{p-1}\beta^{e-p+1}=\beta^{e}\). Let’s call this number \(f\).
Now what is the second smallest normal number for a given \(e\) (i.e. the next floating point number)? It has \(M=10\dots01=\beta^{p-1}+1\). The number would be \((\beta^{p-1}+1)\beta^{e-p+1}=\beta^{e}(1+\beta^{1-p})\). Let’s call this number \(g\).
Now Case II above is when \(f\le|x|<g\) with \(|x|\) strictly closer to \(f\) than to \(g\) (i.e. less than halfway between \(f\) and \(g\), so there is no tie). In words, \(|x|\) rounds to the smallest normal number \(f=\beta^{e}\) with the appropriate sign. Case I covers everything else with that exponent.
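For binary64 with \(e=0\), the numbers \(f\) and \(g\) are \(1\) and the next double up, which `math.nextafter` (Python 3.9+) exposes directly. A sketch, with the Case II rounding behavior checked exactly via `Fraction`:

```python
import math
from fractions import Fraction

f = 1.0                      # β^e with β = 2, e = 0
g = math.nextafter(f, 2.0)   # the next double up: β^e·(1 + β^(1-p))
assert g - f == 2.0 ** -52   # the gap is β^(e-p+1)

# A real strictly below the midpoint of [f, g] rounds down to f (Case II):
x = Fraction(1) + Fraction(1, 2**54)
assert float(x) == f
print("gap between f and g:", g - f)
```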
Now Case I is simple to handle:
By assumption, \(|x|\ge\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\). We know that \(|x-\textrm{RN}(x)|\le\frac{1}{2}\beta^{e-p+1}\). Dividing the two quantities gives \(\epsilon(x)\le\frac{\frac{1}{2}\beta^{1-p}}{1+\frac{1}{2}\beta^{1-p}}=\frac{\textbf{u}}{1+\textbf{u}}\), the desired inequality.
For Case II, we have \(\textrm{RN}(x)=\textrm{sign}(x)\cdot f\). Thus \(|x-\textrm{RN}(x)|=|x|-f=|x|-\beta^{e}\). Dividing by \(|x|\) gives \(\epsilon(x)=1-\beta^{e}/|x|\), and since \(\beta^{e}\le|x|<\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\), this is less than \(1-\frac{1}{1+\frac{1}{2}\beta^{1-p}}=\frac{\textbf{u}}{1+\textbf{u}}\).
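The \(\textbf{u}/(1+\textbf{u})\) bound is nearly attained just below the tie between \(f\) and \(g\). A sketch assuming binary64, with exact rationals (the offset \(2^{-80}\) is an arbitrary small nudge of mine):

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # RN unit roundoff for binary64

# x sits just below the tie between f = 1 and g = 1 + 2**-52 (Case II):
x = 1 + u - Fraction(1, 2**80)
rn_x = Fraction(float(x))   # float() rounds to nearest
assert rn_x == 1            # rounds down to f
eps = (x - rn_x) / x
assert eps <= u / (1 + u)                       # the bound holds...
assert eps > u / (1 + u) - Fraction(1, 2**79)   # ...and is nearly attained
print("u/(1+u) bound is sharp near the tie")
```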
For a subnormal number, the relative error can be anywhere from 0 to 1.

## Standard Models

Let \(*\in\{+,-,\times,\div\}\) be any of those operations, and let \(o\) be any rounding function we’ve considered. Then for all floating point numbers \(a,b\) such that \(a*b\) neither underflows (i.e. produces an inexact subnormal result) nor overflows, the following are true:

\[o(a*b)=(a*b)(1+\epsilon_{1}),\qquad|\epsilon_{1}|\le\textbf{u}\]

\[o(a*b)=\frac{a*b}{1+\epsilon_{2}},\qquad|\epsilon_{2}|\le\textbf{u}\]

These two properties are called the first and second standard models of floating point arithmetic.
I may be missing something deeper, but the first seems trivially true. \(a*b\) is a number in the normal range, so it follows immediately.
The second one follows if you utilize:

\[\epsilon(x)\le\frac{\textbf{u}}{1+\textbf{u}}\]

Indeed, if \(o(x)=x(1+\epsilon)\) with \(|\epsilon|\le\frac{\textbf{u}}{1+\textbf{u}}\), then writing \(o(x)=x/(1+\epsilon_{2})\) gives \(\epsilon_{2}=-\epsilon/(1+\epsilon)\), so \(|\epsilon_{2}|\le\frac{\textbf{u}/(1+\textbf{u})}{1-\textbf{u}/(1+\textbf{u})}=\textbf{u}\).
But that identity is valid only for round to nearest. What about the other rounding mechanisms?
Here’s a proof from *On Relative Errors of Floating Point Operations*:
For a given number \(z\) in the normal range, \(\textrm{ufp}(z)\le|o(z)|\le\beta\,\textrm{ufp}(z)\), where \(\textrm{ufp}(z)=\beta^{e}\) (the “unit in the first place”) is the smallest floating point value with the same exponent as \(z\).
Now \(|z|\) lies in the interval \([\textrm{ufp}(z),\beta\,\textrm{ufp}(z))\), and this interval contains \((\beta-1)\beta^{p-1}\) floating point values. The distance between any two consecutive values is \(\textbf{u}\,\textrm{ufp}(z)\). Thus \(|z-o(z)|\le\textbf{u}\,\textrm{ufp}(z)\). Divide by \(|o(z)|\) to get:

\[\frac{|z-o(z)|}{|o(z)|}\le\frac{\textrm{ufp}(z)}{|o(z)|}\,\textbf{u}\]
But we showed above that the coefficient of \(\textbf{u}\) is bounded by 1, since \(\textrm{ufp}(z)\le|o(z)|\). This completes the proof: take \(\epsilon_{2}\) so that \(o(z)=z/(1+\epsilon_{2})\); then \(|\epsilon_{2}|=|z-o(z)|/|o(z)|\le\textbf{u}\).
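As a sanity check, here are \(\epsilon_{1}\) and \(\epsilon_{2}\) computed exactly for one concrete operation, a binary64 division (variable names mine; `Fraction` keeps the reference value exact):

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # RN unit roundoff, binary64

a, b = 1.0, 3.0
computed = a / b                     # o(a / b), correctly rounded to nearest
exact = Fraction(a) / Fraction(b)    # a / b with no rounding

# First model:  o(a/b) = (a/b)·(1 + eps1)
eps1 = Fraction(computed) / exact - 1
# Second model: o(a/b) = (a/b) / (1 + eps2)
eps2 = exact / Fraction(computed) - 1

assert abs(eps1) <= u
assert abs(eps2) <= u
print("both standard models hold for 1.0/3.0")
```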
The first statement can be written more tightly as:

\[o(a*b)=(a*b)(1+\epsilon_{1}),\qquad|\epsilon_{1}|\le\frac{\textbf{u}}{1+\textbf{u}}\]

for the case of round to nearest.
Now let’s consider a more general case, where the result of \(x*y\) could be in the subnormal range. So now, as long as there is no overflow:

\[o(x*y)=(x*y)(1+\epsilon)+\epsilon'\]

where \(|\epsilon|\le\textbf{u}\) and \(|\epsilon'|\le\beta^{e_{\textrm{min}}}\,\textbf{u}\).
Specifically, if the result is in the normal range, then \(\epsilon'=0\) and this is just the standard model. If the result is in the subnormal range, then \(\epsilon=0\), and the bound on \(\epsilon'\) follows from the following: