## Ranges

The normal range is the set of real numbers \(x\) with \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\), and the subnormal range is where \(|x|<\beta^{e_{\textit{min}}}\).
Note that these aren’t sets of floating point values; they are sets of real numbers.

## Relative Error

For a given rounding function \(o(x)\), the relative error is

\[\epsilon(x)=\frac{|x-o(x)|}{|x|}\]

for nonzero \(x\), and 0 if \(x\) is zero.
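To make this concrete, here is a minimal Python sketch, assuming IEEE binary64 (\(\beta=2\), \(p=53\)) and round to nearest. The helper name `rel_error` is mine, not from the text; `float()` on a `Fraction` rounds correctly to the nearest double, and `Fraction(float)` recovers that double exactly.

```python
from fractions import Fraction

def rel_error(x: Fraction) -> Fraction:
    """Relative error |x - o(x)| / |x| when o is round-to-nearest binary64."""
    if x == 0:
        return Fraction(0)
    rounded = Fraction(float(x))  # float() rounds a Fraction to the nearest double
    return abs(x - rounded) / abs(x)

print(rel_error(Fraction(3, 4)))        # 3/4 is exactly representable -> 0
print(rel_error(Fraction(1, 10)) > 0)   # 1/10 is not -> True
```

Working in `Fraction` keeps every intermediate quantity exact, so the error we measure is the true rounding error, not an approximation of it.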
Now if \(x\) is in the normal range, then the bounds on the relative error are:
- \(\epsilon<\frac{1}{2}\beta^{1-p}\) for RN
- \(\epsilon<\beta^{1-p}\) for RD, RU and RZ
Proof:
\(x\) is nonzero, as it is in the normal range. If \(x\) is a floating point number, \(\epsilon(x)=0\). Otherwise, \(|x|\in\left(\beta^{e},\beta^{e+1}\right)\) for some \(e\). The distance between two consecutive floating point numbers in this set is \(\beta^{e-p+1}\). Thus, we know that \(|x-\textit{RN}(x)|\le\frac{1}{2}\beta^{e-p+1}\) and \(|x-o(x)|\le\beta^{e-p+1}\) for directed rounding. To get the relative error, divide by \(|x|\), noting that the smallest possible value of \(|x|\) is greater than \(\beta^{e}\); this yields \(\epsilon(x)<\frac{1}{2}\beta^{1-p}\) for RN and \(\epsilon(x)<\beta^{1-p}\) for the directed modes.
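A quick numerical spot-check of the RN bound, again assuming binary64 so that \(\frac{1}{2}\beta^{1-p}=2^{-53}\) (the constant name `HALF_ULP` and the sampling scheme are mine):

```python
import random
from fractions import Fraction

HALF_ULP = Fraction(1, 2**53)  # (1/2)·β^(1-p) for binary64 (β = 2, p = 53)

random.seed(1234)
for _ in range(10_000):
    # Random positive rationals; these land comfortably in the normal range.
    x = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    err = abs(x - Fraction(float(x))) / x  # float() rounds to nearest
    assert err < HALF_ULP
print("RN bound holds on all samples")
```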
The unit roundoff \(\textbf{u}\) for a base \(\beta\), precision \(p\) floating point system is:

\[\textbf{u}=\frac{1}{2}\beta^{1-p}\]

for RN, and:

\[\textbf{u}=\beta^{1-p}\]

for the directed rounding modes.
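In Python, the binary64 values of these constants can be read off `sys.float_info`; note that `sys.float_info.epsilon` is the machine epsilon \(\beta^{1-p}=2^{-52}\), i.e. the directed-rounding \(\textbf{u}\), and the RN \(\textbf{u}\) is half of it (variable names are mine):

```python
import sys

beta = sys.float_info.radix          # 2 for binary64
p = sys.float_info.mant_dig          # 53
u_directed = float(beta) ** (1 - p)  # β^(1-p), for RD/RU/RZ
u_rn = u_directed / 2                # (1/2)·β^(1-p), for RN

print(u_directed == sys.float_info.epsilon)  # machine epsilon is β^(1-p) -> True
print(u_rn == 2.0 ** -53)                    # -> True
```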
Let \(x\) be in the normal range. Then the relative error \(\epsilon(x)\) when rounding \(x\) to the nearest satisfies:

\[\epsilon(x)\le\frac{\textbf{u}}{1+\textbf{u}}\]
To prove this, we split into two cases:
Case I, when \(|x|\ge\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\), and Case II when \(\beta^{e}\le|x|<\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\)
Before diving into the proof, let’s discuss what these two cases represent. We know that a floating point number is of the form \(M\beta^{e-p+1}\), where \(M\) is a \(p\)-digit integer in base \(\beta\). For the number to be normal, the leading digit of \(M\) must be nonzero. The smallest normal number for a given \(e\) is therefore the one with \(M=100\dots0=\beta^{p-1}\), giving \(\beta^{p-1}\beta^{e-p+1}=\beta^{e}\). Let’s call this number \(f\).
Now what is the second smallest normal number for a given \(e\) (i.e. the next floating point number)? It has \(M=10\dots01=\beta^{p-1}+1\). The number would be \((\beta^{p-1}+1)\beta^{e-p+1}=\beta^{e}(1+\beta^{1-p})\). Let’s call this number \(g\).
Now Case II above is when \(f\le|x|<g\) with \(|x|\) strictly closer to \(f\) than to \(g\) (i.e. less than halfway between \(f\) and \(g\), so there is no tie). In words, \(|x|\) rounds to the smallest normal number \(f=\beta^{e}\) with the appropriate sign. Case I covers everything else with that exponent.
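For binary64 with \(e=0\), the numbers \(f\) and \(g\) are \(1\) and the next double up, which `math.nextafter` (Python 3.9+) exposes directly. A sketch, with the Case II rounding behavior checked exactly via `Fraction`:

```python
import math
from fractions import Fraction

f = 1.0                      # β^e with β = 2, e = 0
g = math.nextafter(f, 2.0)   # the next double up: β^e·(1 + β^(1-p))
assert g - f == 2.0 ** -52   # the gap is β^(e-p+1)

# A real strictly below the midpoint of [f, g] rounds down to f (Case II):
x = Fraction(1) + Fraction(1, 2**54)
assert float(x) == f
print("gap between f and g:", g - f)
```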
Now Case I is simple to handle:
By assumption, \(|x|\ge\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\). We know that \(|x-\textrm{RN}(x)|\le\frac{1}{2}\beta^{e-p+1}\). Dividing the two quantities gives \(\epsilon(x)\le\frac{\frac{1}{2}\beta^{1-p}}{1+\frac{1}{2}\beta^{1-p}}=\frac{\textbf{u}}{1+\textbf{u}}\), the desired inequality.
For Case II, we have \(\textrm{RN}(x)=\textrm{sign}(x)\cdot f\). Thus \(|x-\textrm{RN}(x)|=|x|-f=|x|-\beta^{e}\). Dividing by \(|x|\) gives \(\epsilon(x)=1-\beta^{e}/|x|\), and since \(\beta^{e}\le|x|<\left(1+\frac{1}{2}\beta^{1-p}\right)\beta^{e}\), this is less than \(1-\frac{1}{1+\frac{1}{2}\beta^{1-p}}=\frac{\textbf{u}}{1+\textbf{u}}\).
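The \(\textbf{u}/(1+\textbf{u})\) bound is nearly attained just below the tie between \(f\) and \(g\). A sketch assuming binary64, with exact rationals (the offset \(2^{-80}\) is an arbitrary small nudge of mine):

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # RN unit roundoff for binary64

# x sits just below the tie between f = 1 and g = 1 + 2**-52 (Case II):
x = 1 + u - Fraction(1, 2**80)
rn_x = Fraction(float(x))   # float() rounds to nearest
assert rn_x == 1            # rounds down to f
eps = (x - rn_x) / x
assert eps <= u / (1 + u)                       # the bound holds...
assert eps > u / (1 + u) - Fraction(1, 2**79)   # ...and is nearly attained
print("u/(1+u) bound is sharp near the tie")
```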
For a subnormal number, the relative error can be anywhere from 0 to 1.

## Standard Models

Let \(*\in\{+,-,\times,\div\}\) be any of those operations, and let \(o\) be any rounding function we’ve considered. Then for all floating point numbers \(a,b\) such that \(a*b\) neither underflows (i.e. produces an inexact subnormal result) nor overflows, the following are true:

\[o(a*b)=(a*b)(1+\epsilon_{1}),\qquad|\epsilon_{1}|\le\textbf{u}\]

\[o(a*b)=\frac{a*b}{1+\epsilon_{2}},\qquad|\epsilon_{2}|\le\textbf{u}\]

These two properties are called the first and second standard models of floating point arithmetic.
I may be missing something deeper, but the first seems trivially true. \(a*b\) is a number in the normal range, so it follows immediately.
The second one follows if you utilize:

\[\epsilon(x)\le\frac{\textbf{u}}{1+\textbf{u}}\]

Indeed, if \(o(x)=x(1+\epsilon)\) with \(|\epsilon|\le\frac{\textbf{u}}{1+\textbf{u}}\), then writing \(o(x)=x/(1+\epsilon_{2})\) gives \(\epsilon_{2}=-\epsilon/(1+\epsilon)\), so \(|\epsilon_{2}|\le\frac{\textbf{u}/(1+\textbf{u})}{1-\textbf{u}/(1+\textbf{u})}=\textbf{u}\).
But that identity is valid only for round to nearest. What about the other rounding mechanisms?
Here’s a proof from *On Relative Errors of Floating Point Operations*:
For a given number \(z\) in the normal range, \(\textrm{ufp}(z)\le|o(z)|\le\beta\,\textrm{ufp}(z)\), where \(\textrm{ufp}(z)=\beta^{e}\) (the “unit in the first place”) is the smallest floating point value with the same exponent as \(z\).
Now \(|z|\) lies in the interval \([\textrm{ufp}(z),\beta\,\textrm{ufp}(z))\), and this interval contains \((\beta-1)\beta^{p-1}\) floating point values. The distance between any two consecutive values is \(\textbf{u}\,\textrm{ufp}(z)\). Thus \(|z-o(z)|\le\textbf{u}\,\textrm{ufp}(z)\). Divide by \(|o(z)|\) to get:

\[\frac{|z-o(z)|}{|o(z)|}\le\frac{\textrm{ufp}(z)}{|o(z)|}\,\textbf{u}\]
But we showed above that the coefficient of \(\textbf{u}\) is bounded by 1, since \(\textrm{ufp}(z)\le|o(z)|\). This completes the proof: take \(\epsilon_{2}\) so that \(o(z)=z/(1+\epsilon_{2})\); then \(|\epsilon_{2}|=|z-o(z)|/|o(z)|\le\textbf{u}\).
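As a sanity check, here are \(\epsilon_{1}\) and \(\epsilon_{2}\) computed exactly for one concrete operation, a binary64 division (variable names mine; `Fraction` keeps the reference value exact):

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # RN unit roundoff, binary64

a, b = 1.0, 3.0
computed = a / b                     # o(a / b), correctly rounded to nearest
exact = Fraction(a) / Fraction(b)    # a / b with no rounding

# First model:  o(a/b) = (a/b)·(1 + eps1)
eps1 = Fraction(computed) / exact - 1
# Second model: o(a/b) = (a/b) / (1 + eps2)
eps2 = exact / Fraction(computed) - 1

assert abs(eps1) <= u
assert abs(eps2) <= u
print("both standard models hold for 1.0/3.0")
```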
The first statement can be written more tightly as:

\[o(a*b)=(a*b)(1+\epsilon_{1}),\qquad|\epsilon_{1}|\le\frac{\textbf{u}}{1+\textbf{u}}\]

for the case of round to nearest.
Now let’s consider a more general case, where the result of \(x*y\) could be in the subnormal range. So now, as long as there is no overflow:

\[o(x*y)=(x*y)(1+\epsilon)+\epsilon'\]

where \(|\epsilon|\le\textbf{u}\) and \(|\epsilon'|\le\beta^{e_{\textrm{min}}}\,\textbf{u}\).
Specifically, if the result is in the normal range, then \(\epsilon'=0\) and this is just the standard model. If the result is in the subnormal range, then \(\epsilon=0\), and the bound on \(\epsilon'\) follows from the following: