Basic Notions

For a binary floating-point system, if \(x\) is normal, then the leading significand bit is 1; otherwise (zero or subnormal) it is 0. If we have some other mechanism to denote normality, then we need not store this redundant bit. Often, this information is embedded into the exponent bits.
| Format | \(p\) | \(e_{\min}\) | \(e_{\max}\) |
|---|---|---|---|
| binary32 (32 bits) | 23+1 | -126 | 127 |
| binary64 (64 bits) | 52+1 | -1022 | 1023 |
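To make the encoding concrete, here is a minimal Python sketch (assuming IEEE-754 binary64) that splits a double into its three fields; note that the leading 1 of a normal number is not among the 52 stored bits:

```python
import struct

def fields(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent E, 52-bit fraction) of a binary64."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

# 1.0 = (+1) * 1.000...0 * 2^0: the leading 1 is implicit, so the fraction is 0.
print(fields(1.0))  # (0, 1023, 0): biased exponent 1023 encodes true exponent 0
# 1.5 = 1.100...0 * 2^0: only the bits after the leading 1 are stored.
print(fields(1.5))  # (0, 1023, 2**51)
```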
Many pocket calculators use \(\beta=10\). One reason for this is that ordinary decimal numbers like 0.1 are not exactly representable in \(\beta=2\).
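A quick way to see this (a Python sketch; `Decimal(float)` converts the binary value exactly):

```python
from decimal import Decimal

# The double nearest to 0.1 is slightly larger than 1/10:
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)  # False: each literal is already rounded on its own
```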
Base 10 is also used for financial calculations.

Rounding Functions :rounding:

A machine number is a number that can be represented exactly in the floating-point system.
If \(x,y\) are machine numbers, then IEEE-754 mandates that when computing \(x*y\) where \(*\in\{+,-,\times,\div\}\), we must get \(\circ(x*y)\), where \(\circ\) is the active rounding function. In words: imagine computing the result to infinite precision, and then rounding it once. This is called correct rounding (a small demonstration follows the list below).
This property gives some advantages:
- Bit-for-bit reproducibility across conforming computing systems
- Greatly simplifies the mathematical analysis of numerical operations
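Here is a sketch of what "compute to infinite precision, then round" means, using Python's Fraction as the infinite-precision reference (binary64 with round-to-nearest-even assumed):

```python
from fractions import Fraction

x, y = 1.0, 2.0**-53               # both are machine numbers
exact = Fraction(x) + Fraction(y)  # the infinitely precise sum 1 + 2^-53
computed = x + y                   # what IEEE-754 addition must return

# 1 + 2^-53 is exactly halfway between 1.0 and the next double 1 + 2^-52;
# round-to-nearest-even therefore picks 1.0.
print(computed == 1.0)             # True
print(Fraction(computed) - exact)  # -1/9007199254740992, i.e. -2^-53
```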
The IEEE-754 (2008) standard now recommends (but does not require) correct rounding for several elementary functions. See the book for the list.
If \(x\) and \(y\) are floating-point numbers such that \(x/2\le y\le2x\), and the system has denormals and correct rounding, then \(x-y\) is itself a floating-point number, so the subtraction involves no rounding. This is known as Sterbenz's lemma.
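A quick empirical check (a sketch, not a proof; binary64 assumed):

```python
import random
from fractions import Fraction

random.seed(1)
for _ in range(10_000):
    x = random.uniform(1.0, 2.0) * 2.0**random.randint(-500, 500)
    y = x * random.uniform(0.5, 2.0)   # guarantees x/2 <= y <= 2x
    # Fraction arithmetic is exact, so this asserts that no rounding occurred.
    assert Fraction(x) - Fraction(y) == Fraction(x - y)
print("x - y was exact in every trial")
```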
The book does not have a proof for this and I did not try proving it.

ULPs :ulp:
\(\ulp(0)\) is defined to be \(\beta^{e_{\min}-p+1}\), the magnitude of the smallest positive subnormal number.
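For binary64 (\(\beta=2\), \(p=53\), \(e_{\min}=-1022\)), Python's math.ulp agrees with this convention:

```python
import math

print(math.ulp(1.0) == 2.0**-52)    # gap between 1.0 and the next double
print(math.ulp(0.0) == 2.0**-1074)  # beta**(emin - p + 1) = 2**(-1022 - 53 + 1)
print(math.ulp(0.0) == math.nextafter(0.0, 1.0))  # the smallest positive subnormal
```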
Fused Multiply Add :fma:

An FMA computes \(\RN(xy+z)\), i.e. the product and the addition together incur only a single rounding. Some benefits of FMA:

- Exact computation of division remainders (see the sketch after this list)
- Evaluation of polynomials via Horner’s rule is much faster (and often more accurate)
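Here is a sketch of the remainder claim using math.fma (available from Python 3.13): with \(q=\RN(x/y)\), the residual \(x-qy\) is known to be a machine number, so the single rounding of the FMA recovers it exactly, while the naive expression rounds \(qy\) first:

```python
import math  # math.fma requires Python >= 3.13

x, y = 1.0, 3.0
q = x / y               # rounded quotient RN(x/y)
r = math.fma(-q, y, x)  # x - q*y with a single rounding: the exact residual
print(r)                # 5.551115123125783e-17, i.e. 2**-54
print(x - q * y)        # 0.0: the naive version first rounds q*y up to 1.0
```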
Beware that you can violate the monotonicity you would expect from rounding if you use the FMA clumsily. Consider evaluating \(\sqrt{x^{2}-y^{2}}\) with \(x=y=1+2^{-52}\); note that this is a machine number. Now say you compute \(x^{2}-y^{2}\) as \(\RN(\RN(x^{2})-y\times y)\), i.e., with an FMA whose addend is the already-rounded square. Exactly, \(x^{2}=1+2^{-51}+2^{-104}\), which rounds down to \(1+2^{-51}\); subtracting the exact \(y^{2}\) inside the FMA then gives \(-2^{-104}\), a negative number. Taking the square root yields NaN.
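The failure can be reproduced directly (again using math.fma from Python 3.13+; note that Python's math.sqrt raises ValueError where IEEE hardware would return a NaN):

```python
import math  # math.fma requires Python >= 3.13

x = y = 1.0 + 2.0**-52
x2 = x * x               # RN(x^2): 1 + 2^-51 + 2^-104 rounds down to 1 + 2^-51
d = math.fma(-y, y, x2)  # RN(RN(x^2) - y^2) = -2^-104 < 0
print(d)                 # -4.930380657631324e-32
print(x * x - y * y)     # 0.0 with plain operators: the square root is then fine
```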
Had you just used the elementary operators, you wouldn’t have this problem, because monotonicity is guaranteed for each of them: \(x\times x-y\times y\) evaluates to 0 here, and \(\sqrt{0}=0\).

IEEE 754 (2008)

In the IEEE-754 (2008) standard, the most significant bit is the sign bit, then come the exponent bits, then the significand (with its leading bit omitted).
Now note that the exponent bits encode a non-negative integer \(E\), so the format defines a bias: 127 for 32 bits, 1023 for 64 bits. The basic decoding rule is this: if \(E=0\), the number is either zero or a denormal, and its exponent is \(e_{\min}=1-\text{bias}\) (you do not subtract the bias). If \(E>0\), subtract the bias to get the real exponent. There is one exception: \(E\) all 1’s is reserved for infinities and NaN’s.
How do we differentiate between a NaN and an infinity? For infinity, the significand is all 0’s. Otherwise, it is a NaN.
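Decoding a few special values with the same field-splitting helper as before (binary64 assumed; the exact NaN payload printed may vary by platform):

```python
import struct

def fields(x: float) -> tuple[int, int, int]:
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

print(fields(5e-324))        # (0, 0, 1): E = 0, a subnormal
print(fields(0.0))           # (0, 0, 0): E = 0, zero
print(fields(float("inf")))  # (0, 2047, 0): E all 1's, significand 0
print(fields(float("nan")))  # (0, 2047, nonzero): a NaN
```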
One nice property of this way of ordering the bits: to get the next larger floating-point number (for a positive finite input), simply reinterpret the bit pattern as an unsigned integer and add 1.
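A sketch of this trick for positive finite doubles (negative numbers and boundary cases like infinity need extra care):

```python
import math
import struct

def next_up(x: float) -> float:
    """Next double above a positive finite x: add 1 to its bit pattern."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]

print(next_up(1.0) == 1.0 + 2.0**-52)                # True
print(next_up(1.0) == math.nextafter(1.0, math.inf)) # True
```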