Tag: floating point

Use of Pythagorean Triples

Hugues’s paper. Look it up.

Gal’s Accurate Table Method

This method is useful when your intermediate precision is the same as the target precision.

Table Driven Methods

Given the input \(x\), the first step is to find a \(y\) such that \(f(x)\) can easily and accurately be calculated from \(f(y)\)....
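A toy sketch of the pattern, with exp on \([0,1]\) (my example, not the book's): tabulate the function at regularly spaced points, pick the nearest table point \(y\), and recover \(f(x)\) from \(f(y)\) with a short polynomial in the small remainder.

```python
import math

# Tabulate exp at the 65 points y = k/64 covering [0, 1].
TABLE = [math.exp(k / 64.0) for k in range(65)]

def exp_table(x):
    # Nearest table point y = k/64, so the reduced argument
    # r = x - y satisfies |r| <= 1/128.
    k = round(x * 64.0)
    r = x - k / 64.0
    # exp(x) = exp(y) * exp(r); a degree-3 Taylor polynomial in r is
    # already good for ~1e-10 relative error at |r| <= 1/128.
    poly = 1.0 + r * (1.0 + r * (0.5 + r / 6.0))
    return TABLE[k] * poly
```

The point is that the table absorbs the "hard" part of the function, leaving only a tiny interval where a very short polynomial suffices.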

Introduction to Table Based Methods

In general, when approximating a function, we’ll split the domain into multiple intervals and approximate each one with a polynomial....

Languages and Compilers

Here are some concerns: Say you want to compute \(a+b+c+d\) and they are all of 32 bit precision, but the machine supports 64 bit....
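A concrete instance of the concern (my example, using `struct` to emulate 32 bit rounding): rounding every partial sum to 32 bits gives a different answer than accumulating in the 64 bit format and rounding once at the end.

```python
import struct

def f32(x):
    # Round a binary64 value to the nearest binary32.
    return struct.unpack('f', struct.pack('f', x))[0]

a, b, c, d = 1.0, 2**-25, 2**-25, 2**-25

# Rounding after every 32 bit operation: each partial sum falls
# back to 1.0, and the three small terms are lost.
narrow = f32(f32(f32(a + b) + c) + d)

# Accumulating in the wider 64 bit format (exact here) and rounding
# once at the end keeps their combined contribution.
wide = f32(a + b + c + d)

assert narrow == 1.0
assert wide == 1.0 + 2**-23
```

Neither answer is "wrong"; they are just two different, defensible semantics, which is exactly why the language has to pin one down.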

Sequential Evaluation of Polynomials

If you don’t have any parallelism available, Horner’s scheme is a good option. And if you have the...
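For reference, Horner’s scheme in a few lines:

```python
def horner(coeffs, x):
    # Evaluate a0 + a1*x + ... + an*x**n with one multiply-add
    # per coefficient, starting from the highest degree term.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```

For \(1+2x+3x^{2}\) at \(x=2\) this gives 17, and each `acc * x + c` step maps directly onto an FMA when one is available.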

Compensated Polynomial Evaluation

The book provides an algorithm. I didn’t bother writing the details.

Compensated Dot Products

When the condition number is not high, one can do the naive algorithm for the dot product. Otherwise, one should do the...
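A sketch of a compensated dot product in the Ogita–Rump style (my reconstruction; the book's version may differ in details). It combines an error-free product (Dekker/Veltkamp splitting, since no FMA is assumed) with an error-free sum, and carries all the rounding errors in a separate accumulator.

```python
def two_sum(a, b):
    # Error-free sum (Knuth): s + e == a + b exactly.
    s = a + b
    bv = s - a
    av = s - bv
    e = (a - av) + (b - bv)
    return s, e

def split(x, factor=2**27 + 1):
    # Veltkamp splitting for binary64: x == hi + lo, each half
    # short enough that the partial products below are exact.
    c = factor * x
    hi = c - (c - x)
    return hi, x - hi

def two_prod(a, b):
    # Error-free product without FMA (Dekker): p + e == a * b exactly.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def dot2(xs, ys):
    # Compensated dot product: the leading part lives in p, the
    # rounding errors accumulate in s and are folded back at the end.
    p, s = two_prod(xs[0], ys[0])
    for x, y in zip(xs[1:], ys[1:]):
        h, r = two_prod(x, y)
        p, q = two_sum(p, h)
        s += q + r
    return p + s
```

On \([10^{16},1,-10^{16}]\cdot[1,1,1]\) the naive loop returns 0.0, while `dot2` returns the exact 1.0.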

Computing Sums More Accurately

Reordering the Operands, and a Bit More. General ideas: Sort all your summands in ascending order of magnitude. Even more complex, sort...
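A minimal version of the first idea (sort by magnitude, then sum left to right), with a case where the order visibly matters:

```python
def sum_ascending(xs):
    # Recursive summation after sorting by increasing magnitude,
    # so small terms combine before the running sum grows large
    # enough to swallow them.
    total = 0.0
    for x in sorted(xs, key=abs):
        total += x
    return total

xs = [1.0, 2**-53, 2**-53]
assert sum(xs) == 1.0                     # naive order loses the tail
assert sum_ascending(xs) == 1.0 + 2**-52  # ascending order keeps it
```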

Computing Validated Running Error Bounds

The problem with the previous error bounds is that they are in terms of quantities like \(\sum|a_{i}|\), which are not known in advance,...

Properties For Deriving Validated Running Error Bounds

Theorem for FMA: Let \(x,y,z\) be nonnegative floating point numbers. Assuming underflow does not occur, then \(xy+z\le...

Some Refined Error Estimates

Let the rounding mode be RN. Assume no overflow occurs. Then if you do recursive summation, the following inequality related to the...

Notation For Error Analysis and Classical Error Estimates

Unless specified otherwise, everything in this chapter assumes no underflow. We often will have many factors of \((1+\epsilon_{i})\),...

Polynomials With Exact Representable Coefficients

An Iterative Method: Compute the minimax approximation in a wider format. Then round the coefficient of the constant term. Then recompute...

Introduction to Polynomial Approximations In Finite Precision

We discussed calculating the minimax polynomial using Remez’s Algorithm, but we overlooked some subtleties. While the algorithm does...

Evaluation of the Error of an FMA

The error of an FMA calculation is not always a floating point number. However, we can use two floating point numbers to exactly...

Multiplication by an Arbitrary Precision Constant with an FMA

Suppose you need to multiply by a constant that is not exactly representable. Think \(\pi\) and the like. We’d like to multiply and...
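The first step is representing the constant as an unevaluated pair. A sketch for \(\pi\) (my example; only the splitting step is shown — the section's actual algorithm then uses an FMA, which plain Python lacks before `math.fma` in 3.13):

```python
from decimal import Decimal
from fractions import Fraction

# pi to ~40 decimal digits, far beyond binary64 precision.
PI = Fraction(Decimal("3.141592653589793238462643383279502884197"))

c_hi = float(PI)                    # nearest double to pi
c_lo = float(PI - Fraction(c_hi))   # nearest double to the residual

# The pair (c_hi, c_lo) carries roughly twice the precision of
# c_hi alone.
err_pair = abs(PI - (Fraction(c_hi) + Fraction(c_lo)))
err_single = abs(PI - Fraction(c_hi))
assert err_pair < err_single / 2**40
```

Multiplying by both halves and combining with an FMA is what recovers most of that extra precision in the actual algorithm.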

Conversions Between Integers and Floating Point Numbers

This is a short section with some details and magic numbers. I did not bother.

Radix Conversion Algorithms

The algorithms are in the book. I did not reproduce them here. I did not read the rest of the section. There are a lot more details there.

Conditions on the Formats

This section deals with changing bases. The most obvious application is to go back and forth between decimal and binary (to make it easy...

Newton-Raphson Based Square Root With FMA

The Basic Methods: One way is to use Newton’s iteration on \(f(x)=x^{2}-a\). This method for calculating square root goes back thousands...
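The division-based version of that iteration, as a sketch (the seed choice is mine; real implementations seed from the exponent and significand instead):

```python
import math

def newton_sqrt(a, iters=6):
    # Newton's iteration on f(x) = x**2 - a:
    # x_{k+1} = x_k - f(x_k)/f'(x_k) = (x_k + a/x_k) / 2.
    x = a if a >= 1.0 else 1.0   # crude starting guess
    for _ in range(iters):
        x = 0.5 * (x + a / x)
    return x

assert abs(newton_sqrt(2.0) - math.sqrt(2.0)) < 1e-12
```

Convergence is quadratic, so once the seed is decent a handful of iterations reaches full double precision.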

Possible Double Rounding in Division Algorithms

This section deals with floating point \(a,b\) values, not necessarily between 1 and 2. Assume they are non-negative, though. For this,...

Using The Newton Iteration For Correctly Rounded Division With FMA

We need to calculate \(o(a/b)\) where \(a,b\) are binary floating point numbers, and \(o\) is RN, RD, RU or RZ. We have a useful proof:...

Variants of the Newton Raphson Iteration

Assume \(\beta=2\) for this section. Some of it may not work for decimal. We want to approximate \(b/a\). Assume \(1\le a,b<2\). In...

Another Splitting Technique: Splitting Around a Power of 2

In this section, assume \(\beta=2\). Now given a floating point \(x\), we want to form two floating point numbers \(x_{h}\) and...

Computation of Residuals of Division and Square Root With an FMA

For this article, define a representable pair for a floating point number \(x\) to be any pair \((M,e)\) such that...

Accurate Computation of the Product of Two Numbers

The 2MultFMA Algorithm: This has been covered elsewhere. It works well when you use FMA. If No FMA Is Available: If there is no FMA...
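A sketch of the no-FMA route (Dekker's product built on Veltkamp splitting; the constants below are for binary64):

```python
from fractions import Fraction

def split(x, factor=2**27 + 1):
    # Veltkamp splitting: x == hi + lo with each half short enough
    # that the pairwise products below commit no rounding error.
    c = factor * x
    hi = c - (c - x)
    return hi, x - hi

def two_prod(a, b):
    # Dekker: returns (p, e) with p = RN(a*b) and p + e == a*b
    # exactly, barring over/underflow.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

p, e = two_prod(0.1, 0.3)
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

With an FMA, the whole thing collapses to `p = a*b; e = fma(a, b, -p)`, which is why 2MultFMA is preferred when available.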

Accurate Computation of the Sum of Two Numbers

Let \(a,b\) be two floating point numbers. Let \(s\) be \(\RN(a+b)\). Regardless of which number it picks in a tie, it can be shown that...
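That error term is computable in floating point arithmetic itself. Knuth's 2Sum does it in six operations, with no comparison of magnitudes:

```python
from fractions import Fraction

def two_sum(a, b):
    # Returns (s, t) with s = RN(a + b) and s + t == a + b exactly.
    s = a + b
    bv = s - a
    av = s - bv
    t = (a - av) + (b - bv)
    return s, t

s, t = two_sum(1e16, 1.2345)
assert Fraction(s) + Fraction(t) == Fraction(1e16) + Fraction(1.2345)
```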

Exact Multiplications and Divisions

When you multiply a floating point number by a power of \(\beta\), the result is exact provided there is no over or underflow. Another...

Exact Addition

Sterbenz’s Lemma: If your floating point system has denormals, and if \(x,y\) are non-negative, finite floating point numbers such that...
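The condition in the usual statement is \(y/2\le x\le 2y\); under it, the subtraction is exact. A quick check with exact rationals (values mine):

```python
from fractions import Fraction

x, y = 1.9, 1.0   # y/2 <= x <= 2y holds
# The floating point subtraction commits no rounding error at all:
assert Fraction(x) - Fraction(y) == Fraction(x - y)
```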

Computing The Precision

To get \(p\) of the floating point system you are on:

    i = 0
    A = 1.0
    B = 2  # The radix.
    while (A + 1.0) - A == 1.0:
        A = B * A
        i += 1
    # On exit, A == B**i is the first power of B for which adding 1
    # is no longer exact, so i equals the precision p.

Computing The Radix

Suppose we want to compute the radix of a floating point system. The code below will do it for you; it works assuming the...
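The code itself didn't survive my excerpt; a reconstruction along the lines of Malcolm's classic algorithm (mine, not necessarily the book's version):

```python
def compute_radix():
    # Grow a until consecutive floats near a are spaced more than 1
    # apart, i.e. until (a + 1.0) - a stops returning 1.0.
    a = 1.0
    while (a + 1.0) - a == 1.0:
        a *= 2.0
    # Floats near a are now spaced by the radix; probe for the first
    # increment that survives the addition.
    b = 1.0
    while (a + b) - a == 0.0:
        b += 1.0
    return (a + b) - a

assert compute_radix() == 2.0   # on IEEE-754 binary hardware
```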

Accurately Computing Supremum Norms

We never discussed how to calculate \(||f-p||_{\infty}\). Maple has a function to do this, but it can be inaccurate. Most people will...

Rational Approximations

Sometimes you need a fairly high degree polynomial to get reasonable accuracy, but can achieve a far greater accuracy with a much lower...

Remez’s Algorithm

Remez’s algorithm is one that converges to the minimax polynomial of a function. The author recommends using a polynomial approximation...

Miscellaneous (Chebyshev)

Chebyshev vs Minimax: Note that the best minimax polynomial approximation need not be the Chebyshev polynomial. The latter is the best...

Least Maximum Polynomial Approximations

The supremum norm is given by \(||f-p||_{\infty}=\max_{a\le x\le b}|f(x)-p(x)|\). It is denoted by \(L^{\infty}\). Given a function...

Least Squares Polynomial Approximations

First, just a definition: A monic polynomial is one whose leading coefficient is 1. We want to find a polynomial of degree \(n\) that...

IEEE Support in Programming Languages

Do not assume that the operations in a programming language will map to the ones in the standard. The standard was originally written...

Introduction to The Classical Theory of Polynomial or Rational Approximations

We often will approximate functions as polynomial or rational functions. When doing this, we introduce two types of errors:...

Rest of chapter

I skipped the rest of the chapter (including hardware details).

Special Values

NaN Signaling NaNs do not appear as the result of arithmetic operations. When they appear as an operand, they signal an...

Default Exception Handling

Invalid: The default result of such an operation is a quiet NaN. The operations that lead to Invalid are: Most operations on a...

Conversions To/From String Representations

This section addresses how one can convert a character sequence into a decimal/binary floating point number. Decimal Character Sequence...

Comparisons

The standard requires that you can compare any two floating point numbers, as long as they share the same radix. The unordered condition...

Attributes and Rounding

Rounding Direction Attributes: IEEE 754-2008 requires that the following be correctly rounded: Arithmetic operations: Addition...

Operations Specified By The Standards

Arithmetic Operations and Square Root. Handling Signed 0: If \(x,y\) are nonzero, and \(x+y=0\) or \(x-y=0\) exactly, then the result is \(+0\)...
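A quick check of the signed-zero rule under the default round-to-nearest (only round toward \(-\infty\) yields \(-0\) in this case):

```python
import math

# x - y with x == y (both nonzero, finite) gives +0 under
# round-to-nearest.
z = 1.5 - 1.5
assert math.copysign(1.0, z) == 1.0       # sign bit of the result is clear
assert math.copysign(1.0, -0.0) == -1.0   # but -0 does exist as a value
```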

Formats

The standard defines several interchange formats to allow for transferring floating point data between machines. They could be as bit...

Manipulating Double or Triple Word Numbers

The target format is the format of the result. The target precision is the precision of the target format. When computing polynomials,...

Computing the Error of a FP Addition or Multiplication

Let \(a,b\) be 2 floating point numbers. It can be shown that \((a+b)-\RN(a+b)\) is a floating point number. This may not be true for...
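When the operands are ordered by magnitude, the addition error can be recovered with just three operations (Dekker's Fast2Sum):

```python
from fractions import Fraction

def fast_two_sum(a, b):
    # Requires |a| >= |b|. Returns (s, t) with s = RN(a + b) and
    # s + t == a + b exactly.
    s = a + b
    t = b - (s - a)
    return s, t

s, t = fast_two_sum(1e16, 3.14159)
assert Fraction(s) + Fraction(t) == Fraction(1e16) + Fraction(3.14159)
```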

Basic Notions of Floating Point Arithmetic

Basic Notions: For a binary floating point system, if \(x\) is normal, then the leading bit is 1. Otherwise it is 0. If we have some...

Note on the Choice of Radix

It has been shown that \(\beta=2\) gives better worst case and average accuracy than all other bases. If...

Lost and Preserved Properties of Arithmetic

Floating point addition and multiplication are still commutative. Associativity is compromised, though. An example: Let...
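The truncated example is probably in this spirit (this particular instance is mine):

```python
eps = 2**-53   # half an ulp of 1.0 in binary64

# Left grouping: eps + eps = 2**-52 is exact, and 1 + 2**-52 is
# representable, so nothing is lost.
left = (eps + eps) + 1.0

# Right grouping: eps + 1.0 rounds (ties to even) back to 1.0,
# and the second addition does the same.
right = eps + (eps + 1.0)

assert left == 1.0 + 2**-52
assert right == 1.0
```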

Floating Point Exceptions

In IEEE-754, the implementer can signal an exception along with the result of the operation. Usually (or perhaps mandated?), the signal...

Fused Multiply Add

Let \(o\) be the rounding function, and let \(a,b,c\) be floating point numbers. Then \(\mathrm{FMA}(a,b,c)\) is \(o(ab+c)\). If...

ULP Errors vs Relative Errors

Converting From ULP Errors to Relative Errors: Let \(x\) be in the normal range, and \(|x-X|=\alpha\ulp(x)\). Then: \begin{equation*}...

The ULP Function

There are multiple definitions of unit in the last place. I think most agree when \(x\) is not near a boundary point. Here is the...
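Python exposes one such definition as `math.ulp` (3.9+); note the factor-of-two jump at the power of 2, which is exactly the kind of boundary point where the definitions disagree:

```python
import math

assert math.ulp(1.0) == 2**-52             # just above the power of 2
assert math.ulp(0.9999) == 2**-53          # just below it, half the size
assert math.ulp(2.0**-1074) == 2.0**-1074  # smallest subnormal
```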

Relative Error Due To Rounding

Ranges: The normal range is the set of real numbers with \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\), and the subnormal range is where...

Rounding Functions

IEEE 754-2008 specifies five rounding functions: Round toward \(-\infty\) (RD): \(\mathrm{RD}(x)\) is the largest floating point number less than or...

The Other “Numbers”

0 (some systems have signed 0’s as well); NaN for any invalid operation; \(\infty\) (some systems are signed, some are not). In the IEEE...

Underflow

Underflow before rounding occurs when the absolute value of the exact value is strictly less than \(\beta^{e_{\textit{min}}}\) (i.e. the...

Normalizing

We would like a unique way to represent \(x\). One approach is to pick the one which gives the smallest exponent possible (while still...

Definitions

A radix \(\beta\) floating point number \(x\) is of the form \(m\cdot\beta^{e}\), where \(m\) (with \(|m|<\beta\)) is called the significand and \(e\) is...