Tag: floating point

Use of Pythagorean Triples

Hugues’s paper. Look it up.

Gal’s Accurate Table Method

This method is useful when your intermediate precision is the same as the target precision.

Table Driven Methods

Given the input \(x\), the first step is to find a \(y\) such that \(f(x)\) can easily and accurately be calculated from \(f(y)\)....
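A toy sketch of the pattern, with exp on \([0,1]\) (my example, not the book's): tabulate the function at regularly spaced points, pick the nearest table point \(y\), and recover \(f(x)\) from \(f(y)\) with a short polynomial in the small remainder.

```python
import math

# Tabulate exp at the 65 points y = k/64 covering [0, 1].
TABLE = [math.exp(k / 64.0) for k in range(65)]

def exp_table(x):
    # Nearest table point y = k/64, so the reduced argument
    # r = x - y satisfies |r| <= 1/128.
    k = round(x * 64.0)
    r = x - k / 64.0
    # exp(x) = exp(y) * exp(r); a degree-3 Taylor polynomial in r is
    # already good for ~1e-10 relative error at |r| <= 1/128.
    poly = 1.0 + r * (1.0 + r * (0.5 + r / 6.0))
    return TABLE[k] * poly
```

The point is that the table absorbs the "hard" part of the function, leaving only a tiny interval where a very short polynomial suffices.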

Introduction to Table Based Methods

In general, when approximating a function, we’ll split the domain into multiple intervals and approximate each one with a polynomial....

Languages and Compilers

Here are some concerns: Say you want to compute \(a+b+c+d\) and they are all of 32 bit precision, but the machine supports 64 bit....
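A concrete instance of the concern (my example, using `struct` to emulate 32 bit rounding): rounding every partial sum to 32 bits gives a different answer than accumulating in the 64 bit format and rounding once at the end.

```python
import struct

def f32(x):
    # Round a binary64 value to the nearest binary32.
    return struct.unpack('f', struct.pack('f', x))[0]

a, b, c, d = 1.0, 2**-25, 2**-25, 2**-25

# Rounding after every 32 bit operation: each partial sum falls
# back to 1.0, and the three small terms are lost.
narrow = f32(f32(f32(a + b) + c) + d)

# Accumulating in the wider 64 bit format (exact here) and rounding
# once at the end keeps their combined contribution.
wide = f32(a + b + c + d)

assert narrow == 1.0
assert wide == 1.0 + 2**-23
```

Neither answer is "wrong"; they are just two different, defensible semantics, which is exactly why the language has to pin one down.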

Sequential Evaluation of Polynomials

If you don’t have any parallelism available, Horner’s scheme is a good option. And if you have the...
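For reference, Horner’s scheme in a few lines:

```python
def horner(coeffs, x):
    # Evaluate a0 + a1*x + ... + an*x**n with one multiply-add
    # per coefficient, starting from the highest degree term.
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```

For \(1+2x+3x^{2}\) at \(x=2\) this gives 17, and each `acc * x + c` step maps directly onto an FMA when one is available.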

Compensated Polynomial Evaluation

The book provides an algorithm. I didn’t bother writing the details.

Compensated Dot Products

When the condition number is not high, one can do the naive algorithm for the dot product. Otherwise, one should do the...
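A sketch of a compensated dot product in the Ogita–Rump style (my reconstruction; the book's version may differ in details). It combines an error-free product (Dekker/Veltkamp splitting, since no FMA is assumed) with an error-free sum, and carries all the rounding errors in a separate accumulator.

```python
def two_sum(a, b):
    # Error-free sum (Knuth): s + e == a + b exactly.
    s = a + b
    bv = s - a
    av = s - bv
    e = (a - av) + (b - bv)
    return s, e

def split(x, factor=2**27 + 1):
    # Veltkamp splitting for binary64: x == hi + lo, each half
    # short enough that the partial products below are exact.
    c = factor * x
    hi = c - (c - x)
    return hi, x - hi

def two_prod(a, b):
    # Error-free product without FMA (Dekker): p + e == a * b exactly.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def dot2(xs, ys):
    # Compensated dot product: the leading part lives in p, the
    # rounding errors accumulate in s and are folded back at the end.
    p, s = two_prod(xs[0], ys[0])
    for x, y in zip(xs[1:], ys[1:]):
        h, r = two_prod(x, y)
        p, q = two_sum(p, h)
        s += q + r
    return p + s
```

On \([10^{16},1,-10^{16}]\cdot[1,1,1]\) the naive loop returns 0.0, while `dot2` returns the exact 1.0.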

Computing Sums More Accurately

Reordering the Operands, and a Bit More. General ideas: Sort all your summands in ascending order of magnitude. Even more complex, sort...
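A minimal version of the first idea (sort by magnitude, then sum left to right), with a case where the order visibly matters:

```python
def sum_ascending(xs):
    # Recursive summation after sorting by increasing magnitude,
    # so small terms combine before the running sum grows large
    # enough to swallow them.
    total = 0.0
    for x in sorted(xs, key=abs):
        total += x
    return total

xs = [1.0, 2**-53, 2**-53]
assert sum(xs) == 1.0                     # naive order loses the tail
assert sum_ascending(xs) == 1.0 + 2**-52  # ascending order keeps it
```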

Computing Validated Running Error Bounds

The problem with the previous error bounds is that they are in terms of quantities like \(\sum|a_{i}|\), which are not known in advance,...

Properties For Deriving Validated Running Error Bounds

Theorem for FMA: Let \(x,y,z\) be nonnegative floating point numbers. Assuming underflow does not occur, then \(xy+z\le...

Some Refined Error Estimates

Let the rounding mode be RN. Assume no overflow occurs. Then if you do recursive summation, the following inequality related to the...

Notation For Error Analysis and Classical Error Estimates

Unless specified otherwise, everything in this chapter assumes no underflow. We often will have many factors of \((1+\epsilon_{i})\),...

Polynomials With Exact Representable Coefficients

An Iterative Method: Compute the minimax approximation in a wider format. Then round the coefficient of the constant term. Then recompute...

Introduction to Polynomial Approximations In Finite Precision

We discussed calculating the minimax polynomial using Remez’s Algorithm, but we overlooked some subtleties. While the algorithm does...

Evaluation of the Error of an FMA

The error of an FMA calculation is not always a floating point number. However, we can use two floating point numbers to exactly...

Multiplication by an Arbitrary Precision Constant with an FMA

Suppose you need to multiply by a constant that is not exactly representable. Think \(\pi\) and the like. We’d like to multiply and...
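The first step is representing the constant as an unevaluated pair. A sketch for \(\pi\) (my example; only the splitting step is shown — the section's actual algorithm then uses an FMA, which plain Python lacks before `math.fma` in 3.13):

```python
from decimal import Decimal
from fractions import Fraction

# pi to ~40 decimal digits, far beyond binary64 precision.
PI = Fraction(Decimal("3.141592653589793238462643383279502884197"))

c_hi = float(PI)                    # nearest double to pi
c_lo = float(PI - Fraction(c_hi))   # nearest double to the residual

# The pair (c_hi, c_lo) carries roughly twice the precision of
# c_hi alone.
err_pair = abs(PI - (Fraction(c_hi) + Fraction(c_lo)))
err_single = abs(PI - Fraction(c_hi))
assert err_pair < err_single / 2**40
```

Multiplying by both halves and combining with an FMA is what recovers most of that extra precision in the actual algorithm.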

Conversions Between Integers and Floating Point Numbers

This is a short section with some details and magic numbers. I did not bother.

Radix Conversion Algorithms

The algorithms are in the book. I did not reproduce them here. I did not read the rest of the section. There are a lot more details there.

Conditions on the Formats

This section deals with changing bases. The most obvious application is to go back and forth between decimal and binary (to make it easy...

Newton-Raphson Based Square Root With FMA

The Basic Methods: One way is to use Newton’s iteration on \(f(x)=x^{2}-a\). This method for calculating square root goes back thousands...
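The division-based version of that iteration, as a sketch (the seed choice is mine; real implementations seed from the exponent and significand instead):

```python
import math

def newton_sqrt(a, iters=6):
    # Newton's iteration on f(x) = x**2 - a:
    # x_{k+1} = x_k - f(x_k)/f'(x_k) = (x_k + a/x_k) / 2.
    x = a if a >= 1.0 else 1.0   # crude starting guess
    for _ in range(iters):
        x = 0.5 * (x + a / x)
    return x

assert abs(newton_sqrt(2.0) - math.sqrt(2.0)) < 1e-12
```

Convergence is quadratic, so once the seed is decent a handful of iterations reaches full double precision.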

Possible Double Rounding in Division Algorithms

This section deals with floating point \(a,b\) values, not necessarily between 1 and 2. Assume they are non-negative, though. For this,...

Using The Newton Iteration For Correctly Rounded Division With FMA

We need to calculate \(o(a/b)\) where \(a,b\) are binary floating point numbers, and \(o\) is RN, RD, RU or RZ. We have a useful proof:...

Variants of the Newton Raphson Iteration

Assume \(\beta=2\) for this section. Some of it may not work for decimal. We want to approximate \(b/a\). Assume \(1\le a,b<2\). In...

Another Splitting Technique: Splitting Around a Power of 2

In this section, assume \(\beta=2\). Now given a floating point \(x\), we want to form two floating point numbers \(x_{h}\) and...

Computation of Residuals of Division and Square Root With an FMA

For this article, define a representable pair for a floating point number \(x\) to be any pair \((M,e)\) such that...

Accurate Computation of the Product of Two Numbers

The 2MultFMA Algorithm: This has been covered elsewhere. It works well when you use FMA. If No FMA Is Available: If there is no FMA...
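A sketch of the no-FMA route (Dekker's product built on Veltkamp splitting; the constants below are for binary64):

```python
from fractions import Fraction

def split(x, factor=2**27 + 1):
    # Veltkamp splitting: x == hi + lo with each half short enough
    # that the pairwise products below commit no rounding error.
    c = factor * x
    hi = c - (c - x)
    return hi, x - hi

def two_prod(a, b):
    # Dekker: returns (p, e) with p = RN(a*b) and p + e == a*b
    # exactly, barring over/underflow.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

p, e = two_prod(0.1, 0.3)
assert Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

With an FMA, the whole thing collapses to `p = a*b; e = fma(a, b, -p)`, which is why 2MultFMA is preferred when available.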

Accurate Computation of the Sum of Two Numbers

Let \(a,b\) be two floating point numbers. Let \(s\) be \(\RN(a+b)\). Regardless of which number it picks in a tie, it can be shown that...
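That error term is computable in floating point arithmetic itself. Knuth's 2Sum does it in six operations, with no comparison of magnitudes:

```python
from fractions import Fraction

def two_sum(a, b):
    # Returns (s, t) with s = RN(a + b) and s + t == a + b exactly.
    s = a + b
    bv = s - a
    av = s - bv
    t = (a - av) + (b - bv)
    return s, t

s, t = two_sum(1e16, 1.2345)
assert Fraction(s) + Fraction(t) == Fraction(1e16) + Fraction(1.2345)
```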

Exact Multiplications and Divisions

When you multiply a floating point number by a power of \(\beta\), the result is exact provided there is no over or underflow. Another...

Exact Addition

Sterbenz’s Lemma: If your floating point system has denormals, and if \(x,y\) are non-negative, finite floating point numbers such that...
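The condition in the usual statement is \(y/2\le x\le 2y\); under it, the subtraction is exact. A quick check with exact rationals (values mine):

```python
from fractions import Fraction

x, y = 1.9, 1.0   # y/2 <= x <= 2y holds
# The floating point subtraction commits no rounding error at all:
assert Fraction(x) - Fraction(y) == Fraction(x - y)
```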

Computing The Precision

To get \(p\) of the floating point system you are on:

    i = 0
    A = 1.0
    B = 2  # The radix.
    while (A + 1.0) - A == 1.0:
        A = B * A
        i += 1
    # On exit, A == B**i is the first power of B for which adding 1
    # is no longer exact, so i equals the precision p.

Computing The Radix

Suppose we want to compute the radix of a floating point system. The code below will do it for you; it works assuming the...
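The code itself didn't survive my excerpt; a reconstruction along the lines of Malcolm's classic algorithm (mine, not necessarily the book's version):

```python
def compute_radix():
    # Grow a until consecutive floats near a are spaced more than 1
    # apart, i.e. until (a + 1.0) - a stops returning 1.0.
    a = 1.0
    while (a + 1.0) - a == 1.0:
        a *= 2.0
    # Floats near a are now spaced by the radix; probe for the first
    # increment that survives the addition.
    b = 1.0
    while (a + b) - a == 0.0:
        b += 1.0
    return (a + b) - a

assert compute_radix() == 2.0   # on IEEE-754 binary hardware
```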

Accurately Computing Supremum Norms

We never discussed how to calculate \(||f-p||_{\infty}\). Maple has a function to do this, but it can be inaccurate. Most people will...

Rational Approximations

Sometimes you need a fairly high degree polynomial to get reasonable accuracy, but can achieve a far greater accuracy with a much lower...

Remez’s Algorithm

Remez’s algorithm is one that converges to the minimax polynomial of a function. The author recommends using a polynomial approximation...

Miscellaneous (Chebyshev)

Chebyshev vs Minimax: Note that the best minimax polynomial approximation need not be the Chebyshev polynomial. The latter is the best...

Least Maximum Polynomial Approximations

The supremum norm is given by \(||f-p||_{\infty}=\max_{a\le x\le b}|f(x)-p(x)|\). It is denoted by \(L^{\infty}\). Given a function...

Least Squares Polynomial Approximations

First, just a definition: A monic polynomial is one whose leading coefficient is 1. We want to find a polynomial of degree \(n\) that...

IEEE Support in Programming Languages

Do not assume that the operations in a programming language will map to the ones in the standard. The standard was originally written...

Introduction to The Classical Theory of Polynomial or Rational Approximations

We often will approximate functions as polynomial or rational functions. When doing this, we introduce two types of errors:...

Rest of chapter

I skipped the rest of the chapter (including hardware details).

Special Values

NaN Signaling NaNs do not appear as the result of arithmetic operations. When they appear as an operand, they signal an...

Default Exception Handling

Invalid: The default result of such an operation is a quiet NaN. The operations that lead to Invalid are: Most operations on a...

Conversions To/From String Representations

This section addresses how one can convert a character sequence into a decimal/binary floating point number. Decimal Character Sequence...

Comparisons

The standard requires that you can compare any two floating point numbers, as long as they share the same radix. The unordered condition...

Attributes and Rounding

Rounding Direction Attributes: IEEE 754-2008 requires that the following be correctly rounded: Arithmetic operations: Addition...

Operations Specified By The Standards

Arithmetic Operations and Square Root. Handling Signed 0: If \(x,y\) are nonzero, and \(x+y=0\) or \(x-y=0\) exactly, then the result is \(+0\)...
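A quick check of the signed-zero rule under the default round-to-nearest (only round toward \(-\infty\) yields \(-0\) in this case):

```python
import math

# x - y with x == y (both nonzero, finite) gives +0 under
# round-to-nearest.
z = 1.5 - 1.5
assert math.copysign(1.0, z) == 1.0       # sign bit of the result is clear
assert math.copysign(1.0, -0.0) == -1.0   # but -0 does exist as a value
```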

Formats

The standard defines several interchange formats to allow for transferring floating point data between machines. They could be as bit...

Manipulating Double or Triple Word Numbers

The target format is the format of the result. The target precision is the precision of the target format. When computing polynomials,...

Computing the Error of a FP Addition or Multiplication

Let \(a,b\) be 2 floating point numbers. It can be shown that \((a+b)-\RN(a+b)\) is a floating point number. This may not be true for...
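When the operands are ordered by magnitude, the addition error can be recovered with just three operations (Dekker's Fast2Sum):

```python
from fractions import Fraction

def fast_two_sum(a, b):
    # Requires |a| >= |b|. Returns (s, t) with s = RN(a + b) and
    # s + t == a + b exactly.
    s = a + b
    t = b - (s - a)
    return s, t

s, t = fast_two_sum(1e16, 3.14159)
assert Fraction(s) + Fraction(t) == Fraction(1e16) + Fraction(3.14159)
```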

Basic Notions of Floating Point Arithmetic

Basic Notions: For a binary floating point system, if \(x\) is normal, then the leading bit is 1. Otherwise it is 0. If we have some...

Note on the Choice of Radix

It has been shown that \(\beta=2\) gives better worst case and average accuracy than all other bases. If...

Lost and Preserved Properties of Arithmetic

Floating point addition and multiplication are still commutative. Associativity is compromised, though. An example: Let...
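The truncated example is probably in this spirit (this particular instance is mine):

```python
eps = 2**-53   # half an ulp of 1.0 in binary64

# Left grouping: eps + eps = 2**-52 is exact, and 1 + 2**-52 is
# representable, so nothing is lost.
left = (eps + eps) + 1.0

# Right grouping: eps + 1.0 rounds (ties to even) back to 1.0,
# and the second addition does the same.
right = eps + (eps + 1.0)

assert left == 1.0 + 2**-52
assert right == 1.0
```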

Floating Point Exceptions

In IEEE-754, the implementer can signal an exception along with the result of the operation. Usually (or perhaps mandated?), the signal...

Fused Multiply Add

Let \(o\) be the rounding function, and let \(a,b,c\) be floating point numbers. Then \(\mathrm{FMA}(a,b,c)\) is \(o(ab+c)\). If...

ULP Errors vs Relative Errors

Converting From ULP Errors to Relative Errors: Let \(x\) be in the normal range, and \(|x-X|=\alpha\ulp(x)\). Then: \begin{equation*}...

The ULP Function

There are multiple definitions of unit in the last place. I think most agree when \(x\) is not near a boundary point. Here is the...
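Python exposes one such definition as `math.ulp` (3.9+); note the factor-of-two jump at the power of 2, which is exactly the kind of boundary point where the definitions disagree:

```python
import math

assert math.ulp(1.0) == 2**-52             # just above the power of 2
assert math.ulp(0.9999) == 2**-53          # just below it, half the size
assert math.ulp(2.0**-1074) == 2.0**-1074  # smallest subnormal
```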

Relative Error Due To Rounding

Ranges: The normal range is the set of real numbers with \(\beta^{e_{\textit{min}}}\le|x|\le\Omega\), and the subnormal range is where...

Rounding Functions

IEEE 754-2008 specifies five rounding functions: Round toward \(-\infty\) (RD): \(\mathrm{RD}(x)\) is the largest floating point number less than or...

The Other “Numbers”

0 (some systems have signed 0’s as well); NaN for any invalid operation; \(\infty\) (some systems are signed, some are not). In the IEEE...

Underflow

Underflow before rounding occurs when the absolute value of the exact value is strictly less than \(\beta^{e_{\textit{min}}}\) (i.e. the...

Normalizing

We would like a unique way to represent \(x\). One approach is to pick the one which gives the smallest exponent possible (while still...

Definitions

A radix \(\beta\) floating point number \(x\) is of the form \(m\cdot\beta^{e}\), where \(m\) (with \(|m|<\beta\)) is called the significand and \(e\) is...