Here are some concerns:
- Say you want to compute \(a+b+c+d\) and the operands are all of 32-bit precision, but the machine supports 64-bit. Should the machine calculate in 64 bits and then downcast?
- Say you have r = a+b+c+d where r is of a higher-precision type but the other variables are not. Should the sum be computed in the higher precision or in the lower one? (See the sketch below.)
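A minimal C sketch of that second concern. The values are hypothetical, chosen so the rounding difference is visible, and the commented results assume a target where float arithmetic is carried out in single precision (FLT_EVAL_METHOD == 0, e.g. x86-64 with SSE):

```c
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = 0.5f, c = 0.5f, d = 0.5f;

    /* The right-hand side is evaluated entirely in float, then
       converted to double: the halves are lost to rounding. */
    double r1 = a + b + c + d;

    /* Casting the first operand promotes the whole chain to double. */
    double r2 = (double)a + b + c + d;

    printf("r1 = %.1f\n", r1);  /* 100000000.0 */
    printf("r2 = %.1f\n", r2);  /* 100000001.5 */
    return 0;
}
```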
The tradeoffs to consider:
- 32-bit precision is already more accurate than most real-world measurements. Also, more bits bring potential cache headaches (more cache space and memory traffic needed), so higher precision could slow things down.
- On the flip side, if you’re doing lots (e.g. millions) of iterations, then it makes sense to go to higher precision to limit the accumulated rounding error, as the sketch after this list shows.
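A quick sketch of the accumulation effect; the exact totals depend on the platform, but the single-precision drift is large:

```c
#include <stdio.h>

int main(void) {
    float  fsum = 0.0f;
    double dsum = 0.0;

    /* Add 0.1 ten million times: each addition rounds, and the
       error compounds far faster in single precision. */
    for (int i = 0; i < 10000000; i++) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    printf("float  sum: %f\n", fsum);  /* far from the ideal 1000000 */
    printf("double sum: %f\n", dsum);  /* very close to 1000000 */
    return 0;
}
```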
In C, addition is left-associative in terms of grouping, so a+b+c+d parses as ((a+b)+c)+d, but the actual calculation may be done in higher precision.
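For example, the grouping alone changes the result here (assuming plain double evaluation with no extended-precision intermediates):

```c
#include <stdio.h>

int main(void) {
    double big = 1e20, neg = -1e20, one = 1.0;

    /* a + b + c parses as (a + b) + c */
    printf("%g\n", big + neg + one);    /* (1e20 - 1e20) + 1  ->  1 */
    printf("%g\n", big + (neg + one));  /* 1e20 + (-1e20)     ->  0 */
    return 0;
}
```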
IEEE 754 (2008) recommends that a compiler should not, by default, translate an a*b+c expression into a single FMA instruction, since the fused operation rounds only once and therefore changes results. GCC has an option (-ffp-contract=off) to explicitly prevent this from happening.
On top of what the compiler will do, the operating system also makes (default) decisions about the floating-point environment, so the same program may behave differently on different OSes.
For math operations, the C++14 standard refers to the C99 standard, not the C11 standard, mostly due to how long standardization bureaucracy takes.
C11 only claims to support the 1985 edition of IEEE 754!
The main header files in C related to floating point are:
- float.h
- math.h
- tgmath.h
- fenv.h
- complex.h (maybe)
Note that the C standard does not even require the base to be 2! It is obtained using the FLT_RADIX macro.
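A quick way to inspect these parameters (all from float.h):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("FLT_RADIX    = %d\n", FLT_RADIX);     /* 2 on virtually all current hardware */
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);  /* significand digits, in base FLT_RADIX */
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
    return 0;
}
```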
tgmath.h has “type-generic” macros. So for example if you call sin with a float argument, it will call sinf behind the scenes.
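A small sketch of the dispatch; since sizeof does not evaluate its operand, it is a cheap way to see which function the macro selected (the printed sizes are the usual 4 and 8, but are implementation-defined):

```c
#include <stdio.h>
#include <tgmath.h>

int main(void) {
    float  f = 0.5f;
    double d = 0.5;

    /* sizeof does not evaluate the call, but the type-generic
       macro still expands, so the result type reveals the choice. */
    printf("%zu\n", sizeof sin(f));  /* sizeof(float):  sinf was picked */
    printf("%zu\n", sizeof sin(d));  /* sizeof(double): sin  was picked */
    return 0;
}
```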
In C, the support for signed infinities and signed zeros is optional.
In C, support for signaling NaNs doesn’t exist (in the standard). The goal seems to have been algebraic closure, which quiet NaNs offer.
The C11 standard provides the macro FLT_EVAL_METHOD, which specifies the format in which intermediate calculations take place (e.g. when adding a double and a float).
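Checking it on a given target (the meanings of the values are from the standard):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /*  0: each operation is evaluated in its own type
        1: float and double operations are evaluated as double
        2: everything is evaluated as long double (classic x87)
       -1: indeterminable */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}
```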
The C11 standard allows contraction (e.g. using an FMA operation even when you did not explicitly ask for one). You can control this using the FP_CONTRACT pragma.
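A sketch showing both directions: the pragma to forbid implicit contraction, and the fma function from math.h to request fusion explicitly. Compiler support for the pragma varies (Clang honors it; GCC has historically ignored it, relying on -ffp-contract instead):

```c
#include <math.h>

/* Forbid the compiler from silently fusing a*b + c. */
#pragma STDC FP_CONTRACT OFF

double two_roundings(double a, double b, double c) {
    return a * b + c;    /* multiply rounds, then add rounds */
}

double one_rounding(double a, double b, double c) {
    return fma(a, b, c); /* explicitly fused: a*b + c with a single rounding */
}
```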
In the C99 standard, pow(-1, Inf) = 1. The idea is that any large enough number is an even integer. Also, pow(x, 0) = 1 even when \(x = 0\) or \(x\) is NaN.
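These special cases are easy to verify (INFINITY and NAN are from math.h; on most Unix systems, link with -lm):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    printf("pow(-1, +inf) = %g\n", pow(-1.0, INFINITY)); /* 1 */
    printf("pow(0, 0)     = %g\n", pow(0.0, 0.0));       /* 1 */
    printf("pow(NaN, 0)   = %g\n", pow(NAN, 0.0));       /* 1 */
    return 0;
}
```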
C++ has the class template std::numeric_limits, which lets you query the properties of the floating-point types (precision, range, whether infinities and NaNs are available) and make decisions based on them.
Very important: the cmath header uses the name sin for both single and double precision, and relies on overloading to determine which function is called.
I skipped the Fortran section.
Briefly scanned the Java section.
For Python, floating-point behavior depends on the machine architecture and the C compiler the interpreter was built with. The built-in float supports only double precision, but the numpy package supports a lot more formats.