Summary of double-precision, single-precision, and half-precision floating point

2022-04-23 17:51:00 【ppipp1109】

I recently ran into a problem involving 16-bit half-precision floating-point numbers that took a long time to sort out. I studied the topic and summarize it here:

1. Structure of a single-precision (32-bit) floating-point number

Name            Length     Bit position

Sign (S):       1 bit      (b31)
Exponent (E):   8 bits     (b30-b23)
Mantissa (M):   23 bits    (b22-b0)

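The exponent field (E) is stored in biased (offset) form so that it can represent both positive and negative exponents. The single-precision bias is 127: if E < 127 the exponent is negative, otherwise it is non-negative.

Value of a normal number: (-1)^S × (1.M) × 2^(E-127) (Formula 1)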

Note: %f prints a float with 6 digits after the decimal point. A float generally has about 7 significant decimal digits.
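To make the layout above concrete, here is a minimal C sketch that splits a float into its three fields and rebuilds a normal value from them. The names `FloatFields`, `split_float`, and `rebuild_normal` are made up for this illustration, and the code assumes `float` is a 32-bit IEEE 754 type, which is true on mainstream platforms but not guaranteed by the C standard.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Illustrative names, not from any library. Assumes float is IEEE 754 binary32. */
typedef struct {
    uint32_t sign;      /* b31 */
    uint32_t exponent;  /* b30-b23, stored with a bias of 127 */
    uint32_t mantissa;  /* b22-b0 */
} FloatFields;

/* Copy the bytes of a float into an integer and mask out each field. */
FloatFields split_float(float value)
{
    uint32_t bits;
    FloatFields f;
    memcpy(&bits, &value, sizeof bits);
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Rebuild a normal value (exponent field 1..254):
   (-1)^S * (1 + M / 2^23) * 2^(E - 127). */
double rebuild_normal(FloatFields f)
{
    double significand = 1.0 + (double)f.mantissa / 8388608.0;  /* 2^23 */
    double magnitude   = ldexp(significand, (int)f.exponent - 127);
    return f.sign ? -magnitude : magnitude;
}
```

For -0.15625f, which is -1.01 (binary) × 2^(-3), `split_float` gives sign 1, exponent field 124 (that is, -3 + 127), and mantissa 0x200000.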

2. Structure of a double-precision (64-bit) floating-point number

Name            Length     Bit position

Sign (S):       1 bit      (b63)
Exponent (E):   11 bits    (b62-b52)
Mantissa (M):   52 bits    (b51-b0)

The double-precision exponent field (E) uses a bias of 1023.

Value of a normal number: (-1)^S × (1.M) × 2^(E-1023) (Formula 2)

Note: a double can also be printed with %f, which again shows 6 digits after the decimal point. A double generally has about 16 significant decimal digits. (This matters especially when computing monetary amounts: digits beyond the significant digits are meaningless, and relying on them usually leads to errors.)
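The difference in significant digits is easy to observe. The following is a small sketch, with made-up helper names, that checks whether an amount survives being stored in a float and prints the same amount at both precisions. It assumes IEEE 754 float and double.

```c
#include <stdio.h>

/* Illustrative helpers; the names are made up for this sketch. */

/* Returns 1 if storing `amount` in a float changes its value, 0 otherwise. */
int float_loses_value(double amount)
{
    float narrowed = (float)amount;
    return (double)narrowed != amount;
}

/* Prints the same amount stored as a float and as a double. */
void print_both(double amount)
{
    float narrowed = (float)amount;
    printf("float : %f\n", narrowed);   /* the float is promoted to double for printf */
    printf("double: %f\n", amount);
}
```

Calling `print_both(123456789.12)` prints 123456792.000000 for the float and 123456789.120000 for the double: near this magnitude adjacent floats are 8 apart, so the float is already wrong in the units digit.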

3. Structure of a half-precision (16-bit) floating-point number

Name            Length     Bit position

Sign (S):       1 bit      (b15)
Exponent (E):   5 bits     (b14-b10)
Mantissa (M):   10 bits    (b9-b0)

More recently, a format called bfloat16 has appeared. It uses the same total of 16 bits as half precision but keeps the same 8 exponent bits as single precision, so it can represent the same numeric range as single precision, at the cost of mantissa bits, that is, of precision.
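Because bfloat16 keeps the sign bit and the 8 exponent bits of single precision, leaving 7 mantissa bits, it corresponds to the upper 16 bits of a float. The sketch below shows that relationship. The function names are made up, the conversion truncates instead of rounding (real implementations usually round), and float is assumed to be IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative names. Assumes float is IEEE 754 binary32. */

/* Keep the top 16 bits of a float: sign, 8 exponent bits, 7 mantissa bits.
   The low 16 mantissa bits are simply dropped (truncation, no rounding). */
uint16_t float_to_bfloat16_trunc(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Widen back to float by filling the low 16 bits with zeros. */
float bfloat16_to_float(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

A value such as 1.0e30f, far beyond the half-precision maximum of 65504, survives the round trip with a relative error below 1%, while 1.00390625f (1 + 2^(-8)) comes back as exactly 1.0f because the bit that distinguished it was dropped.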

Half-precision floating point is a binary floating-point data type that occupies 2 bytes (16 bits). IEEE 754-2008 calls it binary16. It is suited to storing numbers that do not need high precision, not to performing calculations.
Half precision is a relatively recent floating-point type. NVIDIA defined it as the half data type in the Cg language, released in early 2002, and first implemented it in the GeForce FX, released in late 2002. ILM had been looking for an image format with high dynamic range that did not consume too much disk space and memory and that could still be used for floating-point computation in the way single-precision and double-precision numbers are. The hardware-accelerated programmable shading group led by John Airey at SGI invented the s10e5 data type in 1997 as part of the 'bali' design effort. It was described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615.
Half-precision floating point is used in several computer graphics environments, including OpenEXR, JPEG XR, OpenGL, the Cg language, and D3DX. Compared with 8-bit or 16-bit integers, its advantage is a larger dynamic range, which preserves more detail in high-contrast images. Compared with single precision, its advantage is that it needs only half the storage space and bandwidth (at the cost of precision and numeric range).

Half-precision floating point in detail:

IEEE 754-2008 includes a "half-precision" format that is only 16 bits wide, also called binary16. This type is suited to storing numbers that do not need high precision, not to calculation. Compared with single precision, it needs only half the storage space and bandwidth, but its precision is lower.

The half-precision format resembles the single-precision format. The leftmost bit is still the sign bit. The exponent is 5 bits wide and stored in excess-15 form (a bias of 15). The mantissa is 10 bits wide, with an implicit leading 1.

As the figure shows, sign is the sign bit: 0 means the number is positive, 1 means it is negative.

We look at the mantissa first and the exponent second. fraction is the mantissa. It is 10 bits long, plus an implicit 1 that is not stored. The mantissa can be understood as the digits after the binary point: for the binary number 1.11, for example, the stored mantissa is 1100000000 and the leading 1 is implied. The implicit 1 matters mainly during calculation, where a carry can propagate into it.

exponent is the exponent field. It is 5 bits long, and its values are interpreted as follows:

When the exponent bits are all 0 and the mantissa bits are all 0, the value is 0.
When the exponent bits are all 0 and the mantissa bits are not all 0, the value is a subnormal (denormalized) number, which is very small.
When the exponent bits are all 1 and the mantissa bits are all 0, the value is infinity: positive infinity if the sign bit is 0, negative infinity if it is 1.
When the exponent bits are all 1 and the mantissa bits are not all 0, the value is NaN (not a number).
In all other cases, the exponent is the value of the exponent field minus 15. For example, 11110 means 30-15=15.
So the value of a normal half-precision number is (-1)^sign × 2^(exponent field - 15) × (1 + 0.mantissa)

Note: 0.mantissa means the mantissa bits read as a binary fraction. If the mantissa bits are 0001110001, then 0.mantissa is binary 0.0001110001.
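The rules above can be sketched as a small decoder that turns a 16-bit pattern into a double. The function name is made up for this illustration. The subnormal branch uses the standard IEEE 754 rule (no implicit 1, exponent fixed at -14), which the text above mentions only briefly.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative decoder; the function name is made up for this sketch. */
double half_bits_to_double(uint16_t h)
{
    int sign     = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;   /* 5-bit field, bias 15 */
    int mantissa = h & 0x3FF;          /* 10-bit field */
    double magnitude;

    if (exponent == 0) {
        /* Zero or subnormal: (mantissa / 1024) * 2^-14 = mantissa * 2^-24. */
        magnitude = ldexp((double)mantissa, -24);
    } else if (exponent == 31) {
        /* All ones: infinity if the mantissa is zero, otherwise NaN. */
        magnitude = (mantissa == 0) ? INFINITY : NAN;
    } else {
        /* Normal: (1 + mantissa / 1024) * 2^(exponent - 15). */
        magnitude = ldexp(1.0 + mantissa / 1024.0, exponent - 15);
    }
    return sign ? -magnitude : magnitude;
}
```

For example, the pattern 0 01111 0000000000 (0x3C00) decodes to 1.0, and 0 11111 0000000000 (0x7C00) decodes to positive infinity.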

A few examples:

The largest value representable in half precision: 0 11110 1111111111. The calculation is (-1)^0 × 2^(30-15) × 1.1111111111 (binary) = 1.1111111111 × 2^15, which is 65504 in decimal.
The smallest positive value representable in half precision (excluding subnormal values): 0 00001 0000000000. The calculation is (-1)^0 × 2^(1-15) × 1.0 = 2^(-14), approximately 6.104×10^(-5) in decimal.
An ordinary number, this time going the other way: -1.5625×10^(-1), that is, -0.15625 = -0.00101 (decimal converted to binary) = -1.01×2^(-3). The sign bit is therefore 1, the exponent field is -3+15=12, which is 01100 in binary, and the mantissa is 0100000000. So -1.5625×10^(-1) as a half-precision number is 1 01100 0100000000.
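The manual steps of the last example (normalize to 1.f × 2^e, add the bias, keep 10 fraction bits) can be written down directly. This is only a sketch under stated assumptions: `encode_half_normal` is a made-up name, the input magnitude must lie in the normal half-precision range from 2^(-14) to 65504, and extra fraction bits are truncated rather than rounded.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative encoder for normal values only (2^-14 <= |x| <= 65504).
   Zero, subnormals, infinities and NaN are not handled. */
uint16_t encode_half_normal(double x)
{
    unsigned sign = signbit(x) ? 1u : 0u;

    /* frexp gives |x| = m * 2^e with 0.5 <= m < 1,
       which is the same as (2m) * 2^(e-1) with 1 <= 2m < 2. */
    int e;
    double m = frexp(fabs(x), &e);

    unsigned exp_field = (unsigned)((e - 1) + 15);   /* add the bias of 15 */
    double fraction    = m * 2.0 - 1.0;              /* the part after "1." */
    unsigned mant      = (unsigned)(fraction * 1024.0); /* keep 10 bits, truncating */

    return (uint16_t)((sign << 15) | (exp_field << 10) | mant);
}
```

`encode_half_normal(-0.15625)` returns 0xB100, which is the pattern 1 01100 0100000000 derived by hand above, and `encode_half_normal(65504.0)` returns 0x7BFF, the maximum-value pattern.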

Tested code:

Float16ToFloat32：

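Converts a 16-bit integer holding an IEEE 754 half-precision bit pattern into a 32-bit float.

#include <stdint.h>
#include <string.h>

/* Handles zero and normal values only; subnormals, infinities and NaN are not converted correctly. */
float Float16ToFloat( short fltInt16 )
{
    uint32_t bits     = (uint16_t)fltInt16;                  /* avoid sign extension of the short */
    uint32_t fltInt32 = (bits & 0x8000u) << 16;              /* sign: bit 15 -> bit 31 */
    if( bits & 0x7fffu )                                     /* keep +0 and -0 as zero */
        fltInt32 |= ((bits & 0x7fffu) << 13) + 0x38000000u;  /* align fields, rebias by 127-15=112 */

    float fRet;
    memcpy( &fRet, &fltInt32, sizeof( float ) );
    return fRet;
}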

Float32ToFloat16：

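Converts a 32-bit IEEE 754 float into a 16-bit integer holding the half-precision bit pattern.

#include <stdint.h>
#include <string.h>

/* Valid for magnitudes in the normal half range (2^-14 to 65504); smaller values become zero.
   Extra mantissa bits are truncated; larger values, infinities and NaN are not handled. */
short FloatToFloat16( float value )
{
    uint16_t fltInt16 = 0;
    uint32_t fltInt32;
    memcpy( &fltInt32, &value, sizeof( float ) );
    if( (fltInt32 & 0x7fffffffu) >= 0x38800000u )            /* at least 2^-14 */
        fltInt16 = (uint16_t)(((fltInt32 & 0x7fffffffu) >> 13) - (0x38000000u >> 13));
    fltInt16 |= (uint16_t)((fltInt32 & 0x80000000u) >> 16);  /* sign: bit 31 -> bit 15 */

    return (short)fltInt16;
}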