A summary of double-precision, single-precision, and half-precision floating-point numbers
2022-04-23 17:51:00 【ppipp1109】
I recently ran into a problem involving 16-bit half-precision floating-point numbers that took me a long time to sort out. After studying the topic, here is a summary:
1. Structure of single-precision (32-bit) floating-point numbers:
Name            Length   Bit position
Sign (S):       1 bit    (b31)
Exponent (E):   8 bits   (b30-b23)
Mantissa (M):   23 bits  (b22-b0)
The exponent (E) is stored in biased form, with a bias of 127, so that both positive and negative exponents can be represented: E < 127 means a negative exponent, otherwise the exponent is nonnegative.
Value: (-1)^S × (1.M) × 2^(E-127) (Formula 1)
Note: %f prints a float with 6 digits after the decimal point; a float generally carries about 7 significant decimal digits.
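As a minimal sketch of the layout above, the following C helper splits a float into its sign, exponent, and mantissa fields. The names F32Fields and f32_decompose are illustrative and do not come from the original article, and the code assumes that float is a 32-bit IEEE 754 type, which holds on most current platforms but is not guaranteed by the C standard.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit float. */
typedef struct {
    uint32_t sign;      /* b31                      */
    uint32_t exponent;  /* b30-b23, biased by 127   */
    uint32_t mantissa;  /* b22-b0, without the implicit leading 1 */
} F32Fields;

/* Assumption: float is IEEE 754 binary32. */
static F32Fields f32_decompose(float value)
{
    uint32_t bits;
    F32Fields f;

    /* memcpy reinterprets the bytes without violating aliasing rules. */
    memcpy(&bits, &value, sizeof bits);

    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, -0.15625 is -1.01 (binary) × 2^(-3), so f32_decompose(-0.15625f) should give sign 1, exponent 124 (that is, -3 + 127), and mantissa 0x200000.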
2. Structure of double-precision (64-bit) floating-point numbers:
Name            Length   Bit position
Sign (S):       1 bit    (b63)
Exponent (E):   11 bits  (b62-b52)
Mantissa (M):   52 bits  (b51-b0)
The double-precision exponent (E) uses a bias of 1023.
Value: (-1)^S × (1.M) × 2^(E-1023) (Formula 2)
Note: double-precision numbers can also be printed with %f, which again shows 6 digits after the decimal point; a double generally carries about 15 to 16 significant decimal digits. This is particularly important when calculating monetary amounts: digits beyond the significant digits are meaningless and usually lead to errors.
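The point about significant digits can be checked with a short sketch. The helper names format_default_f and fits_in_float are made up for this illustration; the only library call used is the standard snprintf.

```c
#include <stdio.h>

/* Hypothetical helper: format a value with the default %f conversion,
   which always shows six digits after the decimal point. */
static int format_default_f(char *buf, size_t size, double value)
{
    return snprintf(buf, size, "%f", value);
}

/* Hypothetical helper: returns 1 if the value survives a round trip
   through single precision unchanged, 0 if digits are lost. */
static int fits_in_float(double value)
{
    return (double)(float)value == value;
}
```

With these helpers, 123456789.0 stored in a double formats as 123456789.000000, while the same literal stored in a float formats as 123456792.000000: the float keeps only about 7 significant digits, so the trailing digits printed by %f are not meaningful.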
3. Structure of half-precision (16-bit) floating-point numbers:
Name            Length   Bit position
Sign (S):       1 bit    (b15)
Exponent (E):   5 bits   (b14-b10)
Mantissa (M):   10 bits  (b9-b0)
A more recent format, bfloat16, uses the same total number of bits as half precision but keeps the same exponent width as single precision, that is, 8 exponent bits. It can therefore represent the same numeric range as single precision, at the cost of mantissa bits, that is, precision.
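Because bfloat16 shares its sign and exponent layout with single precision, a conversion can be sketched as simply keeping the upper 16 bits of the float. The function names below are illustrative, the code assumes a 32-bit IEEE 754 float, and it truncates the discarded bits instead of rounding them, which real implementations often do differently.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: float -> bfloat16 by truncation (keeps sign, 8 exponent bits,
   and the top 7 mantissa bits; no rounding is applied). */
static uint16_t float_to_bfloat16_trunc(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Sketch: bfloat16 -> float by padding the low 16 bits with zeros. */
static float bfloat16_to_float(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Round-tripping 3.14159f through these two functions gives 3.140625, which shows the loss of precision, while a large value such as 1e38f stays finite, which shows the preserved range.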
Half-precision floating point is a binary floating-point data type that is stored in 2 bytes (16 bits). IEEE 754-2008 calls it binary16. The type is suited to storing numbers that do not require high precision, and is not well suited to computation.
Half precision is a relatively new floating-point type. NVIDIA defined it as the half data type in the Cg language, released in early 2002, and first implemented it in the GeForce FX, released in late 2002. ILM was looking for an image format that offered high dynamic range without consuming too much disk space and memory, and that could still be used for floating-point computation in the way single and double precision can. The hardware-accelerated programmable shading group at SGI, led by John Airey, invented the s10e5 data type in 1997 as part of the 'bali' design effort. It was described in a SIGGRAPH 2000 paper (see section 4.3) and further documented in US patent 7518615.
Half-precision floating-point numbers are used in several computer graphics environments, including OpenEXR, JPEG XR, OpenGL, the Cg language, and D3DX. Compared with 8-bit or 16-bit integers, the advantage is a larger dynamic range, so that more detail is preserved in high-contrast images. Compared with single precision, the advantage is that only half the storage space and bandwidth are needed (at the expense of precision and numeric range).
Half-precision floating-point numbers in detail:
IEEE 754-2008 includes a "half-precision" format that is only 16 bits wide, also called binary16. This type of floating-point number is suited to storing numbers that do not require high precision, and is not well suited to computation. Compared with single precision, its advantage is that it needs only half the storage space and bandwidth; its disadvantage is lower precision.
The half-precision format is similar to the single-precision format. The leftmost bit is still the sign bit, the exponent is 5 bits wide and stored in excess-15 form (a bias of 15), and the mantissa is 10 bits wide with an implicit leading 1.
As shown in the figure, sign is the sign bit: 0 means the number is positive, 1 means it is negative.
Let's look at the mantissa first and then the exponent. The fraction (mantissa) field is 10 bits long, but there is an implicit leading 1. The mantissa can be understood as the digits after the binary point: for binary 1.11, for example, the stored mantissa is 1100000000 and the leading 1 is implied rather than stored. The implicit 1 takes part in calculations, where it may produce a carry.
exponent is the exponent field. It is 5 bits long and is interpreted as follows:
When the exponent bits are all 0 and the mantissa bits are also all 0, the value is 0.
When the exponent bits are all 0 and the mantissa bits are not all 0, the value is a subnormal value (a denormalized floating-point number), which is a very small number.
When the exponent bits are all 1 and the mantissa bits are all 0, the value is infinity: positive infinity if the sign bit is 0, negative infinity if the sign bit is 1.
When the exponent bits are all 1 and the mantissa bits are not all 0, the value is not a number (NaN).
In all other cases, subtracting 15 from the value of the exponent field gives the exponent it represents. For example, 11110 means 30-15=15.
So for normal values, a half-precision number is computed as (-1)^sign × 2^(exponent field value - 15) × (1 + 0.mantissa)
Remark: "0.mantissa" here means that if the mantissa bits are 0001110001, then 0.mantissa is the binary fraction 0.0001110001
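The rules above can be sketched as a field-by-field decoder in C. The names half_bits_to_float and pow2f are illustrative, not from the original article. The subnormal branch uses the standard IEEE 754 interpretation 0.mantissa × 2^(-14), which the text above does not spell out.

```c
#include <math.h>    /* only for the INFINITY and NAN macros */
#include <stdint.h>

/* Hypothetical helper: exact power of two for small exponents. */
static float pow2f(int e)
{
    float r = 1.0f;
    while (e > 0) { r *= 2.0f; --e; }
    while (e < 0) { r *= 0.5f; ++e; }
    return r;
}

/* Sketch: decode 16 half-precision bits into a float, case by case. */
static float half_bits_to_float(uint16_t h)
{
    int sign     = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;
    int mantissa = h & 0x3FF;
    float fraction = (float)mantissa / 1024.0f;  /* the "0.mantissa" part */
    float magnitude;

    if (exponent == 0) {
        /* Zero (mantissa == 0) or subnormal: no implicit leading 1. */
        magnitude = fraction * pow2f(-14);
    } else if (exponent == 31) {
        /* All exponent bits set: infinity or NaN. */
        magnitude = (mantissa == 0) ? INFINITY : NAN;
    } else {
        /* Normal value: (1 + 0.mantissa) * 2^(exponent - 15). */
        magnitude = (1.0f + fraction) * pow2f(exponent - 15);
    }
    return sign ? -magnitude : magnitude;
}
```

Every half-precision value is exactly representable as a float, so this decoder does not need any rounding.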
A few examples:
The largest value that half precision can represent: 0 11110 1111111111. It is computed as (-1)^0 × 2^(30-15) × 1.1111111111 (binary) = 1.1111111111 × 2^15, which is 65504 in decimal.
The smallest positive normal value that half precision can represent (excluding subnormal values): 0 00001 0000000000. It is computed as (-1)^0 × 2^(1-15) × 1.0 = 2^(-14), approximately 6.104×10^(-5) in decimal.
Now an ordinary number, this time going the other way: -1.5625×10^(-1), that is, -0.15625 = -0.00101 (decimal to binary) = -1.01×2^(-3). The sign bit is therefore 1, the exponent field is -3+15=12, which is 01100 in binary, and the mantissa bits are 0100000000. So -1.5625×10^(-1) as a half-precision number is 1 01100 0100000000
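The last example, which goes from a decimal value to bits, can be sketched as a small encoder. The name encode_half_normal is illustrative. The sketch handles only magnitudes in the normal half-precision range from 2^(-14) to 65504, returns -1 for anything else (an arbitrary choice made for this example), and truncates mantissa bits that do not fit instead of rounding them.

```c
#include <stdint.h>

/* Sketch: encode a float in the normal half-precision range.
   Returns the 16-bit pattern, or -1 if the value is outside that range
   (zero, subnormal, too large, infinity, NaN). */
static int encode_half_normal(float value)
{
    int sign = 0;
    int exponent = 0;
    int mantissa;

    if (value < 0.0f) { sign = 1; value = -value; }

    /* Written so that NaN also fails the check. */
    if (!(value >= 6.103515625e-05f && value <= 65504.0f))
        return -1;

    /* Normalise to 1.xxx (binary) * 2^exponent. */
    while (value >= 2.0f) { value *= 0.5f; ++exponent; }
    while (value <  1.0f) { value *= 2.0f; --exponent; }

    /* Keep the first 10 bits after the binary point (truncation). */
    mantissa = (int)((value - 1.0f) * 1024.0f);

    /* Store the exponent with the bias of 15. */
    return (sign << 15) | ((exponent + 15) << 10) | mantissa;
}
```

Under these assumptions, encode_half_normal(-0.15625f) yields 0xB100, which is the bit pattern 1 01100 0100000000 from the example, and encode_half_normal(65504.0f) yields 0x7BFF.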
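Code, tested in practice:
Float16ToFloat32:
Converts a 16-bit integer that holds IEEE 754 half-precision bits into a 32-bit float. Only normal values are handled; zero, subnormal values, infinity, and NaN are not converted correctly.
#include <stdint.h>
#include <string.h>

float Float16ToFloat( short fltInt16 )
{
    /* Work on unsigned values so that the shifts are well defined. */
    uint32_t h = (uint16_t)fltInt16;
    /* Move the sign bit from b15 to b31. */
    uint32_t fltInt32 = (h & 0x8000u) << 16;
    /* Shift exponent and mantissa into place, then rebias the exponent:
       (127 - 15) << 23 == 0x38000000. */
    fltInt32 |= ((h & 0x7fffu) << 13) + 0x38000000u;
    float fRet;
    memcpy( &fRet, &fltInt32, sizeof( float ) );
    return fRet;
}
Float32ToFloat16:
Converts a 32-bit IEEE 754 float into a 16-bit integer that holds half-precision bits. Only values in the normal half-precision range are handled, and the low 13 mantissa bits are truncated rather than rounded.
short FloatToFloat16( float value )
{
    uint32_t fltInt32;
    uint16_t fltInt16;
    memcpy( &fltInt32, &value, sizeof( float ) );
    /* Drop the sign, discard the low 13 mantissa bits, and rebias the exponent. */
    fltInt16 = (uint16_t)( ((fltInt32 & 0x7fffffffu) >> 13) - (0x38000000u >> 13) );
    /* Move the sign bit from b31 to b15. */
    fltInt16 |= (uint16_t)( (fltInt32 & 0x80000000u) >> 16 );
    return (short)fltInt16;
}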
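The shift-and-rebias trick is compact, but it treats every input as a normal value. The sketch below restates the half-to-float direction with unsigned types under the illustrative name half_to_float_shortcut, so that the behavior on special bit patterns can be checked in isolation; it again assumes a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: the same bit trick, applied blindly to any 16-bit pattern. */
static float half_to_float_shortcut(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    /* Rebias the exponent by adding (127 - 15) << 23. */
    uint32_t rest = ((uint32_t)(h & 0x7FFFu) << 13) + 0x38000000u;
    uint32_t bits = sign | rest;
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Normal inputs behave as expected: 0x3C00 gives 1.0 and 0xB100 gives -0.15625. The special patterns do not: 0x0000 (zero) comes out as 2^(-15), about 3.05×10^(-5), and 0x7C00 (positive infinity) comes out as 65536. Code that may see zero, subnormal values, infinity, or NaN needs to test the exponent field first, following the case list given earlier in this article.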
Copyright notice
This article was written by [ppipp1109]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230549076149.html