
Binary scaling

A value represented with binary scaling is more precise than a floating-point representation occupying the same number of bits, but it typically covers a more limited range of values and therefore leads more easily to arithmetic overflow during computation. Implementing the operations with integer arithmetic instructions is often (but not always) faster than using the corresponding floating-point instructions. A position for the 'binary point' is chosen for each variable to be represented, and the binary shifts associated with arithmetic operations are adjusted accordingly.

The binary scaling corresponds to the first number in the Q (number format) notation: Q1.15 denotes a 16-bit integer scaled with one bit for the integer part and fifteen bits for the fraction. A B1 or Q1.15 number can therefore represent values from approximately 0.999 down to −1.0 in floating-point terms.

To give an example, a common way to use integer arithmetic to simulate floating point with 32-bit numbers is to multiply the coefficients by 65536. In binary scientific notation, this places the binary point at B16: the most significant 16 bits represent the integer part and the remaining 16 bits represent the fractional part. As a signed two's-complement integer, a B16 number can hold a highest value of approximately 32767.9999847 and a lowest value of −32768.0. Put another way, the B number is the number of integer bits used to represent the number, and it defines the value range; the remaining low-order (non-integer) bits store fractional quantities and supply additional precision. For instance, to represent 1.2 and 5.6 as B16 numbers, one multiplies them by 2^16 (65536), giving approximately 78643 and 367001 respectively.
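The B16 arithmetic described above can be sketched in C as follows. This is a minimal illustration, not a reference implementation; the names to_fixed, to_double, fx_add and fx_mul are hypothetical helpers introduced here, not part of any standard library.

#include <stdio.h>
#include <stdint.h>

/* Minimal sketch of B16 binary scaling using signed 32-bit integers.
   Multiplying by 2^16 places the binary point after the top 16 bits. */

#define SCALE 65536          /* 2^16: one unit of the low bits is 1/65536 */

typedef int32_t fixed_t;     /* signed two's-complement B16 value */

/* Convert a double to B16 by multiplying by the scale factor. */
static fixed_t to_fixed(double x)   { return (fixed_t)(x * SCALE); }

/* Convert a B16 value back to double for printing. */
static double  to_double(fixed_t x) { return (double)x / SCALE; }

/* Addition needs no shift: both operands already share the B16 scale. */
static fixed_t fx_add(fixed_t a, fixed_t b) { return a + b; }

/* Multiplication doubles the scale (2^16 * 2^16 = 2^32), so the 64-bit
   product is shifted right by 16 bits to restore the B16 scale. */
static fixed_t fx_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> 16);
}

int main(void)
{
    fixed_t a = to_fixed(1.2);   /* 78643  ~= 1.2 * 65536 */
    fixed_t b = to_fixed(5.6);   /* 367001 ~= 5.6 * 65536 */

    printf("a     = %d (%f)\n", a, to_double(a));
    printf("b     = %d (%f)\n", b, to_double(b));
    printf("a + b = %f\n", to_double(fx_add(a, b)));   /* ~6.8 */
    printf("a * b = %f\n", to_double(fx_mul(a, b)));   /* ~6.72 */
    return 0;
}

The key design point is that addition and subtraction work directly on the scaled integers, while multiplication (and division) require a compensating shift because the scale factors of the two operands combine.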

[ "Arbitrary-precision arithmetic", "Floating point", "Saturation arithmetic", "IEEE 754-1985", "Half-precision floating-point format", "NaN", "Minifloat" ]