Floating point numbers

Integers

In mathematics we have different types of numbers: Integers, Rationals, Reals, Complex, ...

For a minute, let's focus on integers.

There are infinitely many integers mathematically, but only finitely many representable on the computer.

On the computer there may be more than one storage type for each mathematical type. For example, Julia has 12 built-in types for integers (signed and unsigned, with different storage sizes).

Int64 is typical. Int8 has only $256=2^8$ possible values, covering $-2^7$ to $2^7-1$.
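
We can check the extremes directly:

typemin(Int8), typemax(Int8)
(-128,127)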

Internally, the values are kept in binary:

convert(Int8, 5) |> bits
"00000101"

The leading $0$ is the sign bit, here indicating a positive value.

Binary numbers

Binary numbers use powers of $2$, not $10$. So – as you likely know – $1101$ in binary is just $1\cdot 2^0 + 0\cdot 2^1 + 1 \cdot 2^2 + 1\cdot 2^3 = 13$.

The largest value in Int8 would then be $1 + 2 + 4 + 8 + 16 + 32 + 64 = 127$.

Converting binary to decimal

It is easy to convert binary to decimal: we just mentally use powers of 2 instead of 10.

Here is some code:

x = "1101101100"
out = [ parseint(x[length(x) - i])* 2^i for i in 0:(length(x) - 1)]
10-element Array{Any,1}:
   0
   0
   4
   8
   0
  32
  64
   0
 256
 512

and we just add

sum(out)
876
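
Julia can also do the whole conversion in one step; in the (older) version used here parseint takes a base argument (newer versions spell this parse(Int, x, base=2)):

parseint(x, 2)
876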

Converting a decimal integer to binary

Start with an integer $n \geq 2$. We generate the binary digits of $x$ from left to right.

First find $p$ with $2^p \leq n < 2^{p+1}$. The number will have $p+1$ digits, and the leftmost one will be $1$.

Then replace $n$ with $n - 2^p$, apply the same step to the new value, and repeat.

n = 27
log2(n) # 4. So x = 1XXXX
4.754887502163468
n = n - 2^4
log2(n) # 3. So x = 11XXX
3.4594316186372973
n = n - 2^3
log2(n) # 1. So x = 1101X
1.584962500721156
n = n - 2^1 ## smaller than 2, so clearly x = 11011
1

and check:

x = "11011"
sum([parseint(x[length(x) - i])* 2^i for i in 0:(length(x) - 1)])
27
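
The procedure can be packaged into a function. Here is a minimal sketch (the name to_binary is ours, not a built-in):

function to_binary(n)
    n >= 1 || error("n must be a positive integer")
    ds = ""
    p = convert(Int, floor(log2(n)))   # 2^p <= n < 2^(p+1)
    for i in p:-1:0
        if n >= 2^i
            ds = ds * "1"              # a 1 in position i
            n = n - 2^i
        else
            ds = ds * "0"
        end
    end
    ds
end
to_binary(27)
"11011"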

Math with binary numbers

Adding two binary numbers follows the usual algorithm with carrying:

  1101
+ 1011
------
 11000  

$(1 + 4 + 8) + (1 + 2 + 8) = 13 + 11 = 24 = 16 + 8$

Multiplication

In decimal, multiplying or dividing by 10 is easy: add a $0$ or shift the decimal point. Similarly with binary:

   10101
*     10
--------
   00000
  10101
--------
  101010
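
In code, multiplying by $10_2 = 2$ is a left shift of the bits; Julia exposes this as the << operator:

21 << 1          ## 10101 shifted left is 101010
42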

Negative numbers

How are negative numbers stored?

bits(convert(Int8, 5)),  bits(convert(Int8, -5))
("00000101","11111011")

They are quite different!

Two's complement

The storage uses two's complement: $-x$ is stored as $2^n - x$.

We have $n$ bits ($n=8$ in our example; $64$ is typical, though $32$ is found on older machines). Positive numbers use the first $n-1$ bits, which is why the largest number is $2^{n-2} + 2^{n-3} + \cdots + 2 + 1 = 2^{n-1}-1$.

The leading bit could simply be a sign bit, but instead the values for negative numbers are stored in a different format: $-x$ (for $x > 0$) is stored as $2^n - x$.

bits(convert(Int8, 5)),  bits(convert(Int8, -5))
("00000101","11111011")
11111111    (Carry)
 00000101   (5=1*1 + 0*2 + 1*4)
 11111011   (2^8-5 = 251 = 1 + 2 + 8 + 16 + 32 + 64 + 128)
---------
100000000

Why

This makes addition easy, as illustrated above, or here with $15 + (-5)$:

11111111    (Carry)
 00001111   (15=1*1 + 1*2 + 1*4+1*8)
 11111011   (2^8-5 = 251 = 1 + 2 + 8 + 16 + 32 + 64 + 128)
---------
100001010

And $00001010 = 2 + 8 = 10$.
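
We can check that machine arithmetic agrees, using Int8 values so the example's bit patterns apply (note: 8-bit arithmetic may promote to Int in some Julia versions, but the low-order bits are the same):

convert(Int8, 15) + convert(Int8, -5)
10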

Multiplying and shifting

Let's see what happens with powers of 2:

bits(2)
"0000000000000000000000000000000000000000000000000000000000000010"
bits(2 * 2)
"0000000000000000000000000000000000000000000000000000000000000100"
bits(2 * 2^2)
"0000000000000000000000000000000000000000000000000000000000001000"

and ...

bits(2^62)
"0100000000000000000000000000000000000000000000000000000000000000"

and we go over the edge...

bits(2 * 2^62)
"1000000000000000000000000000000000000000000000000000000000000000"

The largest value is

bits(typemax(Int64))
"0111111111111111111111111111111111111111111111111111111111111111"

Which is $2^{63} - 1$.

Integers are stored exactly, when possible. But that has limitations: with 64-bit numbers, the largest integer is $2^{63}-1$.

2^63 - 1
9223372036854775807

but not $2^{64}$, which silently overflows to 0:

2^64
0

What happens? Multiplying by $2$ shifts the bits left:

bits(2)
"0000000000000000000000000000000000000000000000000000000000000010"
bits(2*2)
"0000000000000000000000000000000000000000000000000000000000000100"

Near the top of the range:

bits(2^62)
"0100000000000000000000000000000000000000000000000000000000000000"
bits(2^62 + 2^61)
"0110000000000000000000000000000000000000000000000000000000000000"
bits(2^62 * 2)
"1000000000000000000000000000000000000000000000000000000000000000"

But this is still correct:

bits(2^63 - 1)
"0111111111111111111111111111111111111111111111111111111111111111"

Decimal numbers

On a calculator there is one basic number type: floating point. This is primarily the case with computers as well.

In mathematics we primarily work with decimal numbers

$$ 12.34 = 1 \cdot 10^1 + 2 \cdot 10^0 + 3\cdot 10^{-1} + 4 \cdot 10^{-2} $$

Scientific notation

We can write a real number in terms of powers of 10. For example:

$$ 1.234 = 1.234 \cdot 10^0 = 12.34 \cdot 10^{-1} = .1234 \cdot 10^1 = \cdots $$

We can use normalized scientific notation to express $x$ by three quantities: a sign, a significand $r$ with $1 \leq r < 10$, and an integer exponent $n$:

$$ x = \pm r \cdot 10^n $$

Scientific notation with different bases

We can use different bases in scientific notation. A number would be represented with

$$ x = \pm q \cdot \beta^m $$

We can normalize the number by insisting $q = d.ddddd\dots$, where the leading digit of $q$ is between $1$ and $\beta - 1$.

A special case would be $\beta = 2$, or base 2, which forces the leading digit to be $1$.
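
Julia can pull these pieces out of a floating point value; assuming the Base functions significand and exponent are available, for $12.1 = 1.5125 \cdot 2^3$ we get:

significand(12.1), exponent(12.1)
(1.5125,3)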

Converting decimal to binary

We can convert decimal numbers to binary. The same simple algorithm as for integers works with some modification. Start with $x$. We want to produce digits $ddd.ddd\dots$, where each $d$ is either $0$ or $1$.

First, take the log base 2:

x = 12.1
log2(x) |> floor
3.0

This says $2^3 \leq x < 2^4$. Remember $3$, then subtract $2^3$ and repeat:

ds = [3]
x = x - 2.0^ds[end]
n = log2(x) |> floor
2.0

Remember 2 and then subtract $2^2$ and repeat:

push!(ds, n)
x = x - 2.0^ds[end]
n = log2(x) |> floor
-4.0

Remember $-4$ and repeat

push!(ds, n)
x = x - 2.0^ds[end]
n = log2(x) |> floor
-5.0

etc.

The ds will tell us there is a 1 in positions $3, 2, -4, -5, -8, -9, \dots$ So the binary expansion is 1100.000110011...

One thing to note: the numbers that terminate in decimal are generally different from those that terminate in binary. For example, $1/10$ terminates in decimal but repeats in binary.
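
The repeated steps above can be automated. A minimal sketch (binary_expansion is our name) that collects the positions holding a 1:

function binary_expansion(x, nterms)
    ps = Int[]                           # positions holding a 1
    for k in 1:nterms
        x <= 0 && break                  # expansion terminated
        p = convert(Int, floor(log2(x)))
        push!(ps, p)
        x = x - 2.0^p
    end
    ps
end
binary_expansion(12.1, 6)
6-element Array{Int64,1}:
  3
  2
 -4
 -5
 -8
 -9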

Floating point numbers

Floating point is a representation of numbers using scientific notation, only with constraints on the sizes: the significand is limited to $p$ digits of precision, and the exponent must lie between bounds $e_{min}$ and $e_{max}$.

A simple case

For example consider base $\beta=10$, $p=3$, $e_{min}=-1$ and $e_{max}=2$. Then the possible values range from $-9.99 \cdot 10^2$ to $9.99 \cdot 10^2$, with the smallest positive normalized value being $1.00 \cdot 10^{-1}$. How many are there?

$$ 2 \cdot ((\beta-1)\cdot\beta^{p-1}) \cdot (e_{max} - e_{min} + 1) $$
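
Counting for this example: a sign, $\beta - 1 = 9$ choices for the leading digit, $\beta^{p-1} = 100$ for the remaining digits, and $e_{max} - e_{min} + 1 = 4$ exponents:

2 * ((10-1) * 10^(3-1)) * (2 - (-1) + 1)
7200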

Binary floating point

For binary floating point, things are similar. For simplicity let's look at 16-bit floating point where

- 1 bit is the sign bit: 0 is $+1$, 1 is $-1$
- $q$ is represented with $10$ bits (the precision is 10)
- $m$ is represented with $5$ bits

There is nothing to represent the sign of $m$. The trick is to offset the stored value by subtracting: the actual exponent is $m - 15$.

With this, we can represent some numbers.

What is $1$? It is $+1 \cdot 1.0 \cdot 2^{15 - 15}$. So we should have

- sign is 0
- $q$ is 0000000000
- $m$ is 01111

Checking we have

convert(Float16, 1.0) |> bits
"0011110000000000"

Kinda hard to see, so let's wrap this in a function:

function seebits(x)
  b = bits(convert(Float16,x))
  b[1], b[2:6], b[7:end]
end
seebits (generic function with 1 method)
seebits(1)
('0',"01111","0000000000")

We have $2 = 1.0 \cdot 2^1$. So we expect $q$ to represent $0$ and $m$ to represent $16$, as $16-15 = 1$:

seebits(2)
('0',"10000","0000000000")

What about the sign bit?

seebits(-2)
('1',"10000","0000000000")

What about other numbers?

seebits(1 + 1/2 + 1/4 + 1/8 + 0/16 + 1/32)
('0',"01111","1110100000")
seebits(2^4*(1 + 1/2 + 1/4 + 1/8 + 0/16 + 1/32)) ## the exponent 4 is stored as 4 + 15 = 19 = 1 + 2 + 16
('0',"10011","1110100000")

Zero and other "special" numbers

Wait a minute – if we insist on the significand being $1.dddd...$ we can't represent $0$!

Some values in floating point are special:

bits(0.0)
"0000000000000000000000000000000000000000000000000000000000000000"

(The code uses the smallest possible exponent, and $0$ for the significand)

bits(-0.0)   ## why??
"1000000000000000000000000000000000000000000000000000000000000000"

"Infinity": Inf

Why infinity? This value is deemed valuable to support at the hardware level. It is coded by reserving the largest possible value of $m$ and $0$ for the significand.

bits(Inf)  # see bits(Inf)[2:12]
"0111111111110000000000000000000000000000000000000000000000000000"

There is room for $-\infty$ and it too is defined:

bits(-Inf)
"1111111111110000000000000000000000000000000000000000000000000000"

Not a number: NaN

NaN. This is a special value reserved for computations where no value is possible. Examples include 0/0 or 0 * Inf:

0/0, 0 * Inf
(NaN,NaN)

These are related to limit problems (indeterminate), though not all forms are indeterminate:

1/0, 0^0
(Inf,1)

How is NaN coded:

bits(NaN)
"0111111111111000000000000000000000000000000000000000000000000000"

This is very similar to Inf, but the value of $q$ is non-zero!

bits(NaN)[13:end], bits(Inf)[13:end]
("1000000000000000000000000000000000000000000000000000","0000000000000000000000000000000000000000000000000000")

NaN semantics are a bit controversial.
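
One consequence of those semantics: NaN compares unequal to everything, including itself:

NaN == NaN, NaN != NaN
(false,true)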

Poison

The values of Inf and NaN will "poison" subsequent operations. For example:

NaN + 1, Inf - 1
(NaN,Inf)

These special values are generated instead of errors being thrown for some common cases. (Whether something like sqrt(-1.0) is an error or NaN is left to the language.)
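
In the Julia version used here, sqrt(-1.0) happens to throw an error rather than return NaN:

sqrt(-1.0)
ERROR: DomainError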

Range of numbers

What is the range of the numbers that can be represented? Let's check with Float16.

The largest positive value would have $m$ coded as 11110, or $2 + 4 + 8 + 16 - 15 = 15$. (The value 11111 is reserved for Inf and NaN.)

The largest value for $q$ would be 1111111111, or

sum([1/2^i for i in 0:10])
1.9990234375

Altogether we have:

sum([1/2^i for i in 0:10]) * 2^15
65504.0

Is this right?

prevfloat(typemax(Float16))
float16(65504.0)

For the smallest positive number, the smallest exponent is code 00000 or $0 - 15 = -15$. So the value should be:

1/2^15
3.0517578125e-5

But this isn't actually the case:

nextfloat(convert(Float16, 0))
float16(5.9605e-8)

(There are tricks to get smaller numbers, called subnormal numbers: the exponent field is all zeros and the leading bit of the significand is no longer an implicit 1.)
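
Using seebits from above we can peek at the smallest subnormal: the exponent field is all zeros and a lone 1 sits in the last place of the significand:

seebits(nextfloat(convert(Float16, 0)))
('0',"00000","0000000001")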

For double precision numbers (Float64) the values are given by:

prevfloat(Inf), nextfloat(0.0)
(1.7976931348623157e308,5.0e-324)

Rounding

We are familiar with the task of rounding: a passing grade is a 59.5, not a 60!

We have three types of rounding: rounding up, rounding down, or a mix of the two based on a rule.

Rounding to the nearest integer shows this fairly clearly:

x = 3.14
ceil(x), floor(x), round(x)
(4.0,3.0,3.0)

The same would be true of $3.1$; the later digits don't change any of these answers.

What becomes of $1.5$? The default is to round up.

x = 1.5
ceil(x), floor(x), round(x)
(2.0,1.0,2.0)

Rounding can be done for real numbers too, to a specified number of digits:

x = 1.23456
round(x,1), round(x,2), round(x,4)
(1.2,1.23,1.2346)

Rounding in floating point

When converting from decimal to floating point, even simple decimal numbers get rounded!

convert(Float16, 0.1)
float16(0.099976)
seebits(0.1)
('0',"01011","1001100110")

"1001100110" becomes:

q = (1 + 1/2 + 1/16 + 1/32 + 1/256 + 1/512 )
1.599609375

And 01011 for $m$ becomes

m = 1 + 2 + 8 - 15
-4
q * 2.0^m
0.0999755859375

Notice the number $0.1$ is necessarily approximated.

Error bound

Let $x$ be the number and $\tilde{x}$ be its rounded value. How big is the difference when $x$ is rounded to $n$ decimal places?

We consider the digit $i$ in the $(n+1)$st decimal place.

- If $i$ is one of 0,1,2,3,4 then we round down, and the error is about $i \cdot 10^{-(n+1)}$.

- If $i$ is one of 5,6,7,8,9 then we round up, and the error is about $(10-i) \cdot 10^{-(n+1)}$.

In either case, it is at most – in absolute value – $(10/2) \cdot 10^{-(n+1)} = (1/2) \cdot 10^{-n}$.

The error in rounding to $n$ decimal places is bounded by $|x - \tilde{x}| \leq (1/2) \cdot 10^{-n}$.

Had we chopped (floor) or always rounded up (ceil) then the error is bounded by $10^{-n}$.

Ulp

There is an alternate name for this error unit. When rounding to a certain number of digits, there is a unit in the last place, an ulp. In the above, the unit in the last place was $10^{-n}$, so the error is at most $1/2$ ulp.

ulps are easy enough to understand. If we round $3.1415$ to $3.14$ then the error is $0.15$ ulp.

How far off can rounding be?

In binary floating point with precision $p$, the unit in the last place is $2^{-p}$, so the error in rounding is at most $(1/2) \cdot 2^{-p}$, or half an ulp.
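
We can check this bound for $0.1$ in Float16, assuming eps accepts a Float16 value and returns the gap (one ulp) at that value:

x = convert(Float16, 0.1)
abs(0.1 - x) <= eps(x)/2
true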

Float16, Float32, Float64, ...

16-bit floating point is not typical. What is common is 32-bit (Float32) and 64-bit (Float64) floating point.

For how the bits are arranged in the IEEE 754 binary formats we have this table (see also this post by John Cook for the common case):

format   sign  exponent  significand (stored)
Float16  1     5         10
Float32  1     8         23
Float64  1     11        52

For example

b = bits(2^2 + 2^0 + 1/2 + 1/8) ## 101.101 = 1.01101 * 2^2
b[1], b[2:12], b[13:end]
('0',"10000000001","0110100000000000000000000000000000000000000000000000")

Here $m = (2^{10} + 1) - (2^{10} - 1) = 2$ and we can see that $q = 1.01101$, with the leading $1$ implicit.

Machine numbers

The numbers that can be represented exactly in floating point are called machine numbers.

There aren't very many compared to the infinitely many real numbers.

Let's visualize them in a hypothetical Float8 type with 1 sign bit, 3 exponent bits, and 4 bits for the significand.

The possible positive values of the significand $q$ are

qs = [1 + i/2 + j/4 + k/8 + l/16 for i in 0:1, j in 0:1, k in 0:1, l in 0:1] |> vec |> sort
16-element Array{Float64,1}:
 1.0
 1.0625
 1.125
 1.1875
 1.25
 1.3125
 1.375
 1.4375
 1.5
 1.5625
 1.625
 1.6875
 1.75
 1.8125
 1.875
 1.9375

The values for the exponents are $-3, -2, -1, 0, 1, 2, 3$. So all our values are given by

ms = (-3):3
vals = [q * 2.0^m for q in qs, m in ms] |> vec
112-element Array{Any,1}:
  0.125
  0.132813
  0.140625
  0.148438
  0.15625
  0.164063
  0.171875
  0.179688
  0.1875
  0.195313
  0.203125
  0.210938
  0.21875
  ⋮
 10.0
 10.5
 11.0
 11.5
 12.0
 12.5
 13.0
 13.5
 14.0
 14.5
 15.0
 15.5
Plotting the machine numbers

We can plot these points:

using Gadfly
plot(x = vals, y = 0*vals, Geom.point)
(The plot shows the machine numbers as points along the real line.)

We notice:

- they are definitely finite
- there are definitely gaps
- they are not evenly spaced out
- there is a "hole" near 0
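
We can compute the gaps directly from vals: the spacing is $2^{-7}$ near the smallest values but $1/2$ near the largest:

svals = sort(vals)
svals[2] - svals[1], svals[end] - svals[end-1]
(0.0078125,0.5)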

Machine precision

Of special note is the size of the gap between values around 1:

nextfloat(1.0), nextfloat(1.0) - 1.0
(1.0000000000000002,2.220446049250313e-16)

This is sometimes called machine precision and in Julia is returned by eps():

eps()
2.220446049250313e-16
eps(Float16)
float16(0.00097656)
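
The gap scales with the size of the number; eps(x) gives the gap at x, which is roughly abs(x) times eps():

eps(1000.0), 1000.0 * eps()
(1.1368683772161603e-13,2.220446049250313e-13)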