In mathematics we have different types of numbers: Integers, Rationals, Reals, Complex, ...
For a minute, let's focus on integers.
Mathematically there are infinitely many integers, but only finitely many are representable on the computer.
On the computer there may be more than one storage type for each mathematical type. For example, Julia has 12 built-in types for integers (signed and unsigned, in different storage sizes).
`Int64` is typical. `Int8` has only $256=2^8$ possible values, ranging from $-2^7$ to $2^7-1$.
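We can check this range with `typemin` and `typemax`, which return the extreme values of a type:

typemin(Int8), typemax(Int8)   ## (-128, 127)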
Internally, the values are kept in binary:
convert(Int8, 5) |> bits
"00000101"
The leading $0$ indicates the sign.
Binary numbers use powers of $2$, not $10$. So – as you likely know – $1101$ in binary is just $1\cdot 2^0 + 0\cdot 2^1 + 1 \cdot 2^2 + 1\cdot 2^3 = 13$.
The largest value in `Int8` would then be $1 + 2 + 4 + 8 + 16 + 32 + 64 = 127$.
It is easy to convert binary to decimal: we just mentally use powers of 2 instead of 10.
Here is some code:
x = "1101101100" out = [ parseint(x[length(x) - i])* 2^i for i in 0:(length(x) - 1)]
10-element Array{Any,1}: 0 0 4 8 0 32 64 0 256 512
and we just add
sum(out)
876
Start with an integer $n$; we generate the value of `x` from left to right. Suppose $n \geq 2$.
First find $p$ with $2^p \leq n < 2^{p+1}$. The number will have $p+1$ binary digits, and the leftmost one will be $1$.
Then replace $n$ with $n - 2^p$ and apply the same step to it. Repeat. (We collect these steps into a function after the worked example below.)
n = 27
log2(n)   # floor is 4. So x = 1XXXX
4.754887502163468
n = n - 2^4
log2(n)   # floor is 3. So x = 11XXX
3.4594316186372973
n = n - 2^3
log2(n)   # floor is 1. So x = 1101X
1.584962500721156
n = n - 2^1 ## smaller than 2, so clearly x = 11011
1
and check:
x = "11011" sum([parseint(x[length(x) - i])* 2^i for i in 0:(length(x) - 1)])
27
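Collecting the steps of the algorithm into a function gives a minimal sketch (the name `to_binary` is ours, not a built-in):

function to_binary(n)
    ## find p with 2^p <= n < 2^(p+1)
    p = 0
    while 2^(p + 1) <= n
        p = p + 1
    end
    ## pull off each power of 2 from left to right
    out = ""
    for i in p:-1:0
        if n >= 2^i
            out = out * "1"
            n = n - 2^i
        else
            out = out * "0"
        end
    end
    out
end

to_binary(27)   ## "11011"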
Adding two binary numbers follows the usual algorithm with carrying:
 1101
+1011
-----
11000
$(1 + 4 + 8) + (1 + 2 + 8) = 13 + 11 = 24 = 16 + 8$
In decimal, multiplying or dividing by 10 is easy: add a $0$ or shift the decimal point. Similarly with binary:
  10101
x    10
-------
  00000
 10101
-------
 101010
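This shift is also available directly as the bitwise left-shift operator `<<`; a quick check:

x = 21                   ## binary 10101
x * 2 == x << 1 == 42    ## true: multiplying by 2 shifts left one place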
How are negative numbers stored?
bits(convert(Int8, 5)), bits(convert(Int8, -5))
("00000101","11111011")
They are quite different!
The storage uses two's complement. We have $n$ bits ($n=8$ in our example; $64$ is typical, or $32$ on older machines). Positive numbers use only the first $n-1$ bits, which is why the largest number is $2^{n-2} + 2^{n-3} + \cdots + 2 + 1 = 2^{n-1}-1$.
The leading bit could simply be a sign bit, but instead negative numbers are stored in a different format: $-x$ (for $x>0$) is stored as $2^n-x$.
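Indeed, reinterpreting the bit pattern of the `Int8` value $-5$ as an unsigned byte gives $2^8 - 5 = 251$:

reinterpret(Uint8, convert(Int8, -5))   ## 0xfb, which is 251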
For example, $5 + (-5)$:

11111111   (carry)
00000101   (5 = 1*1 + 0*2 + 1*4)
11111011   (2^8 - 5 = 251 = 1 + 2 + 8 + 16 + 32 + 64 + 128)
---------
100000000

The leading $1$ falls off the end of the 8 bits, leaving $00000000$, that is, $0$.
This makes addition easy, as illustrated above. Or here, with $15 + (-5)$:
11111111   (carry)
00001111   (15 = 1*1 + 1*2 + 1*4 + 1*8)
11111011   (2^8 - 5 = 251 = 1 + 2 + 8 + 16 + 32 + 64 + 128)
---------
100001010

Dropping the carried ninth bit leaves $00001010 = 2 + 8 = 10$.
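We can confirm the hardware does exactly this arithmetic:

a, b = convert(Int8, 15), convert(Int8, -5)
a + b, bits(a + b)   ## (10, "00001010")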
Let's see what happens with powers of 2:
bits(2)
"0000000000000000000000000000000000000000000000000000000000000010"
bits(2 * 2)
"0000000000000000000000000000000000000000000000000000000000000100"
bits(2 * 2^2)
"0000000000000000000000000000000000000000000000000000000000001000"
and ...
bits(2^62)
"0100000000000000000000000000000000000000000000000000000000000000"
and we go over the edge...
bits(2 * 2^62)
"1000000000000000000000000000000000000000000000000000000000000000"
The largest value is
bits(typemax(Int64))
"0111111111111111111111111111111111111111111111111111111111111111"
This is $2^{63} - 1$.
Integers are stored exactly, when possible. But that has limitations: with 64-bit numbers, the largest integer is $2^{63}-1$.
2^63 - 1
9223372036854775807
but not `2^64`:
2^64
0
What happened?
bits(2)
"0000000000000000000000000000000000000000000000000000000000000010"
bits(2*2)
"0000000000000000000000000000000000000000000000000000000000000100"
Multiplying by $2$ shifts things left one place:
bits(2^62)
"0100000000000000000000000000000000000000000000000000000000000000"
bits(2^62 + 2^61)
"0110000000000000000000000000000000000000000000000000000000000000"
bits(2^62 * 2)
"1000000000000000000000000000000000000000000000000000000000000000"
One more doubling shifts the single $1$ bit off the left end, leaving all zeros; this is why `2^64` returned $0$. But this is still correct:
bits(2^63 - 1)
"0111111111111111111111111111111111111111111111111111111111111111"
On a calculator there is one basic number type: floating point. This is primarily the case with computers as well.
In mathematics we primarily work with decimal numbers
$$ 12.34 = 1 \cdot 10^1 + 2 \cdot 10^0 + 3\cdot 10^{-1} + 4 \cdot 10^{-2} $$
We can write a real number in terms of powers of 10. For example:
$$ 1.234 = 1.234 \cdot 10^0 = 12.34 \cdot 10^{-1} = .1234 \cdot 10^1 = \cdots $$
We can use normalized scientific notation, expressing $x$ by three quantities: a sign, a significand $r$ with $1 \leq r < 10$, and an integer exponent $n$:
$$ x = \pm r \cdot 10^n $$
We can use different bases in scientific notation. A number would be represented with
$$ x = \pm q \cdot \beta^m $$
We can normalize the number by insisting $q=d.ddddd...$, where the leading digit of $q$ is between $1$ and $\beta-1$.
A special case would be $\beta =2$ or base 2, which forces the leading term to be 1.
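Julia can produce such a pair for us: `frexp` returns a significand and exponent, though it normalizes the significand to $[1/2, 1)$ rather than $[1, 2)$:

q, m = frexp(12.1)   ## (0.75625, 4), as 12.1 == 0.75625 * 2^4
q * 2.0^m            ## 12.1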
We can convert decimal numbers to binary. The same simple algorithm for integers works with some modification. Start with $x$. We want to produce digits $ddd.ddd...$ where $d$ are either $0$ or $1$.
First, take the log base 2:
x = 12.1
log2(x) |> floor
3.0
This says $2^3 \leq x < 2^4$. Remember $3$, then subtract $2^3$ and repeat:
ds = [3]
x = x - 2.0^ds[end]
n = log2(x) |> floor
2.0
Remember 2 and then subtract $2^2$ and repeat:
push!(ds, n)
x = x - 2.0^ds[end]
n = log2(x) |> floor
-4.0
Remember $-4$ and repeat
push!(ds, n)
x = x - 2.0^ds[end]
n = log2(x) |> floor
-5.0
etc.
The `ds` will tell us there is a $1$ in positions $3, 2, -4, -5, -8, -9, \dots$ So $x = 1100.000110011\dots$ in binary.
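Collecting the steps above into a loop (a sketch, stopping after the first 8 digits):

x = 12.1
ds = Float64[]
for k in 1:8
    n = log2(x) |> floor   ## position of the leading 1
    push!(ds, n)
    x = x - 2.0^n          ## strip it off and continue
end
ds                         ## [3.0, 2.0, -4.0, -5.0, -8.0, -9.0, -12.0, -13.0]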
One thing to note: the numbers that terminate in decimal are generally different from the numbers that terminate in binary. ($12.1$ terminates in decimal, but not in binary.)
Floating point is a representation of numbers using scientific notation, only with constraints on the sizes: the precision $p$ (the number of digits in $q$) and the allowed exponent range $e_{min} \leq m \leq e_{max}$.
For example, consider base $\beta=10$, $p=3$, $e_{min}=-1$, and $e_{max}=2$. Then the possible values range in magnitude from $1.00 \cdot 10^{-1}$ up to $9.99 \cdot 10^2$. How many are there?
$$ 2 \cdot ((\beta-1)\cdot\beta^{p-1}) \cdot (e_{max} - e_{min} + 1) $$
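For the example above ($\beta=10$, $p=3$, $e_{min}=-1$, $e_{max}=2$) this counts:

2 * ((10 - 1) * 10^(3 - 1)) * (2 - (-1) + 1)   ## 7200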
For binary floating point, things are similar. For simplicity, let's look at 16-bit floating point, where

- 1 bit is the sign bit: `0` is $+1$, `1` is $-1$
- $q$ is represented with $10$ bits (the precision is 10)
- $m$ is represented with $5$ bits.

There is nothing to represent the sign of $m$. The trick is to offset the stored value: the exponent used is $m - 15$.
With this, we can represent some numbers.
What is $1$? It is $+1 \cdot 1.0 \cdot 2^{15 - 15}$. So we should have
- sign is 0
- $q$ is 0000000000
- $m$ is 01111
Checking we have
convert(Float16, 1.0) |> bits
"0011110000000000"
That's kind of hard to read. Let's wrap this in a function:
function seebits(x)
    b = bits(convert(Float16, x))
    b[1], b[2:6], b[7:end]
end
seebits (generic function with 1 method)
seebits(1)
('0',"01111","0000000000")
We have $2 = 1.0 \cdot 2^1$. So we expect $q$ to represent $0$ and $m$ to represent $16$, as $16-15 = 1$:
seebits(2)
('0',"10000","0000000000")
What about the sign bit?
seebits(-2)
('1',"10000","0000000000")
What about other numbers?
seebits(1 + 1/2 + 1/4 + 1/8 + 0/16 + 1/32)
('0',"01111","1110100000")
seebits(2^4 * (1 + 1/2 + 1/4 + 1/8 + 0/16 + 1/32))   ## exponent code 10011 is 1 + 2 + 16 = 19, and 19 - 15 = 4
('0',"10011","1110100000")
Wait a minute – if we insist on the significand being $1.dddd...$ we can't represent $0$!
Some values in floating point are special:
bits(0.0)
"0000000000000000000000000000000000000000000000000000000000000000"
(The code uses the smallest possible exponent, and $0$ for the significand)
bits(-0.0) ## why??
"1000000000000000000000000000000000000000000000000000000000000000"
Infinity. Why?
This value is deemed valuable to have supported at the hardware level. It is coded by reserving the largest possible value of $m$ and $0$ for the significand.
bits(Inf) # see bits(Inf)[2:12]
"0111111111110000000000000000000000000000000000000000000000000000"
There is room for $-\infty$ and it too is defined:
bits(-Inf)
"1111111111110000000000000000000000000000000000000000000000000000"
NaN. This is a special value reserved for computations where no value is possible. Examples include `0/0` or `0 * Inf`:
0/0, 0 * Inf
(NaN,NaN)
These are related to limit problems (indeterminate), though not all forms are indeterminate:
1/0, 0^0
(Inf,1)
How is `NaN` coded?
bits(NaN)
"0111111111111000000000000000000000000000000000000000000000000000"
This is very similar to `Inf`, but the value of $q$ is non-zero!
bits(NaN)[13:end], bits(Inf)[13:end]
("1000000000000000000000000000000000000000000000000000","0000000000000000000000000000000000000000000000000000")
`NaN` semantics are a bit controversial.
The values of `Inf` and `NaN` will "poison" subsequent operations, for example
NaN + 1, Inf - 1
(NaN,Inf)
These special values are generated instead of errors being thrown for some common cases. (Whether something like `sqrt(-1.0)` is an error or `NaN` is not specified.)
What is the range of the numbers that can be represented? Let's check with Float16.
The largest positive value would have $m$ coded with `11110`, or $2 + 4 + 8 + 16 - 15 = 15$. (The value `11111` is saved for `Inf` and `NaN`.)
The largest value for $q$ would be `1111111111`, or
sum([1/2^i for i in 0:10])
1.9990234375
Altogether we have:
sum([1/2^i for i in 0:10]) * 2^15
65504.0
Is this right?
prevfloat(typemax(Float16))
float16(65504.0)
For the smallest positive number, the smallest exponent code is `00000`, or $0 - 15 = -15$. So the value should be:
1/2^15
3.0517578125e-5
But this isn't actually the case:
nextfloat(convert(Float16, 0))
float16(5.9605e-8)
(There are tricks to get even smaller numbers, called subnormal numbers.)
For double precision numbers (Float64) the values are given by:
prevfloat(Inf), nextfloat(0.0)
(1.7976931348623157e308,5.0e-324)
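These agree with the formulas: the largest finite value is $(2 - 2^{-52}) \cdot 2^{1023}$ and the smallest positive (subnormal) value is $2^{-1074}$:

(2 - 2.0^-52) * 2.0^1023   ## 1.7976931348623157e308
2.0^-1074                  ## 5.0e-324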
We are familiar with the task of rounding: with rounding, a passing grade is a 59.5, not a 60!
We have three types of rounding: rounding up, rounding down, or a mix of the two based on a rule.
Rounding to the nearest integer shows this fairly clearly:
x = 3.14
ceil(x), floor(x), round(x)
(4.0,3.0,3.0)
The same would be true of $3.1$, as that is all that is looked at here.
What becomes of $1.5$? The default is to round up.
x = 1.5
ceil(x), floor(x), round(x)
(2.0,1.0,2.0)
Rounding can be done for real numbers too to some specified number of digits:
x = 1.23456
round(x, 1), round(x, 2), round(x, 4)
(1.2,1.23,1.2346)
When converting from decimal to floating point, even simple decimal numbers get rounded!
convert(Float16, 0.1)
float16(0.099976)
seebits(0.1)
('0',"01011","1001100110")
"1001100110" becomes:
q = (1 + 1/2 + 1/16 + 1/32 + 1/256 + 1/512 )
1.599609375
And `01011` for $m$ becomes
m = 2.0^(1 + 2 + 8 - 15)
0.0625
1 * q * m   ## sign * significand * 2^exponent
0.0999755859375
Notice the number $0.1$ is necessarily approximated.
Let $x$ be the number and $\tilde{x}$ be its rounded value. How big is the difference when $x$ is rounded to $n$ decimal places?
We consider the value $i$ of the $(n+1)$st decimal digit.
- If $i$ is one of $0, 1, 2, 3, 4$, then we round down and the error is $i \cdot 10^{-(n+1)}$.
- If $i$ is one of $5, 6, 7, 8, 9$, then we round up, and the error is $(10-i) \cdot 10^{-(n+1)}$.
In either case, it is less, in absolute value, than $(10/2) \cdot 10^{-(n+1)} = 1/2 \cdot 10^{-n}$.
The error in rounding to $n$ decimal points is bounded by: $|x - \tilde{x}| < 1/2 \cdot 10^{-n}$.
Had we chopped (`floor`) or always rounded up (`ceil`), then the error is bounded by $10^{-n}$.
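A quick empirical check of these bounds for $n = 2$, using random points:

xs = rand(10)
maximum([abs(x - round(x, 2)) for x in xs]) <= 1/2 * 10.0^-2     ## true
maximum([abs(x - floor(x * 100) / 100) for x in xs]) <= 10.0^-2  ## true (chopping)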
There is an alternate name for the error. When rounding to a certain number of digits, there is a unit in the last place, or ulp. In the above, the unit in the last place was $10^{-n}$, so the error is less than $1/2$ ulp.
Ulps are easy enough to understand. If we round $3.1415$ to $3.14$, then the error is $0.15$ ulp.
In binary floating point, the unit in the last place is $2^{-p}$, so the error in rounding is at most $(1/2) \cdot 2^{-p}$, or half an ulp.
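For instance, the $3.1415 \to 3.14$ example, measured in ulps:

(3.1415 - 3.14) / 10.0^-2   ## approximately 0.15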
16-bit floating point is not typical. What is common is 32-bit (`Float32`: 8 bits for the exponent, 23 for the significand) and 64-bit (`Float64`: 11 bits for the exponent, 52 for the significand). How the bits are arranged in the IEEE 754 binary formats is given in this table; see also this post by John Cook for the common case.
For example
b = bits(2^2 + 2^0 + 1/2 + 1/8)   ## 101.101 = 1.01101 * 2^2
b[1], b[2:12], b[13:end]
('0',"10000000001","0110100000000000000000000000000000000000000000000000")
Here $m = (2^{10} + 1) - (2^{10} - 1) = 2$, and we can see that $q=1.01101$, with the first $1$ implicit.
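We can decode the exponent field directly:

parseint(b[2:12], 2) - (2^10 - 1)   ## 2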
The numbers that can be represented exactly in floating point are called machine numbers.
Let's visualize with a hypothetical `Float8` format with 1 sign bit, 3 exponent bits, and 4 bits for the mantissa.
The possible positive values are
qs = [1 + i/2 + j/4 + k/8 + l/16 for i in 0:1, j in 0:1, k in 0:1, l in 0:1] |> vec |> sort
16-element Array{Float64,1}: 1.0 1.0625 1.125 1.1875 1.25 1.3125 1.375 1.4375 1.5 1.5625 1.625 1.6875 1.75 1.8125 1.875 1.9375
The values for the exponents are $-3, -2, -1, 0, 1, 2, 3$. So all our values are given by
ms = (-3):3
vals = [q * 2.0^m for q in qs, m in ms] |> vec
112-element Array{Any,1}: 0.125 0.132813 0.140625 0.148438 0.15625 0.164063 0.171875 0.179688 0.1875 0.195313 0.203125 0.210938 0.21875 ⋮ 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
We can plot these points:
using Gadfly
plot(x = vals, y = 0 * vals, Geom.point)
We notice the values are not evenly spaced: they are denser near $0$, and the gap between successive values doubles with each power of $2$.
Of special note is the size of the gap between values around 1:
nextfloat(1.0), nextfloat(1.0) - 1.0
(1.0000000000000002,2.220446049250313e-16)
This is sometimes called machine precision, and in Julia it is returned by `eps()`:
eps()
2.220446049250313e-16
eps(Float16)
float16(0.00097656)
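In both cases `eps` is just $2^{-p}$, with $p$ the number of significand bits:

eps() == 2.0^-52, eps(Float16) == 2.0^-10   ## (true, true)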