The Intermediate Value Theorem: Let $f(x)$ be a continuous function on $[a,b]$ with the property that $f(a)$ and $f(b)$ have different signs (that is, $[a,b]$ is a bracket). Then there exists a value $c$ in $[a,b]$ with $f(c) = 0$.
This gives us a guarantee that under some assumptions the equation $f(x) = 0$ has a solution in the interval.
But it does not tell us how to find such a value $c$.
Consider the following algorithm starting with $[a,b]$ and $f$ as in the IVT and, say, $f(a) < 0$.
- Let $c = (a+b)/2$ be the midpoint of the interval
- If $f(c) == 0$ we are done
- If $f(c) < 0$ then our new interval is $[c,b]$
- If $f(c) > 0$ our new interval is $[a,c]$
- Repeat with the new interval.
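For instance (an illustration, not from the original notes), take $f(x) = x^2 - 2$ on $[1,2]$, so $f(1) = -1 < 0 < 2 = f(2)$: the first midpoint is $c = 3/2$ with $f(3/2) = 1/4 > 0$, so the new interval is $[1, 3/2]$; the next midpoint is $c = 5/4$ with $f(5/4) = -7/16 < 0$, so the interval becomes $[5/4, 3/2]$. Each step halves the bracket around $\sqrt{2}$.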
What does this algorithm produce? At each step we get a value of $c$. To distinguish, call them $c_i$. Then we have two possibilities:
- $f(c_i) = 0$ for some $i$ and we are done
- $f(c_i) \neq 0$ for all $i$.
In the latter case, what to do?
Let $c = \lim_i c_i$. This limit exists as the sequence is clearly Cauchy ($|c_n - c_{n+m}| \leq C 2^{-n}$, with $C = b_0 - a_0$). Then:
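To spell out the Cauchy estimate: both $c_n$ and $c_{n+m}$ lie in the $n$-th interval $[a_n, b_n]$ (all later intervals are nested inside it), so

$$
|c_n - c_{n+m}| \leq b_n - a_n = \frac{b_0 - a_0}{2^n} = C \, 2^{-n}.
$$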
- $c$ exists!
- $f(c) = 0$.
Why? Split the values $c_i$ into those that have $f(c_i) < 0$ and those that have $f(c_i) > 0$. Call these sequences $l_i$ and $u_i$.
If either sequence is infinite, it must converge to $c$ (being a subsequence of the $c_i$). If both are infinite, then by continuity one limit of the $f$ values is less than or equal to $0$ and the other is greater than or equal to $0$; both equal $f(c)$, so $f(c)$ must be $0$.
If one of the sequences is finite, say it terminates at index $N$, then the $c_i$ with $i \geq N$ all have $f(c_i)$ of the same sign, so in the choice of which endpoint of the interval to replace, the same side is always chosen. Say $f(a) > 0$ and $f(c_i) < 0$ for all $i > N$: then $a$ is never replaced, and since the interval lengths shrink to $0$, $c_i$ must converge to $a$. By continuity $f(c_i)$ converges to $f(a)$, which is positive, yet each $f(c_i)$ is negative. So this can't happen.
The above is true mathematically, but can it be true in floating point?
- The $c_i$ produced in any algorithm are necessarily finite in number, so they can't take on arbitrary values, only machine numbers. Since machine numbers are rational numbers in binary, numbers like $\sqrt{2}$ can not be represented exactly, so hitting the exact value of $c$ is not guaranteed!
- Even the simple statement $c = (a+b)/2$ has problems! The sum $x+y$ can overflow, and with some rounding modes it isn't even guaranteed that $fl((x+y)/2)$ lies in $[x,y]$!
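To illustrate the first point, here is a quick check (not from the original notes): the machine number closest to $\sqrt{2}$ does not square back exactly to $2$.

```julia
c = sqrt(2)          # the machine number closest to √2
c^2 == 2, c^2 - 2    # (false, roughly 4.4e-16) on IEEE doubles: √2 itself is not representable
```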
Use thresholds:
Stop the algorithm if
- $|f(c_i)| < \epsilon$
- $|b_i - a_i| < \delta$.
```julia
function bisectme(f, a, b)
    delta, epsilon = 4eps(), 1e-12        # tolerances: interval length and |f(c)|
    @assert f(a) * f(b) < 0               # check bracket
    c = (a + b)/2
    while abs(f(c)) >= epsilon && (b - a) >= delta
        c = (a + b) / 2
        f(c) == 0.0 && return(c)
        f(c) * f(a) < 0 ? (b = c) : (a = c)
    end
    return(c)
end
```
bisectme (generic function with 1 method)
Solve $f(x) = e^x - \sin(x) = 0$ in the interval $[-4, -3]$:
```julia
f(x) = exp(x) - sin(x)
xstar = bisectme(f, -4, -3)
xstar, f(xstar)
```
(-3.1830630119329726,4.0688286073731206e-13)
Theoretically, at each step we have $a_i < c_i < b_i$. Let $c$ be a solution. Then how big is $|c - c_i|$?
$$ |c - c_i| \leq \frac{1}{2} (b_i - a_i) $$
But:
$$ b_i - a_i = \frac{1}{2}(b_{i-1} - a_{i-1}) = \left(\frac{1}{2}\right)^2 (b_{i-2} - a_{i-2}) = \cdots = \frac{1}{2^{i}}(b_0 - a_0). $$
So,
$$ |c - c_i| \leq \frac{1}{2^{i+1}} (b_0 - a_0). $$
If we have $a_0, b_0 = -4, -3$, how many steps to get $|c - c_i| < 10^{-15}$?
We'd need to solve for the smallest $i$ so that:
$$ \frac{1}{2^{i+1}} (b_0 - a_0) = \frac{1}{2^{i+1}} < 10^{-15} $$
```julia
ceil(log2(1e-15) - 1)
```
-50.0
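As a sanity check on the arithmetic (this snippet is not in the original notes): $2^{-50}$ is below $10^{-15}$ while $2^{-49}$ is not, so roughly fifty halvings of the original width of $1$ are needed.

```julia
2.0^-50 < 1e-15, 2.0^-49 < 1e-15    # (true, false)
```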
Mathematically, we can always subdivide $[a,b]$ via $c=(a+b)/2$. Not so with the computer.
We said that using `c = (a+b)/2` has issues (overflow, loss of low bits, magnified errors, ...).

We can avoid some of that with `c = a + (b-a)/2`.

Or we could have tried `c = a + b/2 - a/2`.
These aren't all the same:
```julia
a, b = prevfloat(0.0), nextfloat(0.0)
(a + b)/2, a + (b-a)/2, a + b/2 - a/2
```
(0.0,0.0,-5.0e-324)
```julia
b = prevfloat(Inf)
a = prevfloat(prevfloat(b))
(a + b)/2, a + (b-a)/2, a + b/2 - a/2
```
(Inf,1.7976931348623155e308,Inf)
The problem is studied in this paper.
Without care, we can't guarantee even basic properties of the computed midpoint $m(I)$, such as lying strictly inside the interval $I$. That is, we can't be sure that a construct like this will stop appropriately:
```julia
c = (a + b)/2
if a < c < b
    ....
end
```
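For a concrete case (my own quick check, not part of the original notes): when the endpoints are adjacent machine numbers, the rounded midpoint collapses onto one of them, so the test `a < c < b` fails and a loop keyed on it stops. The worry raised above is that, with overflow or other rounding modes, `c` may not even land in $[a, b]$.

```julia
a = 1.0
b = nextfloat(a)     # a and b are adjacent machine numbers
c = (a + b)/2        # rounds back onto one of the endpoints
a < c < b            # false
```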
In Julia's `Roots` package they use something different (an approach due to Jason Merrill).
We can see the following:
```julia
x1_int = reinterpret(Uint64, abs(x1))
x2_int = reinterpret(Uint64, abs(x2))
unsigned = reinterpret(Float64, (x1_int + x2_int) >> 1)
```
In practice this converts the (positive) floating point numbers into an ordered set of integers:
```julia
as = rand(1)
for i in 1:9
    push!(as, nextfloat(as[end]))
end
map(x -> bits(reinterpret(Uint64, x)), as)
```
```
10-element Array{ASCIIString,1}:
 "0011111111101111011011100010110000111000011001011000110011000010"
 "0011111111101111011011100010110000111000011001011000110011000011"
 "0011111111101111011011100010110000111000011001011000110011000100"
 "0011111111101111011011100010110000111000011001011000110011000101"
 "0011111111101111011011100010110000111000011001011000110011000110"
 "0011111111101111011011100010110000111000011001011000110011000111"
 "0011111111101111011011100010110000111000011001011000110011001000"
 "0011111111101111011011100010110000111000011001011000110011001001"
 "0011111111101111011011100010110000111000011001011000110011001010"
 "0011111111101111011011100010110000111000011001011000110011001011"
```
Using this to find the midpoint (suitably configured to handle mixed signs) means that this process is guaranteed to converge in no more than 64 steps. This means we can skip the check on the size of $f(x_n)$ and have the loop look something like:
```julia
while a < midpoint(a, b) < b
    # do something...
end
```
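Here is a rough sketch of such a sign-aware bit midpoint (my own illustration; the function name `midpoint` and the details are assumptions, not the `Roots` package's actual code):

```julia
# Sketch: average the bit patterns of two finite Float64 values of the same sign.
# When the signs differ, 0.0 already splits the machine numbers roughly in half.
function midpoint(x::Float64, y::Float64)
    sign(x) * sign(y) < 0 && return 0.0
    xint = reinterpret(Uint64, abs(x))
    yint = reinterpret(Uint64, abs(y))
    mid  = reinterpret(Float64, (xint + yint) >> 1)  # halfway in the ordered-integer sense
    return x < 0 ? -mid : mid                        # restore the common sign
end
```

Since the integer distance between the bit patterns of `a` and `b` at least halves on each call, the loop above must terminate within 64 iterations, which is the guarantee mentioned earlier.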
What happens at each step? We must have either

- `f(c) == 0` (exactly), or
- `f(c) * f(a) < 0` or `f(c) * f(b) < 0`.

So if the answer is not exact, this condition is guaranteed to hold for the returned value of `c`: it must be that `f(prevfloat(c)) * f(nextfloat(c)) <= 0` (even without the IVT).
```julia
using Roots
f(x) = sin(x)
c = fzero(f, [3, 4])
c, f(c), sign(f(prevfloat(c))) * sign(f(nextfloat(c)))
```
(3.141592653589793,1.2246467991473532e-16,-1.0)
What about $f(x) = 1/x$? It has no zero, but it does have a zero crossing at $0$ which will get picked up:
```julia
f(x) = 1/x
c = fzero(f, [-1, 1])
c, f(c), sign(f(prevfloat(c))) * sign(f(nextfloat(c)))
```
(0.0,Inf,-1.0)