Chapter 1 covered mathematical theorems from calculus and some definitions:
An example of a $C^1$ function is $F(x) = \int_a^x f(u)\, du$ when $f$ is continuous. This example can be re-integrated to get examples of $C^k$ functions.
- The intermediate value theorem
- The extreme value theorem
- The mean value theorem. (We formulated this in an extended sense due to Cauchy.)
Taylor's theorem: Let $f$ be a $C^{k+1}$ function on $[a,b]$. Let $a < c < b$ and let $x$ be in $(a,b)$. Then there exists a $\xi$ between $c$ and $x$ satisfying:
$$ f(x) = T_k(x) + \frac{f^{(k+1)}(\xi)}{(k+1)!} (x - c)^{k+1}, $$
where $T_k(x) = f(c) + f'(c)(x-c) + \cdots + \frac{f^{(k)}(c)}{k!} (x - c)^k$.
$T_k$ is the Taylor polynomial of degree $k$ for $f$ about $c$.
There are other ways to write the remainder term, and the assumptions on $f$ can be somewhat relaxed, but this form is the easiest to remember.
A less precise, but still true, form of the above is:
$$ f(x) = T_k(x) + \mathcal{O}((x-c)^{k+1}). $$
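As a concrete check in Julia (the choices of $f = \sin$, $k = 5$, and $x = 0.5$ are just for illustration):

```julia
# Degree-5 Taylor polynomial of sin about c = 0: T5(x) = x - x^3/6 + x^5/120.
T5(x) = x - x^3/6 + x^5/120
x = 0.5
abs(sin(x) - T5(x))   # ≈ 1.5e-6, within the remainder bound 0.5^7/7! ≈ 1.55e-6
```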
- There exists a $\xi$ in $[a,b]$ with $f(\xi) = 0$.
- There exists a $\xi$ in $[a,b]$ with $f'(\xi) = 0$.
- There exists a $\xi$ in $[a,b]$ with $f'(\xi) \cdot (b-a) = f(b) - f(a)$.
$$ \lim_{h \rightarrow 0} \frac{f(x+h) - f(x-h)}{2h} = f'(x). $$
What are the assumptions on $f$ used in your proof?
- $\sin(x)$ at $c=0$
- $\log(1+x)$ at $c=0$
- $1 / (1 + x)$ at $c=0$
- $\arctan(x)$ at $c=0$
- $\sin(x)$
- $e^x$
$$ \frac{f^{(k+1)}(\xi)}{(k+1)!} (x-0)^{k+1} \leq 2^{-53} $$
We needed a very large value of $k$. What if we tried this over a smaller interval, say $0 \leq x \leq 1/2$, instead? How big would $k$ need to be then?
We used $f^{(k)}(x)/k! = \pm\, 1/(k (1+x)^k)$.
Chapter 2 deals with floating point representation of real numbers. Some basic things we saw along the way:
- We saw how the non-negative integers $0, \dots, 2^n-1$ can fit in $n$ bits in a simple manner.
- We saw how the integers $-2^{n-1}, \dots, 0, \dots, 2^{n-1}-1$ can fit in $n$ bits using two's complement for the negative numbers. The advantage of this storage is fast addition and subtraction.
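A quick way to see this in Julia (using the built-in `bitstring`):

```julia
# 8-bit two's complement: negatives are stored as "flip the bits and add 1".
bitstring(Int8(5))    # "00000101"
bitstring(Int8(-5))   # "11111011"
Int8(5) + Int8(-5)    # 0 -- the same addition circuitry works for both signs
```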
The basic storage uses
- a sign bit
- $p$ bits to store the significand which is normalized to be $1.ddd\cdots d$.
- some bits to store the exponent, $e_{min} \leq m \leq e_{max}$.
and all this is put together to create the floating point numbers of the form
$$ \pm 1.ddddd\cdots d \cdot 2^m. $$
- the sign bit comes first and uses `1` for minus and `0` for plus.
- the exponent is stored as an unsigned integer ($0, \dots, 2^k - 1$) and there is an implicit bias to be subtracted. The value $000\cdots 0$ is special and used for $0.0$ (or $-0.0$) and subnormal numbers. The value $111\cdots 1$ is used for `Inf`, `-Inf`, and various types of `NaN`.
- the significand has an implicit $1$ in front, except for the special numbers `0`, `Inf`, and `NaN`.
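A sketch of how to peek at these fields for a `Float64` (which uses 1 sign bit, 11 exponent bits with bias $1023$, and $p = 52$ significand bits):

```julia
# Split the 64 bits of 1.25 = +1.01_2 * 2^0 into the three fields.
s = bitstring(1.25)
sgn, expo, frac = s[1:1], s[2:12], s[13:64]
m = parse(Int, expo; base = 2) - 1023   # subtract the bias; here m == 0
```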
Rounding a real number $x$ into this format produces $fl(x)$ with
$$ fl(x) = x(1 + \delta) $$
What is $\delta$? Some number between $-\epsilon$ and $\epsilon$. What is $\epsilon$? Good question.
We defined `eps` through $\epsilon = 1^+ - 1$, where $1^+$ is the next floating point number larger than $1$. We saw $\epsilon = 2^{-p}$.
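This is easy to check in Julia with `nextfloat`:

```julia
# Machine epsilon: the gap between 1 and the next larger Float64.
ϵ = nextfloat(1.0) - 1.0
ϵ == eps() == 2.0^-52   # true; here p = 52
```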
- We saw that if $x$ and $y$ are real numbers, the relative error of the floating point result of $x - y$ can be large if $x$ is close to $y$.
- We saw a theorem that says even if there is no rounding error, the subtraction of $y$ from $x$ can expose a loss of precision. Basically, if $x$ and $y$ agree to $p$ binary digits, then a normalizing shift of $p$ places is necessary. More concretely: if $x > y > 0$ and $1 - y/x \leq 2^{-p}$, then at least $p$ significant binary bits are lost in forming $x - y$. The sketch below illustrates the effect.
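A two-line demonstration (the value `1e-15` is an arbitrary choice):

```julia
# x and y agree in their leading ~50 bits, so the difference has few left.
x, y = 1 + 1e-15, 1.0
(x - y) / 1e-15   # ≈ 1.11: an 11% relative error exposed by the subtraction
```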
- We saw that, if possible, we should avoid big intermediate numbers, as the errors are then possibly bigger. (This is why the book suggests finding $(a+b)/2$ as $a + (b-a)/2$.)
- We saw that when possible we should cut down on the number of operations used. (One reason why Horner's method, sketched below, is preferred for polynomial evaluation.)
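A minimal sketch of Horner's method (the function name `horner` is mine):

```julia
# Evaluate c[1] + c[2]*x + ... + c[n]*x^(n-1) with n-1 multiply-adds.
function horner(coeffs, x)
    acc = coeffs[end]
    for c in reverse(coeffs[1:end-1])
        acc = muladd(acc, x, c)   # acc*x + c
    end
    acc
end

horner([-1, 5, -10, 10, -5, 1], 0.999)   # (x-1)^5 evaluated near x = 1
```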
- We saw that errors can accumulate. In particular we discussed this theorem:
If the $x_i$ are positive, the relative error in a naive summation of $\sum x_i$ is $\mathcal{O}(n\epsilon)$.
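A sketch of the accumulation, summing $10^7$ copies of $0.1$ with a `BigFloat` reference value:

```julia
n  = 10^7
xs = fill(0.1, n)
naive = foldl(+, xs)         # strict left-to-right summation
exact = n * big(0.1)         # same float value, widened to BigFloat
abs(naive - exact) / exact   # compare with the bound n * eps() ≈ 2.2e-9
```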
- evaluation of a function when the input is uncertain. That is, we evaluate $f(x+h)$ when we want to find $f(x)$. (It could be that $x + h = x(1+\delta)$, say.) For this we have
$$ \frac{f(x+h) - f(x)}{f(x)} \approx \frac{x f'(x)}{f(x)} \cdot \frac{h}{x}, $$
That is, the relative error in the output is approximately the relative error in the input times the factor $x f'(x)/f(x)$.
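A numeric check of this, using $f(x) = \sqrt{x}$ as an assumed example (its factor is $x f'(x)/f(x) = 1/2$):

```julia
f(x) = sqrt(x)
x, h = 2.0, 1e-8
lhs = (f(x + h) - f(x)) / f(x)   # actual relative change in the output
rhs = (1/2) * (h / x)            # condition factor times relative input error
(lhs, rhs)                       # both ≈ 2.5e-9
```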
- evaluation of a perturbed function (which can happen with polynomials that have rounded coefficients). For this, we write $F(x) = f(x) + \epsilon g(x)$. The example we had: if $r$ is a root of $f$ and $r+h$ is a root of $F$, what can we say about $h$? Expanding $0 = F(r+h) \approx f(r) + h f'(r) + \epsilon g(r) = h f'(r) + \epsilon g(r)$, we can see that
$$ h \approx -\epsilon g(r)/f'(r) $$
which can be large. The example in the book uses the Wilkinson polynomial and $r = 20$. (The Wilkinson polynomial actually is exactly this case, as rounding is necessary to get its coefficients into floating point.)
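A toy instance of the estimate (not the Wilkinson example; the choices $f(x) = (x-1)(x-2)$, $g(x) = x^2$, and $\epsilon = 10^{-6}$ are mine):

```julia
# f(x) = x^2 - 3x + 2 has roots 1 and 2; F(x) = f(x) + ε*x^2 perturbs them.
fp(x) = 2x - 3                  # f'(x)
g(x)  = x^2
ε, r  = 1e-6, 2.0
h_est = -ε * g(r) / fp(r)       # predicted shift ≈ -4.0e-6
# exact root of (1+ε)x^2 - 3x + 2 near r = 2, via the quadratic formula:
r_pert = (3 + sqrt(9 - 8 * (1 + ε))) / (2 * (1 + ε))
(r_pert - r, h_est)             # both ≈ -4.0e-6
```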
- If $1 = 1.00 \cdot 10^0$, what is $\epsilon$?
- What is $3.14 \cdot 10^0 - 3.15 \cdot 10^0$?
- What is $4.00 \cdot 10^0$ times $3.00 \cdot 10^1$?
- What is $\delta$ (where $fl(x \cdot y) = (x \cdot y)(1 + \delta)$) when computing $1.23 \cdot 10^4$ times $4.32 \cdot 10^1$?
- How many total numbers are representable in this form ($0$ is not one of them)?
- What is $\epsilon$?
- What is $1.11 \cdot 2^1 - 1.00 \cdot 2^0$?
- Convert the number $-1.01 \cdot 2^{-2}$ to decimal.
- Let $x=1.11 \cdot 2^0$ and $y=1.11 \cdot 2^1$. Find $\delta$ in $fl(x \cdot y) = (x \cdot y)(1 + \delta)$.
Consider the 16-bit floating point number stored as `0101000101000000`. The first bit, `0`, is the sign bit; the next five bits, `10100`, are the exponent; and the last ten bits, `0101000000`, are the significand. Can you find the number? Remember the exponent is encoded: you'll need to subtract the bias `01111` and then convert.

`E = expm1(x)` is the more precise version of $e^x - 1$ for $x$ near $0$. With it, $\sinh(x) = (e^x - e^{-x})/2$ can be computed as
$$ \frac{1}{2} \left( E + \frac{E}{E + 1} \right). $$
Can you think of why the direct approach might cause issues for some values of $x$ in that range?
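A sketch of the comparison for one small value of $x$ (the choice `x = 1e-12` is mine):

```julia
x = 1e-12
direct = (exp(x) - exp(-x)) / 2   # cancellation: both exponentials are ≈ 1
E = expm1(x)
stable = (E + E / (E + 1)) / 2    # the rearranged formula from above
(direct, stable, sinh(x))         # stable agrees with sinh; direct loses digits
```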
- $\log(x) - \log(y)$
- $x^{-3} (\sin(x) - x)$
- $\sin(x) - \tan(x)$.
What value of $k$ will ensure that the error over $[0, 1/4]$ is no more than $10^{-3}$?
$$ fl(fl(xy)\cdot z) \neq fl(x \cdot fl(yz)) $$
That is, floating point multiplication is not associative. You can verify by testing `(0.1 * 0.2) * 0.3` and `0.1 * (0.2 * 0.3)`.
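Checking in Julia:

```julia
(0.1 * 0.2) * 0.3 == 0.1 * (0.2 * 0.3)   # false
((0.1 * 0.2) * 0.3, 0.1 * (0.2 * 0.3))   # (0.006000000000000001, 0.006)
```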
$$ \lim_{n\rightarrow\infty} \frac{fl(f(x+10^{-n})) - fl(f(x))}{10^{-n}} $$
That is, if you computed the difference quotient $(f(x+h) - f(x))/h$ in floating point, would you expect smaller and smaller values of $h$ to lead to convergence? Why?
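A sketch of the experiment (the choices $f = \sin$ and $x = 1$ are mine):

```julia
f, x = sin, 1.0
for n in 1:16
    h = 10.0^-n
    println((h, (f(x + h) - f(x)) / h - cos(x)))   # difference from f'(x)
end
# the error shrinks until h ≈ 1e-8 and then rounding noise takes over
```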
$$ \log(y(x+h)) - \log(y(x)) \approx \frac{y(x+h) - y(x)}{y(x)}. $$
This chapter is about solving for zeros of a real-valued, scalar function $f(x)$.
We only managed to cover the first section, on the bisection method. This is related to the intermediate value theorem: if $f(x)$ is continuous on $[a,b]$, then for any $y$ in the interval between $f(a)$ and $f(b)$, there exists $c$ in $[a,b]$ such that $f(c) = y$.
The special case is when $f(a) \cdot f(b) < 0$ (we call $[a,b]$ a bracket): then there is a $c$ where $f(c) = 0$.
A proof follows by repeatedly bisecting the interval. Set $a_0, b_0 = a, b$, and let $c_0 = (a_0 + b_0)/2$. Then $f(c_0)$ is positive, negative, or $0$. If $0$, we can stop. If not, then either $[a_0, c_0]$ or $[c_0, b_0]$ will be a bracket. Call it $[a_1, b_1]$ and define $c_1$ as its midpoint. We repeat and get a sequence $c_0, c_1, \dots$. If this terminates, we are done. Otherwise, since it can be shown that $|c_n - c_{n+k}| \leq 2^{-n} |b_0 - a_0|$, the $c_i$ have a limit $c$. This limit will be the zero. We can't have $f(c) > 0$: the values $c_i$ where $f(c_i) < 0$ also have limit $c$, and by continuity $f(c) \leq 0$. (This is provided there is an infinite sequence of $c_i$s with $f(c_i) < 0$, which requires proof.) Similarly, we can't have $f(c) < 0$. So it must be $0$.
The point of the proof is that there is a bound on the error:
$$ |c_n - c| \leq \frac{1}{2} |b_n - a_n| \leq \frac{1}{2^{(n+1)}} |b_0 - a_0|. $$
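A minimal sketch of the method under the assumptions above ($f$ continuous on $[a,b]$ with $f(a) \cdot f(b) < 0$; the name `bisection` and the default of 53 bisections are my choices):

```julia
function bisection(f, a, b; n = 53)
    fa = f(a)
    for _ in 1:n
        c = a + (b - a) / 2      # midpoint, written as the book suggests
        fc = f(c)
        fc == 0 && return c
        if fa * fc < 0
            b = c                # [a, c] is the new bracket
        else
            a, fa = c, fc        # [c, b] is the new bracket
        end
    end
    a + (b - a) / 2
end

bisection(x -> x^2 - 2, 1, 2)    # ≈ sqrt(2)
```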
Recall the definition of the order of convergence $q$ (with asymptotic error constant $A$) for the errors $e_n = |c_n - c|$:
$$ \lim_n \frac{e_{n+1}}{e_n^q} = A $$
Using the bound above, what is the obvious guess for the order of convergence?
```julia
using Gadfly
# f(x) = (x - 1)^5, written in expanded form
f(x) = x^5 - 5x^4 + 10x^3 - 10x^2 + 5x - 1
plot(f, 0.999, 1.001)
```