# Numerical Computation: Error

**Error** is the discrepancy between a quantity and the value used to represent it in the program. A result is **accurate** if its error is small. If $\hat{x}$ is an approximation for $x$, then

- the **absolute error** is $\hat{x} - x$, and
- the **relative error** is $\frac{\hat{x} - x}{x}$.

We are usually more interested in relative error, since the relevance of an error is usually in proportion to the quantity being represented. For example, misreporting the weight of an animal by one kilogram would be much more significant if the animal were a squirrel than if it were a blue whale.

**Example**

The expression `sqrt(200.0)`, which returns the `Float64` square root of 200, yields $14.142135623730951$.

The actual decimal representation of $\sqrt{200}$ begins $14.142135623730950488\ldots$

The difference between these values, approximately $5.1 \times 10^{-16}$, is the error of the computed square root.
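
We can see this error concretely by comparing the `Float64` result with a higher-precision `BigFloat` computation; this is a quick check of our own, using Julia's built-in `big`:

```julia
# Compare the Float64 square root of 200 with a BigFloat reference value.
x̂ = sqrt(200.0)                 # Float64 approximation
x = sqrt(big(200))               # higher-precision reference value
abs_error = abs(x̂ - x)          # absolute error, on the order of 1e-16
rel_error = abs_error / abs(x)   # relative error
```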

## Sources of Numerical Error

There are a few categories of numerical error.

**Roundoff error** comes from rounding numbers to fit them into a floating point representation.

**Example**

`0.2 + 0.1` is equal to $0.30000000000000004$ in `Float64` arithmetic. The discrepancy between 0.3 and this value is roundoff error.
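
We can observe this discrepancy directly:

```julia
# Neither 0.1 nor 0.2 is exactly representable in binary floating point,
# so their Float64 sum is not exactly equal to 0.3.
s = 0.2 + 0.1
s == 0.3        # false
s - 0.3         # a tiny positive discrepancy: the roundoff error
```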

**Truncation error** comes from using approximate mathematical formulas or algorithms.

**Example**

The Maclaurin series of a function can be truncated after finitely many terms to approximate the function's values; for example, approximating $\sin x$ by $x - \frac{x^3}{6}$ incurs truncation error.

**Example**

Newton's method approximates a zero of a function $f$ by starting from an initial guess $x_0$ and iterating

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$

Under certain conditions, the iterates $x_n$ converge to a zero of $f$ as $n \to \infty$, so stopping after finitely many iterations leaves a truncation error.
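
As an illustration (this specific function is our choice, not part of the text), we can run a few Newton iterations on $f(x) = x^2 - 2$, whose positive zero is $\sqrt{2}$:

```julia
# Newton's method: iterate x ← x - f(x)/f′(x) a fixed number of times.
# Stopping after finitely many iterations leaves truncation error.
function newton(f, f′, x0; iterations=5)
    x = x0
    for _ in 1:iterations
        x -= f(x) / f′(x)
    end
    x
end

root = newton(x -> x^2 - 2, x -> 2x, 1.0)
truncation_error = abs(root - sqrt(2))   # tiny after just five iterations
```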

**Example**

We may approximate $f'(x)$ by the difference quotient $\frac{f(x+\epsilon) - f(x)}{\epsilon}$ for some small $\epsilon > 0$. The resulting error is truncation error, since it comes from cutting off the limit $\epsilon \to 0$ in the definition of the derivative.

**Statistical error** arises from using randomness in an approximation.

**Example**

We can approximate the average height of a population of 100,000 people by selecting 100 people uniformly at random and averaging their measured heights. The error associated with this approximation is an example of statistical error.
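
A quick simulation of this scenario (the height data below is synthetic, invented purely for illustration):

```julia
using Random, Statistics

Random.seed!(1234)                          # make the example reproducible
population = 150 .+ 30 .* rand(100_000)     # 100,000 synthetic heights (cm)
sample = rand(population, 100)              # 100 people chosen at random
statistical_error = mean(sample) - mean(population)
```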

**Exercise**

Discuss the error in each of the following scenarios using the terms *roundoff error*, *truncation error*, or *statistical error*.

- We use the trapezoid rule with 1000 trapezoids to approximate $\displaystyle \int_0^{10} \frac{1}{4+x^4} \mathrm{d}x$.
- We are trying to approximate $f'(5)$ for some function $f$ that we can compute, and we attempt to do so by running `(f(5 + 0.5^100) - f(5))/0.5^100`. We fail to get a reasonable answer.
- To approximate the minimum of a function $f:[0,1] \to \mathbb{R}$, we evaluate $f$ at 100 randomly selected points in $[0,1]$ and return the smallest value obtained.

*Solution.*

- The more trapezoids we use, the more accurate our answer will be. The difference between the exact answer and the value we get when we stop at 1000 trapezoids is truncation error.
- The real problem here is roundoff error. `5 + 0.5^100` gets rounded off to 5.0, so the numerator will always evaluate to 0. However, even if we used a `BigFloat` version of each of these values, there would still be truncation error in this approximation, since the expression we used was obtained by cutting off the limit in the definition of the derivative at a small but positive increment size.
- This is an example of statistical error, since the output of the algorithm depends on the randomness we use to select the points.
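
We can verify the roundoff claim directly; the function `f` below is a stand-in of our own choosing:

```julia
# The increment 0.5^100 ≈ 7.9e-31 is far below machine precision at 5.0,
# so adding it to 5 has no effect at all.
h = 0.5^100
f(x) = x^2                      # hypothetical f, with f'(5) = 10
(5 + h) == 5.0                  # true: the increment is rounded away
(f(5 + h) - f(5)) / h           # 0.0 instead of ≈ 10
```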

## Condition Number

The derivative of a single-variable function may be thought of as a measure of how the function stretches or compresses absolute error: if $\hat{x}$ is an approximation of $x$, then

$$f(\hat{x}) - f(x) \approx f'(x)(\hat{x} - x).$$

The **condition number** of a function measures how it stretches or compresses *relative* error. Just as the derivative helps us understand how small changes in input transform to small changes in output, the condition number tells us how a small relative error in the initial data of a problem affects the relative error of the solution. We will use the variable $a$ to denote a problem's initial data.

The condition number of a function is defined to be the absolute value of the ratio of the relative change in output of the function to a very small relative change in the input. The condition number of a *problem* is the condition number of the function which maps the problem's initial data to its solution.

**Definition**

If $f$ is differentiable at $a$, with $a \neq 0$ and $f(a) \neq 0$, then the condition number of $f$ at $a$ is

$$\kappa(a) = \left| \frac{a f'(a)}{f(a)} \right|.$$

**Example**

Show that the condition number of the power function $f(x) = x^n$ is the same for all nonzero values of $x$.

*Solution.* We have

$$\kappa(x) = \left| \frac{x \cdot n x^{n-1}}{x^n} \right| = |n|$$

for all $x \neq 0$.

**Example**

Show that the condition number of the function $f(x) = x - 1$ is very large for values of $x$ near 1.

*Solution.* We substitute into the formula for condition number and get

$$\kappa(x) = \left| \frac{x \cdot 1}{x - 1} \right| = \left| \frac{x}{x-1} \right|,$$

which is very large for values of $x$ near 1.

Subtracting 1 from two numbers near 1 preserves their absolute difference but can increase their relative difference enormously.
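
We can check this amplification numerically; the particular input and perturbation size below are our own choices:

```julia
# Relative error amplification by f(x) = x - 1 near x = 1.
x = 1.0001
x̂ = x * (1 + 1e-10)                        # input with relative error ≈ 1e-10
rel_in  = abs(x̂ - x) / abs(x)
rel_out = abs((x̂ - 1) - (x - 1)) / abs(x - 1)
amplification = rel_out / rel_in            # ≈ |x/(x - 1)| ≈ 10^4
```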

**Example**

If $f(x) = x - 1$ and we change the input from $x = 1.0001$ to $\hat{x} = 1.0002$, the output changes from $0.0001$ to $0.0002$: a relative change of $100\%$ in the solution. The relative change in input is only about $0.01\%$, so the relative error was amplified by a factor of roughly $10^4$, consistent with the condition number $\left|\frac{x}{x-1}\right| \approx 10^4$ at $x = 1.0001$.

**Exercise**

Consider a differentiable function $f$ and a small perturbation of its input from $a$ to $a + \Delta a$. Compute the ratio of the *relative change* in the output to the relative change in the input, and show that you get approximately $\left|\frac{a f'(a)}{f(a)}\right|$ when $\Delta a$ is small.

*Solution.* The relative change in output is

$$\frac{f(a + \Delta a) - f(a)}{f(a)} \approx \frac{f'(a)\,\Delta a}{f(a)},$$

and the relative change in input is $\frac{\Delta a}{a}$. Taking the absolute value of their ratio gives

$$\left| \frac{f'(a)\,\Delta a / f(a)}{\Delta a / a} \right| = \left| \frac{a f'(a)}{f(a)} \right|,$$

as desired.

More generally, if the initial data is in $\mathbb{R}^n$ and the solution is in $\mathbb{R}^m$, then the condition number of a differentiable map $f$ at $\mathbf{a}$ is defined to be

$$\kappa(\mathbf{a}) = \frac{\left\|\frac{\partial f}{\partial \mathbf{a}}\right\| \, |\mathbf{a}|}{|f(\mathbf{a})|},$$

where $\left\|\cdot\right\|$ denotes the operator norm (the maximum factor by which the linear map stretches the length of a vector).

### Well-conditioned and ill-conditioned problems

If the condition number of a problem is very large, then small errors in the problem data lead to large changes in the result. A problem with large condition number is said to be **ill-conditioned**. Unless the initial data can be specified with correspondingly high precision, it will not be possible to solve the problem meaningfully.

**Example** Consider the following matrix equation for $x$ and $y$:

$$\begin{bmatrix} a & 3 \\ 6 & 9 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}$$

Find the values of $a$ for which solving this equation for $x$ and $y$ is ill-conditioned.

*Solution.* If $a = 2$, then the second row of the matrix is a multiple of the first, and the system has no solution. For values of $a$ near 2, the matrix is nearly singular, so we expect the problem to be ill-conditioned.

Using the formula for the inverse of a $2\times 2$ matrix, the solution is $\frac{1}{9a - 18}\begin{bmatrix} 9 & -3 \\ -6 & a \end{bmatrix}\begin{bmatrix} 4 \\ 5 \end{bmatrix}$, which changes rapidly as $a$ varies near 2.

If $a$ changes from 2.01 to 2.02, a relative change of about 0.5%, the solution changes drastically:

```julia
[2.01 3; 6 9] \ [4; 5]
```

```julia
[2.02 3; 6 9] \ [4; 5]
```
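
Julia's `cond` function (from the `LinearAlgebra` standard library) confirms that the matrix with $a = 2.01$ is ill-conditioned:

```julia
using LinearAlgebra

# The condition number of the nearly singular matrix is large, so small
# changes in the data produce large changes in the solution.
κ = cond([2.01 3; 6 9])
```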

### Machine epsilon

**Machine epsilon**, denoted $\epsilon_{\mathrm{mach}}$, is the maximum relative error associated with rounding a real number to the nearest value representable as a given floating point type. For `Float64`, this value is $\epsilon_{\mathrm{mach}} = 2^{-53} \approx 1.1 \times 10^{-16}$.

A competing convention—more widely used outside academia—defines $\epsilon_{\mathrm{mach}}$ to be the distance between 1 and the next largest representable value. For `Float64`, this distance is $2^{-52} \approx 2.2 \times 10^{-16}$, and it is the value returned by `eps()` in Julia. Since we typically introduce a relative error on the order of $\epsilon_{\mathrm{mach}}$ with every floating point arithmetic operation, $\epsilon_{\mathrm{mach}}$ serves as a benchmark for the smallest relative error we can hope to achieve in a computation.
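
In Julia, `eps` reports the gap convention:

```julia
# eps() is the gap between 1.0 and the next representable Float64.
eps()                          # 2.220446049250313e-16
eps() == 2.0^-52               # true
nextfloat(1.0) - 1.0 == eps()  # true
```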

### Algorithm stability

An algorithm used to solve a problem is **stable** if it is approximately as accurate as the condition number of the problem allows. In other words, an algorithm is *unstable* if the answers it produces have relative error many times larger than $\kappa \, \epsilon_{\mathrm{mach}}$.

**Example**

Consider the problem of evaluating $f(x) = \sqrt{x^2+1} - 1$ at a point $x$ near zero.

Comment on whether there are stable algorithms for evaluating $f$ near zero.

*Solution.* Substituting this function into the condition number formula, we find that

$$\kappa(x) = \left|\frac{x f'(x)}{f(x)}\right| = \frac{x^2}{\sqrt{x^2+1}\left(\sqrt{x^2+1}-1\right)},$$

which converges to 2 as $x \to 0$.

Therefore, the problem of evaluating $f$ near zero is well-conditioned, and a stable algorithm should produce relative error not much larger than $\epsilon_{\mathrm{mach}}$. Evaluating the formula $\sqrt{x^2+1} - 1$ directly in `Float64` arithmetic, however, produces a much larger relative error when $x$ is small, so that algorithm is unstable.

What's happening is that a roundoff error of approximately $\epsilon_{\mathrm{mach}}$ is introduced when $x^2 + 1$ is rounded to the nearest `Float64`. When 1 is subtracted, we still have an error of around $\epsilon_{\mathrm{mach}}$, which is enormous relative to the exact value $f(x) \approx \frac{x^2}{2}$.

There are stable algorithms for approximating $f$ near zero. For example, we can use the first terms of the Maclaurin series

$$f(x) = \frac{x^2}{2} - \frac{x^4}{8} + \cdots,$$

and approximate $f(x)$ by $\frac{x^2}{2}$ when $x$ is small. Alternatively, we can use the algebraically equivalent expression

$$f(x) = \frac{x^2}{\sqrt{x^2+1}+1},$$

which can be obtained by multiplying by $\frac{\sqrt{x^2+1}+1}{\sqrt{x^2+1}+1}$, and which involves no cancellation of nearly equal quantities.
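
Comparing the two formulas numerically for a small input (the specific value $x = 10^{-8}$ is our choice):

```julia
# Unstable: catastrophic cancellation when subtracting nearly equal numbers.
unstable(x) = sqrt(x^2 + 1) - 1
# Stable: algebraically equivalent form with no cancellation.
stable(x) = x^2 / (sqrt(x^2 + 1) + 1)

x = 1e-8
reference = stable(big(x))                           # high-precision value
rel_err_unstable = abs(unstable(x) - reference) / reference
rel_err_stable   = abs(stable(x) - reference) / reference
```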

### Matrix condition number

The condition number of an $n \times n$ invertible matrix $A$ is defined to be the maximum condition number of the function $\mathbf{x} \mapsto A\mathbf{x}$ over all nonzero inputs $\mathbf{x}$.

**Exercise**

Show that the condition number of a matrix $A$ is equal to the ratio $\frac{\sigma_{\max}}{\sigma_{\min}}$ of its largest and smallest singular values.

Interpret your results by explaining how to choose two vectors with small relative difference which are mapped to two vectors with large relative difference by $A$.

*Solution.* The derivative of the transformation $\mathbf{x} \mapsto A\mathbf{x}$ is $A$, so the condition number at $\mathbf{x}$ is

$$\kappa(\mathbf{x}) = \frac{\|A\| \, |\mathbf{x}|}{|A\mathbf{x}|}.$$

The operator norm $\|A\|$ equals $\sigma_{\max}$, and $|A\mathbf{x}| \geq \sigma_{\min}|\mathbf{x}|$, so $\kappa(\mathbf{x}) \leq \frac{\sigma_{\max}}{\sigma_{\min}}$, with equality when $\mathbf{x}$ is a right singular vector corresponding to $\sigma_{\min}$. In other words, we can choose $\mathbf{x}$ to be the right singular vector corresponding to $\sigma_{\min}$, so that $A\mathbf{x}$ is as short as possible, and perturb $\mathbf{x}$ in the direction of the right singular vector corresponding to $\sigma_{\max}$, so that the perturbation is stretched as much as possible.

**Exercise**

Find the condition number of the function $\mathbf{x} \mapsto A\mathbf{x}$, where `A = [1 2; 3 4]`, and show that there is a vector `v` and a perturbation `e` for which the ratio of relative error in output to relative error in input achieves this condition number.

```julia
using LinearAlgebra
A = [1 2; 3 4]
```

*Solution.* We choose `v` and `e` to be right singular vectors of `A` (columns of `V` in the singular value decomposition of `A`):

```julia
using LinearAlgebra
A = [1 2; 3 4]
U, S, V = svd(A)
σmax, σmin = S
κ = σmax/σmin
v = V[:,2]
e = V[:,1]
rel_error_output = norm(A*(v+e) - A*v)/norm(A*v)
rel_error_input = norm(v + e - v) / norm(v)
rel_error_output / rel_error_input, κ
```

## Hazards

Integer or floating point arithmetic can **overflow**, and may do so without warning.

```julia
2^63          # wraps around to -9223372036854775808
```

```julia
10.0^309      # overflows to Inf
```
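
Switching to arbitrary-precision types avoids both overflows:

```julia
# BigInt and BigFloat do not overflow at these magnitudes.
a = big(2)^63        # 9223372036854775808, the mathematically correct value
b = big(10.0)^309    # a finite BigFloat instead of Inf
```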

**Example**

In September 2013, NASA lost touch with the *Deep Impact* space probe because systems on board tracked time as a 32-bit signed integer number of tenth-second increments from January 1, 2000. The number of such increments reached the maximum size of a 32-bit signed integer in August of 2013.

Errors resulting from performing ill-conditioned subtractions are called *catastrophic cancellation*.

**Example**

Approximating the difference $\sqrt{10^6 + 1} - \sqrt{10^6}$ with `sqrt(10^6 + 1) - sqrt(10^6)`, we get a relative error of approximately $10^{-10}$, while using the algebraically equivalent expression `1/(sqrt(10^6 + 1) + sqrt(10^6))` gives a relative error of approximately $10^{-16}$.

**Exercise**

Use your knowledge of floating point arithmetic to explain why computing $\sqrt{10^6+1} - \sqrt{10^6}$ directly is much less accurate than computing `1/(sqrt(10^6 + 1) + sqrt(10^6))`.

*Solution.* The gaps between successive representable positive values get wider as we move to the right on the number line. Therefore, the error of the first calculation is the roundoff error associated with calculating `sqrt(10^6+1)`, which is roughly $1000\,\epsilon_{\mathrm{mach}} \approx 10^{-13}$ in absolute terms. Relative to the small difference between the square roots (about $5 \times 10^{-4}$), this error is on the order of $10^{-10}$.

The relative error in `1/(sqrt(10^6 + 1) + sqrt(10^6))`, meanwhile, is approximately the same as the relative error in the calculation of `sqrt(10^6 + 1) + sqrt(10^6)` (since the condition number of the reciprocal function is approximately 1). This relative error is only about $\epsilon_{\mathrm{mach}} \approx 10^{-16}$.
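
The two expressions can be compared against a `BigFloat` reference:

```julia
# Direct subtraction loses about six digits of accuracy; the rewritten
# form is accurate to nearly full Float64 precision.
naive     = sqrt(10^6 + 1) - sqrt(10^6)
rewritten = 1/(sqrt(10^6 + 1) + sqrt(10^6))
reference = sqrt(big(10)^6 + 1) - sqrt(big(10)^6)
rel_err_naive     = abs(naive - reference) / reference
rel_err_rewritten = abs(rewritten - reference) / reference
```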

If you rely on exact comparisons for floating point numbers, be alert to the differences between `Float64`

arithmetic and real number arithmetic:

```julia
function increment(n)
    a = 1.0
    for i = 1:n
        a = a + 0.01
    end
    a
end

increment(100) > 2
```

```julia
(increment(100) - 2) / eps(2.0)
```

Each time we add 0.01, we incur a small roundoff error, since 0.01 is not exactly representable as a `Float64`. These roundoff errors accumulate and lead to a result which is two ticks to the right of 2.0.

It is often more appropriate to compare real numbers using `≈` (typed as `\approx«tab»`), which checks that two numbers differ by a small amount relative to their magnitude.
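
For instance:

```julia
# `≈` (isapprox) tolerates tiny relative differences that `==` rejects.
0.2 + 0.1 == 0.3     # false
0.2 + 0.1 ≈ 0.3      # true
1.0 ≈ 1.0 + 1e-10    # true: well within the default tolerance
```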

**Exercise**

Guess what value the following code block returns. Run it and see what happens. Discuss why your initial guess was correct or incorrect, and suggest a value near 0.1 that you could use in place of 0.1 to get the expected behavior.

```julia
function increment_till(t, step=0.1)
    x = 0.5
    while x < t
        x += step
    end
    x
end

increment_till(1.0)
```

*Solution.* It's reasonable to guess that the returned value will be 1.0. However, it's actually approximately 1.1. The reason is that repeatedly adding the Float64 representation of 0.1 starting from 0.5 produces running totals slightly *smaller* than the exact values, so the total passes through 0.9999999999999999 rather than 1.0. It turns out that 0.6 (the real number) is 20% of the way from the Float64 tick before it to the Float64 tick after it:

```julia
(6//10 - floor(2^53 * 6//10) // 2^53) * 2^53
```

This means that the Float64 sum `0.5 + 0.1` rounds *down*, to the tick just below 0.6. The subsequent sums likewise fall slightly short of the exact values 0.7, 0.8, 0.9, and 1.0, so the condition `x < t` is still satisfied when `x` reaches 0.9999999999999999, and the loop runs one extra time.

We could use 1/8 = 0.125 instead of 0.1 to get the expected behavior, since small inverse powers of 2 and their sums with small integers can be represented exactly as 64-bit floating point numbers.
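
We can confirm that a power-of-two step behaves as expected:

```julia
function increment_till(t, step=0.1)
    x = 0.5
    while x < t
        x += step
    end
    x
end

increment_till(1.0)          # ≈ 1.1: the loop overshoots
increment_till(1.0, 0.125)   # exactly 1.0: every partial sum is exact
```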