Error is the discrepancy between a quantity and the value used to represent it in the program. A result is accurate if its error is small. If $\widehat{x}$ is an approximation for $x$, then
- the absolute error is $\widehat{x} - x$, and
- the relative error is $\frac{\widehat{x} - x}{x}$.
We are usually more interested in relative error, since the relevance of an error is usually in proportion to the quantity being represented. For example, misreporting the weight of an animal by one kilogram would be much more significant if the animal were a squirrel than if it were a blue whale.
`sqrt(200.0)`, which returns the `Float64` square root of 200, yields 14.142135623730951. The actual decimal representation of $\sqrt{200}$ is

$$14.14213562373095048801688724\ldots$$

The difference between these values, approximately $5.1 \times 10^{-16}$, is the absolute error, and approximately $3.6 \times 10^{-17}$ is the relative error.
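These figures are easy to check in Julia by comparing against a higher-precision reference value (a quick sketch using `BigFloat`):

```julia
approx = sqrt(200.0)                 # Float64 approximation
exact = sqrt(big"200.0")             # BigFloat reference value
abs_error = abs(approx - exact)      # ≈ 5.1e-16
rel_error = abs_error / abs(exact)   # ≈ 3.6e-17
```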
Sources of Numerical Error
There are a few categories of numerical error.
Roundoff error comes from rounding numbers to fit them into a floating point representation.
`0.2 + 0.1` is equal to 0.30000000000000004 in `Float64` arithmetic. The discrepancy between 0.3 and this value is roundoff error.
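You can see this directly at the REPL:

```julia
0.2 + 0.1          # 0.30000000000000004
0.2 + 0.1 == 0.3   # false, because of roundoff in each literal and in the sum
```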
Truncation error comes from using approximate mathematical formulas or algorithms.
The Maclaurin series of $\sin x$ is $x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$, so approximating $\sin x$ as $x - \frac{x^3}{6}$ yields a truncation error equal to the sum of the omitted terms, $\frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$.
Newton's method approximates a zero of a function $f$ by starting with a value $x_0$ near the desired zero and defining $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$ for all $n \geq 0$.
Under certain conditions, $x_n$ converges to a zero of $f$ as $n \to \infty$. The discrepancy between $x_n$ and $\lim_{m\to\infty} x_m$ is the truncation error associated with stopping Newton's method at the $n$th iteration.
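For concreteness, here is a minimal sketch of Newton's method applied to $f(x) = x^2 - 2$, whose positive zero is $\sqrt{2}$ (the function and starting value are illustrative choices, not from the text above):

```julia
function newton(f, df, x0, n)
    x = x0
    for _ in 1:n
        x -= f(x) / df(x)   # Newton iteration
    end
    x
end

f(x)  = x^2 - 2
df(x) = 2x                  # derivative of f
newton(f, df, 1.0, 5) - sqrt(2)   # truncation error from stopping at 5 iterations (very small here)
```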
We may approximate an infinite series $\sum_{k=1}^{\infty} a_k$ using the partial sum $\sum_{k=1}^{N} a_k$ for some large $N$. The error associated with this approximation is a type of truncation error.
Statistical error arises from using randomness in an approximation.
We can approximate the average height of a population of 100,000 people by selecting 100 people uniformly at random and averaging their measured heights. The error associated with this approximation is an example of statistical error.
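As a sketch of how such an estimate might look in Julia (the population here is synthetic, just for illustration):

```julia
using Statistics, Random
Random.seed!(1)
heights = 150 .+ 30 .* rand(100_000)   # hypothetical population of 100,000 heights
sample = rand(heights, 100)            # 100 people selected uniformly at random
mean(sample) - mean(heights)           # statistical error of the estimate
```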
Discuss the error in each of the following scenarios using the terms roundoff error, truncation error, or statistical error.
- We use the trapezoid rule with 1000 trapezoids to approximate $\int_a^b f(x)\,\mathrm{d}x$.
- We are trying to approximate $f'(5)$ for some function `f` that we can compute, and we attempt to do so by running `(f(5 + 0.5^100) - f(5))/0.5^100`. We fail to get a reasonable answer.
- To approximate the minimum of a function $f$, we evaluate $f$ at 100 randomly selected points in its domain and return the smallest value obtained.
- The more trapezoids we use, the more accurate our answer will be. The difference between the exact answer and the value we get when we stop at 1000 trapezoids is truncation error.
- The real problem here is roundoff error: `5 + 0.5^100` gets rounded off to `5.0`, so the numerator will always evaluate to 0 (see the snippet after this list). However, even if we used a `BigFloat` version of each of these values, there would still be truncation error in this approximation, since the expression we used was obtained by cutting off the limit in the definition of the derivative at a small but positive increment size.
- This is an example of statistical error, since the output of the algorithm depends on the randomness we use to select the points.
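Here is the roundoff problem from the second scenario, made concrete with a hypothetical function `f` (any computable function exhibits the same behavior):

```julia
f(x) = x^2
5 + 0.5^100 == 5.0                   # true: the tiny increment is lost in rounding
(f(5 + 0.5^100) - f(5)) / 0.5^100    # 0.0, nowhere near the true derivative value 10
```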
The derivative of a single-variable function may be thought of as a measure of how the function stretches or compresses absolute error:

$$|f(x + \Delta x) - f(x)| \approx |f'(x)|\,|\Delta x|.$$
The condition number of a function measures how it stretches or compresses relative error. Just as the derivative helps us understand how small changes in input transform to small changes in output, the condition number tells us how a small relative error in the initial data of a problem affects the relative error of the solution. We will use the variable $a$ to denote a problem's initial data and $S(a)$ to denote the solution of the problem with initial data $a$.
The condition number of a function is defined to be the absolute value of the ratio of the relative change in output of the function to a very small relative change in the input. The condition number of a problem is the condition number of the function which maps the problem's initial data to its solution.
If $S$ is the map from the initial data $a$ of a problem to its solution $S(a)$, then the condition number $\kappa(a)$ of the problem is

$$\kappa(a) = \lim_{\Delta a \to 0}\left|\frac{\left(S(a + \Delta a) - S(a)\right)/S(a)}{\Delta a / a}\right| = \left|\frac{a\,S'(a)}{S(a)}\right|.$$
Show that the condition number of $a \mapsto a^p$ is constant, for any $p \neq 0$.
Solution. We have

$$\kappa(a) = \left|\frac{a \cdot p\,a^{p-1}}{a^p}\right| = |p|$$

for all $a \neq 0$.
Show that the condition number of the function $a \mapsto a - 1$ is very large for values of $a$ near 1.
Solution. We substitute into the formula for condition number and get

$$\kappa(a) = \left|\frac{a \cdot 1}{a - 1}\right| = \left|\frac{a}{a-1}\right|$$

for values of $a$ near 1. This expression goes to infinity as $a \to 1$, so the condition number is very large.
Subtracting 1 from two numbers near 1 preserves their absolute difference while making the values themselves much smaller, so it greatly amplifies their relative difference. For example, 1.001 and 1.002 differ by about 0.1%, while 0.001 and 0.002 differ by 100%.
If $a = 1.01$, then the solution of the equation $x = a - 1$ is $x = 0.01$. If we change the initial data to $a + \Delta a$, then the solution changes to $0.01 + \Delta a$, which represents a relative change of

$$\frac{\Delta a}{0.01}$$

in the solution. The relative change in input is $\frac{\Delta a}{1.01}$, so taking the absolute value of the ratio of $\frac{\Delta a}{0.01}$ to $\frac{\Delta a}{1.01}$ and sending $\Delta a \to 0$, we see that the condition number of this problem is $\frac{1.01}{0.01} = 101$.
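We can check this numerically with a small increment (a sketch; the specific value of $\Delta a$ is an arbitrary choice):

```julia
S(a) = a - 1                         # map from initial data to solution
a, Δa = 1.01, 1e-10
rel_out = (S(a + Δa) - S(a)) / S(a)  # relative change in solution
rel_in  = Δa / a                     # relative change in initial data
abs(rel_out / rel_in)                # ≈ 101
```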
Consider a function $f$. If the input changes from $a$ to $a + \Delta a$ for some small value $\Delta a$, then the output changes to approximately $f(a) + f'(a)\,\Delta a$. Calculate the ratio of the relative change in the output to the relative change in the input, and show that you get

$$\frac{a\,f'(a)}{f(a)}.$$
Solution. The relative change in output is

$$\frac{f(a + \Delta a) - f(a)}{f(a)} \approx \frac{f'(a)\,\Delta a}{f(a)},$$

and the relative change in input is $\frac{\Delta a}{a}$. Dividing these two quantities gives

$$\frac{f'(a)\,\Delta a / f(a)}{\Delta a / a} = \frac{a\,f'(a)}{f(a)}.$$
More generally, if the initial data $\mathbf{a}$ of a problem is in $\mathbb{R}^n$ and the solution $S(\mathbf{a})$ is in $\mathbb{R}^m$, then the condition number is defined to be

$$\kappa(\mathbf{a}) = \frac{\|\mathbf{a}\|}{\|S(\mathbf{a})\|}\left\|\frac{\partial S}{\partial \mathbf{a}}\right\|,$$

where $\left\|\frac{\partial S}{\partial \mathbf{a}}\right\|$ denotes the operator norm of the Jacobian matrix of $S$.
Well conditioned and ill-conditioned problems
If the condition number of a problem is very large, then small errors in the problem data lead to large changes in the result. A problem with large condition number is said to be ill-conditioned. Unless the initial data can be specified with correspondingly high precision, it will not be possible to solve the problem meaningfully.
Example. Consider the following matrix equation for $x$ and $y$:

$$\begin{bmatrix} a & 3 \\ 6 & 9 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}.$$

Find the values of $a$ for which solving this equation for $[x, y]$ is ill-conditioned.
Solution. If $a \neq 2$, then the solution of this equation is

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \dfrac{7}{3(a-2)} \\[2mm] \dfrac{5a - 24}{9(a-2)} \end{bmatrix}.$$
Using the formula for $\kappa$ above, we can work out (after several steps) that

$$\kappa(a) = \frac{7\sqrt{13}\,|a|}{|a-2|\sqrt{441 + (5a-24)^2}}.$$

If $a$ is very close to 2, then $\kappa(a)$ is very large, and the problem is ill-conditioned:
```julia
[2.01 3; 6 9] \ [4; 5]   # ≈ [233.33, -155.0]
[2.02 3; 6 9] \ [4; 5]   # ≈ [116.67, -77.22]
```

Changing $a$ by half a percent changes the solution by a factor of about 2.
Machine epsilon, denoted $\epsilon_{\mathrm{mach}}$, is the maximum relative error associated with rounding a real number to the nearest value representable as a given floating point type. For `Float64`, this value is $\epsilon_{\mathrm{mach}} = 2^{-53} \approx 1.11 \times 10^{-16}$.
A competing convention, more widely used outside academia, defines $\epsilon_{\mathrm{mach}}$ to be the difference between 1 and the next representable number, which for `Float64` is $2^{-52} \approx 2.22 \times 10^{-16}$. This is the value returned by `eps()` in Julia. Since we typically introduce a relative error on the order of $\epsilon_{\mathrm{mach}}$ to encode the initial data of a problem, the relative error of the computed solution should be expected to be no smaller than $\kappa\,\epsilon_{\mathrm{mach}}$ (where $\kappa$ is the problem's condition number), regardless of the algorithm used.
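In Julia, these quantities can be inspected directly:

```julia
eps()       # 2.220446049250313e-16, the gap between 1.0 and the next Float64
2.0^-53     # 1.1102230246251565e-16, the rounding-convention machine epsilon
eps(2.0)    # 4.440892098500626e-16: the gap doubles at 2.0
```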
An algorithm used to solve a problem is stable if it is approximately as accurate as the condition number of the problem allows. In other words, an algorithm is unstable if the answers it produces have relative error many times larger than $\kappa\,\epsilon_{\mathrm{mach}}$.
Consider the problem of evaluating $f(x) = \sqrt{x^2 + 1} - 1$ for values of $x$ near 0. Show that the problem is well-conditioned, but the algorithm of evaluating the expression following the order of operations is unstable.
Comment on whether there are stable algorithms for evaluating $f$ near $x = 0$.
Solution. Substituting this function into the condition number formula, we find that

$$\kappa(x) = \left|\frac{x\,f'(x)}{f(x)}\right| = \frac{x^2/\sqrt{x^2+1}}{\sqrt{x^2+1} - 1} = 1 + \frac{1}{\sqrt{x^2+1}}.$$

Therefore, $\kappa(x) \to 2$ as $x \to 0$, which means that this problem is well-conditioned at 0. However, the algorithm of substituting directly includes an ill-conditioned step: subtracting 1.
What's happening is that a roundoff error of approximately $\epsilon_{\mathrm{mach}}$ is introduced when $\sqrt{x^2+1}$ is rounded to the nearest `Float64`. When 1 is subtracted, we still have an absolute error of around $\epsilon_{\mathrm{mach}}$. Since $\sqrt{x^2+1} - 1 \approx \frac{x^2}{2}$ for small $x$, the relative error in the value we find for $\sqrt{x^2+1} - 1$ will be approximately $\frac{2\epsilon_{\mathrm{mach}}}{x^2}$. If $x$ is small, this will be many times larger than $\kappa\,\epsilon_{\mathrm{mach}} \approx 2\epsilon_{\mathrm{mach}}$.
There are stable algorithms for approximating $\sqrt{x^2+1} - 1$ near $x = 0$. For example, we could use the Taylor series

$$\sqrt{x^2+1} - 1 = \frac{x^2}{2} - \frac{x^4}{8} + \frac{x^6}{16} - \cdots$$

and approximate $\sqrt{x^2+1} - 1$ as a sum of the first several terms on the right-hand side. Since power functions are well-conditioned (and performing the subtractions is also well-conditioned as long as $x$ is small enough that each term is much smaller than the preceding one), this algorithm is stable. Alternatively, we can use the identity

$$\sqrt{x^2+1} - 1 = \frac{x^2}{\sqrt{x^2+1} + 1},$$

which can be obtained by multiplying by $\frac{\sqrt{x^2+1}+1}{\sqrt{x^2+1}+1}$ and simplifying the numerator. Substituting into this expression is stable, because adding 1, square rooting, and reciprocating are well-conditioned.
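The instability and the fix are easy to see numerically (a sketch; the value of `x` is an arbitrary small number):

```julia
x = 1e-8
sqrt(x^2 + 1) - 1            # 0.0: catastrophic cancellation destroys the answer
x^2 / (sqrt(x^2 + 1) + 1)    # ≈ 5.0e-17, essentially full accuracy
```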
Matrix condition number
The condition number $\kappa(A)$ of an $n \times n$ matrix $A$ is defined to be the maximum condition number of the function $\mathbf{x} \mapsto A\mathbf{x}$ as $\mathbf{x}$ ranges over $\mathbb{R}^n$. The condition number of $A$ can be computed using its singular value decomposition:
Show that the condition number of a matrix $A$ is equal to the ratio of its largest and smallest singular values, $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$.
Interpret your result by explaining how to choose two vectors with small relative difference which are mapped to two vectors with large relative difference by $A$, assuming that $A$ has a singular value which is many times larger than another.
Solution. The derivative of the transformation $\mathbf{x} \mapsto A\mathbf{x}$ is the matrix $A$ itself, and the operator norm of $A$ is equal to its largest singular value $\sigma_{\mathrm{max}}$. Therefore, to maximize $\kappa(\mathbf{x}) = \frac{\|\mathbf{x}\|}{\|A\mathbf{x}\|}\|A\|$, we minimize the ratio $\frac{\|A\mathbf{x}\|}{\|\mathbf{x}\|}$. This ratio is minimized when $\mathbf{x}$ is the right singular vector with the least singular value, in which case $\frac{\|A\mathbf{x}\|}{\|\mathbf{x}\|} = \sigma_{\mathrm{min}}$. Therefore, the maximum possible value of $\kappa(\mathbf{x})$ is the ratio of the largest singular value of $A$ to the smallest singular value of $A$. For the interpretation, take $\mathbf{v}$ to be the right singular vector with the smallest singular value and perturb it by a small multiple of the right singular vector with the largest singular value: the perturbation is small relative to $\mathbf{v}$, but its image is large relative to $A\mathbf{v}$.
Find the condition number of the function $\mathbf{x} \mapsto A\mathbf{x}$, where `A = [1 2; 3 4]`, and show that there is a vector $\mathbf{v}$ and an error $\mathbf{e}$ for which the relative error is indeed magnified by approximately the condition number of $A$.
```julia
using LinearAlgebra
A = [1 2; 3 4]
```
Solution. We choose $\mathbf{v}$ and $\mathbf{e}$ to be columns of $V$ in the singular value decomposition $A = U \Sigma V'$ of $A$: $\mathbf{v}$ is the right singular vector with the least singular value, and $\mathbf{e}$ is the one with the greatest.
```julia
using LinearAlgebra
A = [1 2; 3 4]
U, S, V = svd(A)
σmax, σmin = S    # singular values, in decreasing order
κ = σmax/σmin     # condition number of A
v = V[:,2]        # right singular vector with the least singular value
e = V[:,1]        # error along the right singular vector with the greatest singular value
rel_error_output = norm(A*(v+e) - A*v)/norm(A*v)
rel_error_input = norm(v + e - v) / norm(v)
rel_error_output / rel_error_input, κ
```
Integer or floating point arithmetic can overflow, and may do so without warning.
In September 2013, NASA lost touch with the Deep Impact space probe because systems on board tracked time as a 32-bit unsigned integer number of tenth-second increments from January 1, 2000. The number of such increments reached the maximum size of such an integer in August of 2013.
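The same kind of silent wraparound is easy to demonstrate with 32-bit integers in Julia:

```julia
typemax(Int32)             # 2147483647
typemax(Int32) + Int32(1)  # -2147483648: silent wraparound, no warning
```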
Errors resulting from performing ill-conditioned subtractions are called catastrophic cancellation.
Approximating $\sqrt{10^6 + 1} - \sqrt{10^6}$ with the result of `sqrt(10^6 + 1) - sqrt(10^6)`, we get a relative error on the order of $10^{-10}$, while using `1/(sqrt(10^6 + 1) + sqrt(10^6))` gives a relative error on the order of $10^{-16}$ (more than a thousand times smaller).
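One way to see the difference is to compare both expressions to a high-precision reference (a sketch using `BigFloat`):

```julia
exact  = sqrt(big(10)^6 + 1) - sqrt(big(10)^6)   # high-precision reference value
direct = sqrt(10^6 + 1) - sqrt(10^6)             # suffers catastrophic cancellation
stable = 1/(sqrt(10^6 + 1) + sqrt(10^6))         # algebraically equivalent, stable
abs(direct - exact)/exact, abs(stable - exact)/exact
```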
Use your knowledge of floating point arithmetic to explain why computing $\sqrt{10^6+1} - \sqrt{10^6}$ directly is much less precise than computing $\dfrac{1}{\sqrt{10^6 + 1} + \sqrt{10^6}}$.
Solution. The gaps between successive representable positive values get wider as we move to the right on the number line. Therefore, the error of the first calculation is the roundoff error associated with calculating `sqrt(10^6+1)`, which is roughly $10^{-13}$ in absolute terms (the gap between successive `Float64` values near 1000 is about $1.1 \times 10^{-13}$). Since the difference being computed is only about $5 \times 10^{-4}$, the relative error is on the order of $10^{-10}$.
The relative error in `1/(sqrt(10^6 + 1) + sqrt(10^6))`, meanwhile, is approximately the same as the relative error in the calculation of `sqrt(10^6 + 1) + sqrt(10^6)` (since the condition number of the reciprocal function is equal to 1). This relative error is only about $10^{-16}$, since no cancellation occurs in the sum.
If you rely on exact comparisons for floating point numbers, be alert to the differences between `Float64` arithmetic and real number arithmetic:
```julia
function increment(n)
    a = 1.0
    for i = 1:n
        a = a + 0.01
    end
    a
end
```
```julia
increment(100) > 2                 # true
(increment(100) - 2) / eps(2.0)    # 2.0
```
Each time we add `0.01`, we have to round off the result to represent it as a `Float64`. These roundoff errors accumulate and lead to a result which is two ticks to the right of 2.0.
It is often more appropriate to compare real numbers using `≈` (typed as `\approx`«tab»), which checks that two numbers $x$ and $y$ differ by at most $\sqrt{\epsilon_{\mathrm{mach}}}\,\max(|x|, |y|)$.
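For example, using the `increment` function defined above:

```julia
increment(100) == 2.0   # false
increment(100) ≈ 2.0    # true: the difference is within the ≈ tolerance
```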
Guess what value the following code block returns. Run it and see what happens. Discuss why your initial guess was correct or incorrect, and suggest a value near 0.1 that you could use in place of 0.1 to get the expected behavior.
```julia
function increment_till(t, step=0.1)
    x = 0.5
    while x < t
        x += step
    end
    x
end

increment_till(1.0)
```
Solution. It's reasonable to guess that the returned value will be 1.0. However, it's actually approximately 1.1. The reason is that adding the `Float64` representation of 0.1 five times starting from 0.5 results in a number slightly smaller than 1.0. It turns out that 0.6 (the real number) is 20% of the way from the `Float64` tick before it to the `Float64` tick after it, as the following exact rational computation shows:

```julia
(1//10 - floor(2^53 * 1//10) // 2^53) * 2^53    # returns 1//5
```

This means that the `Float64` sum falls slightly behind the mathematical sum with each addition of 0.1. By the time we get to 1.0, we've lost a full tick spacing, so after 5 iterations `x` is equal to `0.9999999999999999`, the largest `Float64` smaller than 1.0. Since this value is still less than `t`, the loop runs once more, and the returned value is approximately 1.1.
We could use 1/8 = 0.125 instead of 0.1 to get the expected behavior, since small inverse powers of 2 and their sums with small integers can be represented exactly as 64-bit floating point numbers.
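With step 0.125, every partial sum is exactly representable, so the loop stops exactly at `t`:

```julia
increment_till(1.0, 0.125)   # returns exactly 1.0
```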