Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
MATH256 2021–22 Numerical Methods
Errors and rounding
Main Message in one sentence:
If all numbers are allowed to have a fixed number of digits (throwing away the rest)
e.g. x = 0.x1x2...x10 × 10n for 10 digits, what happens to subsequence operations or the accuracy?
In Part 1 (or chapter 1), we encountered the situation in which we needed to break a loop to obtain an
approximate value for an infinite series. We used the condition
|tn+1| < 0.5× 10−10 × |Sn|, (∗)
where Sn is the partial sum up to term n and tn+1 is the first term omitted. In this chapter, we will examine
errors and approximations in more detail. In the process, we will see the justification for (∗).
2.1 Computer storage and correct rounding
Any nonzero x ∈ R can be written in scientific notation; that is
x = a× 10p, where 0.1 ≤ |a| < 1 and p ∈ Z.
The first nonzero digit of a always appears immediately after the decimal point. For example,
0.01206 = 0.1206× 10−1,
203.002 = 0.203002× 103,
and
−pi = −0.31415 . . . × 101.
Computers store numerical values in a similar way, using a fixed number of significant digits. Consequently,
most operations are subject to rounding error. With the default settings, Maple uses ten digits, so a typical
number a˜ in Maple (omitting exponent)
a˜ = 0.a1a2 . . . a10
= a1 × 10−1 + a2 × 10−2 + . . .+ a10 × 10−10.
Here the digits aj are integers between 0 and 9 inclusive, except a1, which cannot be zero (unless a˜ = 0
exactly). We care about how an exact and nearby number a = 0.a1a2 . . . a10, a11, . . . a50 . . . is approximated
by a˜ by rounding. The correctly rounded approximation to a is the most accurate approximation stopping
at a10. The true value for a then satisfies
|a− a˜| ≤ 5× 10−11 = 0.5× 10−10.
Note that numbers ending in 5 can be equidistantly rounded in two wayss, though traditional rounding
chooses one way.
For example, to round the 11-digit number a = 0.12345678905 to only 10 digits, we can take
a˜ = 0.1234567890 or a˜ = 0.1234567891
with |a− a˜| = 5× 10−11 in both cases. This is called a tie. The elementary rule ‘5 rounds up’ is usually not
used to resolve these because always rounding up can cause errors to accumulate. By default, Maple resolves
ties by requiring that the final stored digit is even. Each error caused by a tie then has a 50% chance of
being positive and a 50% chance of being negative, so they are unlikely to accumulate. Of course, numbers
not ending in 5 never encounter such a tie situations e.g. a = 0.12345678914→ a = 0.1234567891 with
no other choices.
2-2
MATH256 2021–22 Numerical Methods (5 Parts)
Remarks:
• We may visualise why the digit 5 is special in rounding from below∣∣∣(10)∣∣∣−−− 9−−− 8−−− 7−−− 6−−− [5]−−− 4−−− 3−−− 2−−− 1−−− ∣∣∣(0)∣∣∣
• If x and y are stored exactly (and y 6= 0), then the result of x ± y, xy and x/y will be correctly
rounded, as will √x and √y. However, this applies for a single operation only. Maintaining correct
rounding throughout a long calculation is more or less impossible.
• In other words, rounding error cannot be avoided, so we must accept that the last digit in a result is
likely to be wrong (perhaps more in long calculations).
• Although it is not our intention to dig deep into the process of how to implement these elementary
operations, it is not hard to see that one must take care of the components p first and then
operate between the decimal numbers a (often called mantissa) as below. Assume p2 ≥ p1 for
x1 = a1 × 10p1 , x2 = a2 × 10p2 :
⇒ x2 ± x1 = (a2 × 10p2−p1 ± x1)× 10p1 ≈ a˜3 × 10p˜3 ,
⇒ x1/x2 = (a1/a2)× 10−(p2−p1) ≈ a˜4 × 10p˜4 ,
⇒ x1 × x2 = (a1 × a2)× 10p1+p2 ≈ a˜5 × 10p˜5 ,
where possibly p˜4 = −(p2−p1), 1−(p2−p1) and a˜5 = p1+p2, p1+p2−1 since 0.1 < |a1/a2| < 10
and |a1 × a2| < 1.
• The number of digits stored by Maple can be increased by changing the parameter Digits. Remember:
this controls the accuracy to which numbers are stored at each stage of a calculation. It does not
guarantee the number of correct significant figures in the final answer.
• Rounding error places a bound on accuracy. Trying to beat this (e.g. by including more terms in a
series) will never succeed.
Example
Suppose that x = 0.13, and consider the two approximations
cos(x) ≈ 1− x
2
2! +
x4
4! and cos(x) ≈ 1−
x2
2! +
x4
4! −
x6
6! .
There are two sources of error here: rounding error and the truncation of an infinite series. Using ten digit
arithmetic, we obtain
cos(x) ≈ 0.9915619004 and cos(x) ≈ 0.9915618937.
The second approximation is more accurate, so we have made an improvement by including an extra term
in the series. In fact, the second result is correctly rounded. Adding more terms cannot further improve the
accuracy unless we also decrease rounding error by including more digits.
2-3
MATH256 2021–22 Numerical Methods (5 Parts)
2.2 Absolute and relative error
Notation: For any exact quantity x, x˜ represents an approximate value.
The absolute error is given by
Ex = x˜− x.
The significance of absolute error depends on the scale of the numbers in question.
• A 5mm error in measuring out a football pitch is not important.
• A 5mm error in eye surgery is likely to be disastrous.
To take the magnitudes of x into account, we can use the relative error:
εx =
x˜− x
x
= x˜
x
− 1.
The percentage error is 100× |εx|. Relative and percentage error are undefined if x = 0.
Remark: In general, the sign of the error (absolute or relative) does not matter, only the magnitude is
important.
2.2.1 Example: decimal places
This is a measure of absolute error; it takes no account of the magnitude of the numbers we are dealing
with. We must define exactly what we mean when we say an approximation is accurate to n decimal places.
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Attempt 1: Count matching digits after the decimal point.
Counterexample: x = 1.00001
x˜1 = 1.00200
x˜2 = 0.99991.
Clearly, x˜1 approximates x to two places, but the more accurate approximation is x˜2, in which none of the
digits after the decimal point are the same as those of x.
Attempt 2: Round x˜ to n decimal places, then check if the results are the same.
This fixes the above problem; x˜2 = 1.000 to three places and 0.9999 to four, so x˜2 = x to 3 d.p.
Counterexample: x = 1.530
x˜1 = 1.535
x˜2 = 1.525.
Here we have a tie (see section 2.1): x˜1 and x˜2 are equidistant from x. It makes no sense to claim that
either approximation is ‘better.’ It also makes no sense to allow the number of correct d.p. to be influenced
by the mechanism used for resolving ties.
2-4
MATH256 2021–22 Numerical Methods (5 Parts)
Attempt 3: Consider the difference between x and x˜. If |x− x˜| < 10−n, then there are n zeros after the
decimal point and x˜ is accurate to n places.
Counterexample: x = 1.560
x˜ = 1.55001
Clearly, x˜ is accurate to one place, but x− x˜ = 0.00999, which has two zeros after the decimal. To prevent
this, we require the n+ 1th digit after the decimal point in x˜− x to be less than 5, so that it does not
round up. Hence (finally), we have the following definition.
To state that x˜ is an approximation to x, accurate to n decimal places, means that
|x− x˜| < 5× 10−(n+1) = 12 × 10−n.
· · · · · · · · · · · · · · · · · · · · · · · · · · · ·
2.2.2 Example: significant figures
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
This is a measure of relative error. Again, we must make a precise definition. We begin by writing x using
scientific notation, that is
x = a× 10p, where 0.1 ≤ |a| < 1 and p ∈ Z.
Since the first nonzero digit in a comes immediately after the decimal point, n decimal places in a is
equivalent to n significant figures in x. Thus, if
x˜ = a˜× 10p
then x˜ approximates x to n significant figures if and only if
|a− a˜| < 12 × 10−n. (∗)
Now
a− a˜
a
= 10
p × (a− a˜)
10p × a =
x− x˜
x
,
so (∗) yields ∣∣∣∣x− x˜x
∣∣∣∣× |a| < 12 × 10−n.
Finally, since |a| < 1, we have the following:
If
∣∣∣∣x− x˜x
∣∣∣∣ ≤ 12 × 10−n then x˜ approximates x to n significant figures.
· · · · · · · · · · · · · · · · · · · · · · · · · · · ·
2-5
MATH256 2021–22 Numerical Methods (5 Parts)
Important remark: This is a sufficient but not necessary condition. It is stronger than necessary
because we have dropped |a| from the left-hand side. If n is the largest integer for which it holds, then we
definitely have n s.f. and we may have n+ 1 (a could be as small as 0.1). We definitely do not have n+ 2
s.f.
2.2.3 Example approximation of e
• If we approximate e using
e˜ = 27211001 ,
then
εe =
e˜− e
e =
2721
1001e − 1 ≈ −0.405× 10
−7.
The test shows that we have 7 s.f. but we may have 8. We don’t because
e ≈ 2.718281828
whereas
e˜ ≈ 2.718281718.
• If x = 1 and x˜ = 1.049, then
εx =
1.049− 1
1 = 0.049 < 0.5× 10
−1.
The test shows that we have 1 s.f. but we may have 2 (we do).
• If x = 1 and x˜ = 1.051, then
εx =
1.051− 1
1 = 0.051 < 0.5× 10
0.
The test shows that we may have 1 s.f. (we do).
• If ε > 0.5, then the largest possible value for n is actually −1, so we definitely don’t have 1 s.f. An
approximation with a relative error greater than 0.5 is largely useless.
2.2.4 Series summation condition
Consider a series in which the sum up to term n is Sn, and tn+1 is the first term omitted.
· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
Let x = Sn+1 and x˜ = Sn. Then
x− x˜
x
= Sn+1 − Sn
Sn+1
= tn+1
Sn + tn+1
= tn+1
Sn
× 11 + (tn+1/Sn) .
If |tn+1| < 0.5× 10−10 × |Sn| then the last term is very close to 1, so Sn+1 and Sn are (almost certainly)
the same to 10 s.f.
2-6
MATH256 2021–22 Numerical Methods (5 Parts)
· · · · · · · · · · · · · · · · · · · · · · · · · · · ·
This explains the condition we use for breaking a loop in a series summation: nothing is achieved by
including tn+1 in the sum.
Remarks: (i) Careless use of this condition can produce incorrect results (see Practice Problem 3.2 for
example). However, it is a good practical choice in many cases.
(ii) The above study on Sn and and Sn+1 is not a complete analysis, because it does not ensure
that the remainder |Rn| = |∑∞k=n+1 tk| is less than 0.5× 10−10. It is not trivial to give a general formula.
2.3 Cancellation
Rounding error alone is not usually problematic, but certain operations can magnify existing errors, which is
much more dangerous. Consider the example shown in quadratic.mw. Here, the quadratic equation
t2 − 42t+ 1 = 0
is solved using the standard formula, with the result
t1 = 41.97617696 and t2 = 0.02382304.
The second root has only seven significant digits; three have been lost. A few such operations will produce
results that are completely wrong, so we need to investigate what has happened.
2.4 Error propagation
We can use the fact that εx = x˜/x− 1, and so
x˜ = (1 + εx)x,
to study how errors propagate through calculations. For a given operation f , we need to find an expression
for the relative error in f(x˜); that is
εf =
f(x˜)− f(x)
f(x)
in terms of x and εx. Note that we are not concerned here with the error generated by computing f (which
should be small). Rather, we are concerned with the effect that f has on the pre-existing error in x˜. As we
will see, this can be significant.