
Numerical Analysis

\( \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \DeclareMathOperator{\argmax}{arg\,max} \DeclareMathOperator{\argmin}{arg\,min} \DeclareMathOperator{\pto}{\rightharpoonup} \DeclareMathOperator{\LIM}{LIM} % algebra \DeclareMathOperator{\Hom}{Hom} \DeclareMathOperator{\Span}{Span} \DeclareMathOperator{\Ran}{Ran} \DeclareMathOperator{\Ker}{Ker} \DeclareMathOperator{\Nul}{Nul} \DeclareMathOperator{\Col}{Col} \DeclareMathOperator{\Row}{Row} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\proj}{proj} \DeclareMathOperator{\diag}{diag} % probability and statistics \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\prob}{\mathbb{P}} \DeclareMathOperator{\indep}{\perp \!\!\! \perp} \DeclareMathOperator{\Var}{\mathbb{V}} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} % programs \newcommand{\sem}[1]{[\![ #1 ]\!]} \newcommand{\ans}[1]{[\![ #1 ]\!]} \DeclareMathOperator{\Val}{Val} \DeclareMathOperator{\Vars}{Vars} \DeclareMathOperator{\AExp}{AExp} \DeclareMathOperator{\BExp}{BExp} \DeclareMathOperator{\AND}{\mathsf{and}} \DeclareMathOperator{\OR}{\mathsf{or}} \DeclareMathOperator{\NOT}{\mathsf{not}} \DeclareMathOperator{\TRUE}{\mathsf{true}} \DeclareMathOperator{\FALSE}{\mathsf{false}} \DeclareMathOperator{\SKIP}{\mathsf{skip}} \DeclareMathOperator{\WHILE}{\mathsf{while}} \DeclareMathOperator{\DO}{\mathsf{do}} \DeclareMathOperator{\OD}{\mathsf{od}} \DeclareMathOperator{\IF}{\mathsf{if}} \DeclareMathOperator{\FI}{\mathsf{fi}} \DeclareMathOperator{\THEN}{\mathsf{then}} \DeclareMathOperator{\ELSE}{\mathsf{else}} \DeclareMathOperator{\FOR}{\mathsf{for}} \DeclareMathOperator{\TERM}{\mathsf{Term}} \DeclareMathOperator{\HALT}{\mathsf{Halt}} \DeclareMathOperator{\BREAK}{\mathsf{Break}} \DeclareMathOperator{\CONTINUE}{\mathsf{Continue}} \DeclareMathOperator{\COND}{\mathsf{cond}} \DeclareMathOperator{\UNDEFINED}{\mathsf{undefined}} \)

1. Preliminaries

1.1. Machine Numbers

\(k\)-digit decimal machine numbers are of the form \[ \pm 0.d_1 d_2 \dots d_{k} \times 10 ^{n}, \quad d_1 \in \{ 1, \dots , 9\}, d_{i} \in \{ 0, \dots , 9 \}, \forall i = 2, \dots , k. \]

The (\(k\)-digit) floating-point form of a positive real number \(y = 0.d_1 d_2 \dots d_{k} d_{k+1} \times 10 ^{n} \), denoted by \(fl(y)\), is obtained by terminating the mantissa of \(y\) at \(k\) decimal digits using one of these two methods:

  1. Chopping: Chop off the digits \(d_{k+1} d_{k+2} \dots \), giving \(fl(y) = 0.d_1 d_2 \dots d_{k} \times 10 ^{n}\).
  2. Rounding: Add \(5 \times 10 ^{n-(k+1)}\) to \(y\) (i.e., add 5 to \(d_{k+1}\)) and then chop the result to obtain a \(k\)-digit decimal machine number. Note that the exponent might change.
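
To make the two methods concrete, here is a minimal Python sketch (the function name fl and the use of math.log10 are my own choices, and it only handles \(y > 0\)):

  import math

  def fl(y, k, mode="chop"):
      """k-digit decimal machine-number form of y = 0.d1 d2 ... x 10^n, for y > 0."""
      n = math.floor(math.log10(y)) + 1   # exponent n so that y / 10**n lies in [0.1, 1)
      mantissa = y / 10**n
      scaled = mantissa * 10**k
      if mode == "chop":
          scaled = math.floor(scaled)      # drop d_{k+1}, d_{k+2}, ...
      else:                                # "round": add 5 to d_{k+1}, then chop
          scaled = math.floor(scaled + 0.5)
      return scaled / 10**k * 10**n

  # fl(math.pi, 5, "chop")   -> 3.1415
  # fl(math.pi, 5, "round")  -> 3.1416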

1.2. Numerical Stability

Suppose \(p ^{*}\) is an approximation to \(p\). The absolute error is \(|p - p ^{*}|\) and the relative error is \(\frac{|p - p ^{*}|}{|p|}\) (provided that \(p \neq 0\)).

Suppose \(E_0 > 0\) denotes an error at some stage in the calculations, and \(E_n\) represents the magnitude of the error after \(n\) subsequent steps. If \(E_n \approx C n E_0\) with \(C\) being a constant independent of \(n\), the growth of error is said to be linear. If \(E_n \approx C^n E_0\) with some \(C > 1\), the growth of error is said to be exponential.

Suppose \(\{\beta_n\}_{n=1}^{\infty}\) is a sequence that converges to zero, and \(\{\alpha_n\}_{n=1}^{\infty}\) converges to a number \(\alpha\). If a positive constant \(K\) exists for large \(n\) such that \(|\alpha_n - \alpha| \leq K|\beta_n|\), we say that \(\{\alpha_n\}_{n=1}^{\infty}\) converges to \(\alpha\) with an order of convergence \(O(\beta_n)\), written as \(\alpha_n = \alpha + O(\beta_n)\).


Suppose a sequence \(x_{n}\) converges to \(l\). Then it is said to have order of convergence \(q \geq 1\) and rate of convergence \(\mu\) if \[ \lim_{n \to \infty} \frac{|x_{n+1} - l|}{|x_{n} - l| ^{q}} = \mu \] for some positive constant \(\mu\) if \(q > 1\) and \(\mu \in (0, 1)\) if \(q = 1\).

2. Solutions of Equations in One Variable

Let \(f\) be a function. A root of the equation \(f(x) = 0\) is called a zero of \(f\).

2.1. Bisection Method

Recall: Intermediate Value Theorem (IVT) says that for a continuous function \(f\) defined on \([a, b]\) and any value \(\gamma\) between \(f(a)\) and \(f(b)\), there exists a point \(p \in [a,b]\) for which \(f(p) = \gamma\).

Assumption:

  • \(f(a)\) and \(f(b)\) have opposite signs, so by the IVT there is some zero in the interval.

Suppose \(f\) is defined and continuous on \([a, b]\) with \(f(a)\) and \(f(b)\) of opposite signs, and (for simplicity) the zero of \(f\) is unique in this interval. To find the zero \(p\) of \(f\) in \([a, b]\), the Bisection method generates a sequence \(p_{n}\) as follows:

  1. Initialize \(a_1 = a\) and \(b_1 = b\).
  2. For all \(i \geq 1\): Let \[ p_i = a_i + \frac{{b_i - a_i}}{2} = \frac{{a_i + b_i}}{2}. \]
    • (*) If \(f(p_i) = 0\), then \(p = p_i\), and we are done.
    • If \(f(p_i) \cdot f(a_i) < 0\), then \(p \in (a_i, p_i)\); set \(a_{i+1} = a_i\) and \(b_{i+1} = p_i\).
    • If \(f(p_i) \cdot f(a_i) > 0\), then \(p \in (p_i, b_i)\); set \(a_{i+1} = p_i\) and \(b_{i+1} = b_i\).

For (*), we can instead use stopping criteria with a tolerance: select a tolerance \(\epsilon > 0\), let the generated sequence be \(p_1, p_2, \dots \), and stop when one of the following is met:

  • \(|p_{i} - p_{i-1}| < \epsilon\)
  • \(|f(p_{i})| < \epsilon\)
  • relative error: \(|p_{i} - p_{i-1}| / |p_{i}| < \epsilon\), \(p_{i} \neq 0\).

P.S.:

  • “root that is accurate to at least \(\epsilon\)” means the root calculated with the relative error test as the stopping criterion.
  • \(p_{0} = \frac{a + b}{2}\) is only used for computing the error.
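
A minimal Python sketch of the Bisection method, using the absolute-difference stopping criterion listed above (the function name, tolerance, and iteration cap are my own choices):

  def bisection(f, a, b, eps=1e-8, max_iter=100):
      if f(a) * f(b) >= 0:
          raise ValueError("f(a) and f(b) must have opposite signs")
      p_prev = a
      for _ in range(max_iter):
          p = a + (b - a) / 2
          if f(p) == 0 or abs(p - p_prev) < eps:
              return p
          if f(p) * f(a) < 0:   # zero lies in (a, p)
              b = p
          else:                 # zero lies in (p, b)
              a = p
          p_prev = p
      return p

  # Example: bisection(lambda x: x**3 + 4*x**2 - 10, 1, 2)  ->  approximately 1.36523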

Suppose that \(f \in C[a, b]\) (continuous on \([a, b]\)) and \(f(a) \cdot f(b) < 0\). The Bisection method (with any stopping criterion) generates a sequence \(\{ p_{n} \}_{n=1} ^{\infty}\) approximating a zero \(p\) of \(f\) with

\begin{align*} |p_{n} - p| \leq \frac{b - a}{2 ^{n}}, n \geq 1 \end{align*}

For each \(n \geq 1\), we have \(b_{n} - a_{n} = (b - a) / 2 ^{n-1}\) and \(p \in (a_{n}, b_{n})\). Thus, \(|p_{n} - p| \leq (b_{n} - a_{n}) / 2 = (b - a) / 2 ^{n}\).

2.2. Fixed-Point Problem

The root-finding problem can be converted into the fixed-point problem by constructing a function \(g\) such that \(g(p) = p\) whenever \(p\) is a zero of the original function \(f\).

Given a function \(g(p)\) on \([a, b]\), a \(p\) satisfying \(g(p) = p\) is called a fixed point of \(g\).

If \(g \in C[a, b]\) and \(g(x) \in [a, b]\) for all \(x \in [a, b]\), then \(g\) has a fixed point in \([a, b]\).

Inspired by the geometric meaning of a fixed point, we define \(h(x) = g(x) - x\), so that \(p\) is a fixed point of \(g\) iff it is a zero of \(h\) in \([a, b]\).

  • If \(g(a) = a\) or \(g(b) = b\), then the existence of a fixed point is obvious.
  • Otherwise, \(g(a) > a\) and \(g(b) < b\), so \(h(a) = g(a) - a > 0\) and \(h(b) = g(b) - b < 0\); by the IVT, there exists a zero of \(h\) in \((a, b)\).

Suppose

  • \(g \in C[a, b]\) and \(g(x) \in [a, b]\) for all \(x \in [a, b]\),
  • and \(g'(x)\) exists on \((a, b)\) with \(|g'(x)| \leq k < 1\),

then \(g\) has a unique fixed point \(p \in [a, b]\).

Suppose for contradiction that there are two distinct fixed points \(p, q\). Intuitively, both fixed points lie on the line \(y = x\), but \(|g'| < 1\) everywhere, so the graph of \(g\) cannot meet that line twice.

Formally, by the MVT, there exists \(\xi\) between \(p\) and \(q\) (open interval, and hence in \((a, b)\)) such that

\begin{align*} |g(p) - g(q)| = |g'(\xi)| |p-q| \leq k |p - q| < |p - q|. \end{align*}

However, \(|p - q| = |g(p) - g(q)|\), which implies \(|p - q| < |p - q|\). This is a contradiction.

2.2.1. Fixed-Point Iteration

Suppose \(g\) is continuous on \([a, b]\) and \(g([a, b]) \subseteq [a, b]\). To find the fixed point of \(g\) in \([a, b]\), given an initial approximation \(p_0 \in [a, b]\), the Fixed-Point Iteration generates a sequence \[ p_{n} = g(p_{n-1}). \] If the stopping criterion \(|p_{n} - p_{n-1}| < \epsilon\) is met, then it terminates.

fixed_point.svg
Figure 1: Fixed Point Iteration (from wiki)
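
A minimal Python sketch of Fixed-Point Iteration with the stopping criterion \(|p_n - p_{n-1}| < \epsilon\) (the names and defaults are illustrative, not from the notes):

  def fixed_point(g, p0, eps=1e-8, max_iter=100):
      p_prev = p0
      for _ in range(max_iter):
          p = g(p_prev)
          if abs(p - p_prev) < eps:
              return p
          p_prev = p
      raise RuntimeError("no convergence within max_iter iterations")

  # Example: g = cos maps [0, 1] into itself with |g'(x)| <= sin(1) < 1,
  # so the iteration converges to the unique fixed point p ~ 0.739085:
  # import math; fixed_point(math.cos, 0.5)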

Suppose (same conditions as uniqueness)

  • \(g \in C[a, b]\) and \(g(x) \in [a, b]\) for all \(x \in [a, b]\),
  • and \(g'(x)\) exists on \([a, b]\) with \(|g'(x)| \leq k < 1\) for all \(x \in [a, b]\),

then the sequence \(p_{n}\) starting from any \(p_0 \in [a, b]\) satisfies

\begin{align*} |p_{n} - p| \leq k ^{n} |p_0 - p| \end{align*}

where \(p\) is the unique fixed point in \([a, b]\).

  1. Since \(g\) maps \([a, b]\) to \([a, b]\), every term of the sequence lies within \([a, b]\).
  2. Similar to the proof of Uniqueness, using MVT, \(p_{n}, p \in [a, b]\) implies that

    \begin{align*} |p_{n} - p| = |g(p_{n-1}) - g(p)| \leq |g'(\xi)| |p_{n-1} - p| \leq k |p_{n-1} - p|, \end{align*}

    for all \(n \geq 1\). Thus, \(|p_{n} - p| \leq k ^{n} |p_0 - p|\). Therefore, \(\lim_{n \to \infty} p_{n} = p\).

Since we do not know \(p\) in advance, this theorem cannot tell us the convergence rate directly. However, the following theorem is capable of doing this using \(p_1\) and \(p_0\).

If \(g\) satisfies the hypotheses of the above theorem, then we also have

\begin{align*} |p_{n} - p| \leq \frac{k ^{n}}{1-k} |p_1 - p_0|. \end{align*}

Similarly, we have \(|p_{n+1} - p_{n}| \leq k |p_{n} - p_{n-1}| \leq k ^{n} |p_1 - p_0|\). Then, for any \(m > n \geq 1\),

\begin{align*} |p_{m} - p_{n}| &\leq |p_{m} - p_{m-1}| + \dots + |p_{n+1} - p_{n}|\\ &\leq (k ^{m-1} + \dots + k ^{n}) |p_1 - p_0|. \end{align*}

Both sides converge as \(m \to + \infty\) (the right-hand side is a geometric sum bounded by \(\frac{k ^{n}}{1-k} |p_1 - p_0|\)), and since limits preserve non-strict inequalities, \(|p-p_{n}| \leq \frac{k ^{n}}{1-k} |p_1 - p_0|\).

If \(g\) satisfies the hypotheses of the above theorem, to achieve \(|p_{n} - p| < \epsilon\), the sufficient condition is \(n > \log _{k} \frac{(1-k) \epsilon}{|p_1 - p_0|}\).

2.3. Newton’s Method

Prerequisite: Fixed-Point Problem.

Suppose \(f \in C ^2[a, b]\). Let \(p_0 \in [a, b]\) be an approximation to \(p\) such that \(f'(p_0) \neq 0\) and \(|p - p_0|\) is small.

By the first-order Taylor expansion and remainder about \(p_0\) evaluated at \(p\) \[ f(p) = f(p_0) + (p - p_0) f'(p_0) + \frac{f''(\xi(p))}{2} (p - p_0) ^2, \] where \(\xi(p)\) is between \(p\) and \(p_0\).

Since \(f(p) = 0\) and \(p - p_0\) is small, \[ 0 \approx f(p_0) + (p - p_0) f'(p_0). \] Solving for \(p\) gives \[ p \approx p_0 - \frac{f(p_0)}{f'(p_0)}. \]

Given an initial approximation \(p_0\), Newton’s method generates a sequence \[ p_{n} = p_{n-1} - \frac{f(p_{n-1})}{f'(p_{n-1})}. \]
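
A minimal Python sketch of this iteration; the derivative is assumed to be available as a separate callable fprime (an assumption of this sketch, not of the notes):

  def newton(f, fprime, p0, eps=1e-10, max_iter=50):
      p = p0
      for _ in range(max_iter):
          p_next = p - f(p) / fprime(p)
          if abs(p_next - p) < eps:
              return p_next
          p = p_next
      raise RuntimeError("no convergence within max_iter iterations")

  # Example: newton(lambda x: x**3 + 4*x**2 - 10, lambda x: 3*x**2 + 8*x, 1.5)
  # converges to the same zero ~ 1.36523 as the Bisection example, in far fewer steps.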

From the derivation, the convergence of Newton’s method requires that \(p_0\) is close to \(p\) to get an accurate Taylor expansion.

Suppose:

  • \(f \in C ^2[a, b]\),
  • \(p \in (a, b)\) satisfies that \(f(p) = 0\) and \(f'(p) \neq 0\).

Then there exists a \(\delta > 0\) such that Newton’s method generates a sequence \(\{ p_{n} \}_{n=1}^{\infty}\) converging to \(p\) for any initial approximation \(p_0 \in [p - \delta, p + \delta]\).

Newton’s method can be written in the form of Fixed-Point Iteration: Let \[ g(x) = x - \frac{f(x)}{f'(x)}. \] Then the sequence Newton’s method generates is identical to the sequence Fixed-point iteration generates: \[ p_{n} = g(p_{n-1}). \]

The proof is based on the analysis of \(g\). It suffices to show the two conditions for Fixed-Point Theorem.

  • Firstly, we show that \(\exists \delta > 0, k \in (0, 1)\) such that \(g'(x)\) exists on \([p-\delta, p+\delta]\) with \(|g'(x)| \leq k\):
    1. \(g'\) is well-defined and continuous: Since \(f'\) is continuous and \(f'(p) \neq 0\), there exists a \(\delta_{1} > 0\) such that \(f'(x) \neq 0\) for \(x \in (p - \delta_{1}, p + \delta_{1})\). Thus, \(g\) is defined and continuous in this interval and \[ g'(x) = \frac{f(x) f''(x)}{[f'(x)] ^2} \] is also continuous in this interval.
    2. \(|g'(x)| \leq k < 1\): Since \(f(p) = 0\), \(g'(p) = 0\). For any \(0 < k < 1\), by the continuity of \(g'\), there exists a \(\delta \in (0, \delta_{1})\) such that \(|g'(x)| \leq k\) in \([p - \delta, p + \delta]\).
  • Then, we show that \(g\) maps \([p - \delta, p + \delta]\) to itself: By MVT, for any \(x \in [p - \delta, p + \delta]\), there exists a \(\xi\) between \(x\) and \(p\) such that

    \begin{align*} |g(x) - p| = |g(x) - g(p)| = |g'(\xi)| |x-p| \leq k |x-p| < |x-p| \leq \delta. \end{align*}

If \(f(x) \in C ^2(I)\), \(f'(x) \neq 0\) in \(I\), and the initial guess \(p_0\) is sufficiently close to the zero \(p\), then there exists \(\xi_{n}\) between \(p_{n}\) and \(p\) such that \[ \frac{|p_{n+1} - p|}{|p_{n} - p| ^2} = \frac{|g''(\xi_{n})|}{2}. \]

\(p_{n+1} - p = g(p_{n}) - g(p)\). We know from the last proof that \(g(x) \in C ^{2}(I)\). Using the Lagrange form of the Taylor expansion of \(g(p)\): \[ g(x) = g(p) + g'(p) (x - p) + \frac{g''(\xi_{x})}{2} (x - p) ^2 = g(p) + 0 + \frac{g''(\xi_{x})}{2} (x - p) ^2.\]

Plugging \(x = p_{n}\) into the equation, we have \[ \frac{|p_{n+1} - p|}{|p_{n} - p| ^2} = \frac{|g(p_{n}) - g(p)|}{|p_{n} - p| ^2} = |g''(\xi_{n})| / 2. \]

2.4. Secant Method

Recall that Newton’s method requires the value of \(f'\), which is often more expensive to compute than \(f\) itself.

When \(p_{n-2}\) is close to \(p_{n-1}\), we have \[ f'(p_{n-1}) \approx \frac{f(p_{n-2}) - f(p_{n-1})}{p_{n-2} - p_{n-1}}. \] Replacing the \(f'(p_{n-1})\) in Newton’s method by this approximation gives the following:

Given two approximations \(p_0\) and \(p_1\), the Secant Method generates a sequence \[ p_{n} = p_{n-1} - f(p_{n-1})\frac{(p_{n-1} - p_{n-2})}{f(p_{n-1}) - f(p_{n-2})}. \] Geometrically, \(p_n\) is the \(x\)-intercept of the line joining \((p_{n-1}, f(p_{n-1}))\) and \((p_{n-2}, f(p_{n-2}))\).

secant.svg
Figure 2: Secant Method (from wiki)
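
A minimal Python sketch of the Secant method as defined above (names and tolerance are my own):

  def secant(f, p0, p1, eps=1e-10, max_iter=100):
      for _ in range(max_iter):
          f0, f1 = f(p0), f(p1)
          p2 = p1 - f1 * (p1 - p0) / (f1 - f0)
          if abs(p2 - p1) < eps:
              return p2
          p0, p1 = p1, p2
      raise RuntimeError("no convergence within max_iter iterations")

  # secant(lambda x: x**3 + 4*x**2 - 10, 1.0, 2.0)  ->  approximately 1.36523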

Given a function \(f\) on the interval \([p_0, p_1]\) where \(f(p_0) \cdot f(p_1) < 0\), the Method of False Position (Regula Falsi) generates a sequence as follows:

  • Let \(a = p_0, b = p_1\)
  • For any \(n > 1\), compute \(p_{n} = p_{n-1} - \frac{f(b)(b - a)}{f(b) - f(a)}\).
    • If \(f(p_{n}) \cdot f(b) < 0\), then \(p_{n}\) and \(b\) bracket the zero. Update \(a = b\) and \(b = p_{n}\).
    • If \(f(p_{n}) \cdot f(b) > 0\), then \(p_{n}\) and \(a\) bracket the zero. Update \(b = p_{n}\).

In this case, the zero always lies between \(a\) and \(b\) in each step.

regula_falsib.svg
Figure 3: Regula Falsi (from wiki)

3. Interpolation and Polynomial Approximation

Suppose that \(f\) is defined and continuous on \([a, b]\). For each \(\epsilon > 0\), there exists a polynomial \(P(x)\) such that \[ \forall x \in [a, b], |f(x) - P(x)| < \epsilon. \]

3.1. The Lagrange Polynomial

We first construct, for each \(k = 0, 1, \ldots, n\), a function \(L_{n,k}(x)\) with the property that \(L_{n,k}(x_i) = 0\) when \(i \neq k\) and \(L_{n,k}(x_k) = 1\).

To satisfy \(L_{n,k}(x_i) = 0\) for each \(i \neq k\), it requires that the numerator of \(L_{n,k}(x)\) contain the term \((x - x_0)(x - x_1)\ldots(x - x_{k-1})(x - x_{k+1})\ldots(x - x_n)\).

To satisfy \(L_{n,k}(x_k) = 1\), the denominator of \(L_{n,k}(x)\) must be this same term but evaluated at \(x = x_k\), i.e., \((x_k - x_0)(x_k - x_1)\ldots(x_k - x_{k-1})(x_k - x_{k+1})\ldots(x_k - x_n)\).

Suppose \(x_0, x_1, \ldots, x_n\) are \(n+1\) distinct numbers, and \(f\) is a function whose values are given at these numbers. Then there exists a unique polynomial \(P_{n}(x)\) of degree at most \(n\) with \[f(x_k) = P_{n}(x_k), \text{ for each } k = 0, 1, \ldots, n.\] \(P_{n}(x)\) is given by the Lagrange polynomial for \(f\), defined as: \[P_{n}(x) = f(x_0)L_{n,0}(x) + \ldots + f(x_n)L_{n,n}(x) = \sum_{k=0}^{n} f(x_k)L_{n,k}(x),\] where \(L_{n,k}(x)\) is defined as \[ L_{n,k}(x) = \frac{(x - x_0)\ldots(x - x_{k-1})(x - x_{k+1})\ldots(x - x_n)}{(x_k - x_0)\ldots(x_k - x_{k-1})(x_k - x_{k+1})\ldots(x_k - x_n)} = \prod_{i=0, i \neq k}^{n} \frac{x-x_{i}}{x_{k}-x_{i}}. \]

The existence is clear from the construction: \(L_{n,k}(x_{k}) = 1\) and \(L_{n, k}(x_{i}) = 0\) for \(i \neq k\). Hence, \(P_{n}(x_{i}) = f(x_{i})\).

For the uniqueness, if two different polynomials \(p, q\) of degree at most \(n\) interpolate the \(n+1\) points, then \(r = p - q\) has (at least) \(n+1\) zeros and is not constantly zero. However, \(r\) is also of degree at most \(n\), so it has at most \(n\) zeros, which is a contradiction.

Remark:

  • Sometimes we denote the Lagrange polynomial determined by \(x_0, \dots , x_{n}\) by \(P_{0, \dots , n}\).
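
A minimal Python sketch that evaluates \(P_n(x)\) directly from the formula above (no attempt is made at numerical efficiency; the data in the comment are only an example):

  def lagrange_eval(xs, ys, x):
      """Evaluate P_n(x) for nodes xs = [x_0, ..., x_n] and values ys[i] = f(x_i)."""
      total = 0.0
      for k, (xk, yk) in enumerate(zip(xs, ys)):
          Lk = 1.0
          for i, xi in enumerate(xs):
              if i != k:
                  Lk *= (x - xi) / (xk - xi)   # factor of L_{n,k}(x)
          total += yk * Lk
      return total

  # Interpolate f(x) = 1/x at x = 2, 2.75, 4 and evaluate at x = 3:
  # lagrange_eval([2, 2.75, 4], [1/2, 1/2.75, 1/4], 3)  ->  ~0.32955 (f(3) = 0.3333...)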

Suppose \(f \in C ^{n}[a, b]\). If \(f(x) = 0\) at the \(n + 1\) distinct numbers \(a \leq x_0 < x_1 < \ldots < x_n \leq b\), then there exists a number \(c\) in \((x_0, x_n) \subseteq (a, b)\) with \(f^{(n)}(c) = 0\).

Suppose \(x_0, \ldots, x_n\) are distinct numbers in the interval \([a, b]\) and \(f \in C^{n+1}[a, b]\). Let \(P_{n}(x)\) be the \(n\)-th order Lagrange interpolating polynomial. Then, for any \(x\) in \([a, b]\), there exists a number \(\xi(x)\) between \(x_0, x_1, \ldots, x_n\) (and hence in \((a, b)\)) such that the error term is \[ f(x) - P_{n}(x) = \frac{f^{(n+1)}(\xi(x))}{(n + 1)!}(x - x_0)(x - x_1) \ldots (x - x_n). \]

  • If \(x = x_{k}\) for any \(k = 0, \dots , n\), then the theorem holds for any choice of \(\xi(x_{k})\).
  • If \(x \neq x_{k}\) for all \(k = 0, \dots , n\): Fix \(x\). Define \(g\) on \([a, b]\) \[ g(t) = f(t) - P_{n}(t) - [f(x) - P_{n}(x)] \prod_{i=0}^{n} \frac{t - x_{i}}{x - x_{i}} \] Notice that \(g(x_{j}) = f(x_{j}) - P_{n}(x_{j}) - 0 = 0\) and \(g(x) = f(x) - P_{n}(x) - [f(x) - P(x)] = 0\). Thus, \(g\) has \(n+2\) distinct zeros \(x, x_0, \dots , x_{n}\) in \([a, b]\). Since \(f \in C ^{n+1}[a, b]\) and \(P \in C ^{\infty} [a, b]\), \(g \in C ^{n+1}[a, b]\). By Generalized Rolle’s Theorem, there exists a number \(\xi\) in \((a, b)\) for which \(g ^{(n+1)}(\xi) = 0\). Then,

    \begin{align*} 0 &= f ^{(n+1)}(\xi) - P ^{(n+1)}(\xi) - [f(x) - P(x)] \frac{d ^{n+1}}{d t ^{n+1}} \left[ \prod_{i=0}^{n} \frac{t - x_{i}}{x - x_{i}} \right]\Big|_{t = \xi}\\ &= f ^{(n+1)}(\xi) - 0 - [f(x) - P(x)] \frac{(n+1)!}{\prod_{i=0}^{n} (x - x_{i})} \end{align*}

    This implies that \(f(x) = P(x) + \frac{f ^{(n+1)}(\xi)}{(n+1)!}\prod_{i=0}^{n} (x - x_{i})\).

Method: Upper bounding the error

\begin{align*} |f(x) - P_{n}(x)| \leq \frac{1}{(n+1)!} \max_{[a, b]}|f ^{(n+1)}(\xi(x))| \max_{[a, b]} \left| \prod_{i=0}^{n} (x - x_{i}) \right|. \end{align*}

3.2. Divided Differences

Let \(f\) be a function and \(x_0, x_1, \dots , x_{n}\) be distinct numbers. The divided differences of \(f\) are recursively defined as follows:

  • The zeroth divided difference of \(f\) at each \(x_{i}\) is \(f[x_{i}] = f(x_{i})\).
  • For \(1 \leq k \leq n\), the \(k\)-th divided difference w.r.t. \(x_{i}, x_{i+1}, \dots , x_{i+k}\) for \(0 \leq i \leq n-k\) is \[ f[x_{i}, \dots , x_{i+k}] = \frac{f[x_{i+1}, \dots , x_{i+k}] - f[x_{i}, \dots , x_{i+k-1}]}{x_{i+k} - x_{i}}. \]

The \(n\)-th order Newton’s Divided Difference interpolating polynomial of \(f\) is defined as \[ P_{n}(x) = f[x_0] + \sum_{k=1}^{n} f[x_0, x_1, \dots , x_{k}] (x-x_0) \cdots (x- x_{k-1}). \]

By the uniqueness of the interpolating polynomial, this formula gives the same result as the Lagrange interpolating polynomial. However, this method excels in its extensibility. When adding a new point \(x_{n+1}\), there is no need to re-calculate the parts related to the previous points; only the additional term \(f[x_0, x_1, \dots , x_{n+1}](x-x_0)\cdots (x-x_{n})\) needs to be computed.
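
A minimal Python sketch of the divided-difference table and the nested (Horner-like) evaluation of \(P_n(x)\); by uniqueness it reproduces the Lagrange example above:

  def newton_divdiff_coeffs(xs, ys):
      """Return f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n] (table built in place)."""
      coef = list(ys)
      n = len(xs)
      for j in range(1, n):
          for i in range(n - 1, j - 1, -1):
              coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
      return coef

  def newton_eval(xs, coef, x):
      """Evaluate P_n(x) = f[x_0] + sum_k f[x_0..x_k](x-x_0)...(x-x_{k-1}) by nesting."""
      result = coef[-1]
      for k in range(len(coef) - 2, -1, -1):
          result = result * (x - xs[k]) + coef[k]
      return result

  # xs = [2, 2.75, 4]; coef = newton_divdiff_coeffs(xs, [1/2, 1/2.75, 1/4])
  # newton_eval(xs, coef, 3)  ->  ~0.32955, identical to the Lagrange form.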

4. Numerical Differentiation

4.1. Forward and Backward Difference

We want to approximate \(f'(x_0)\).

Suppose \(f \in C ^2[a, b]\) and \(x_0 \in (a, b)\). Let \(h \neq 0\) be a sufficiently small value such that \(x_1 = x_0 + h \in [a, b]\).

By Error Term 1 of the Lagrange polynomial determined by \(x_0\) and \(x_1\),

\begin{align*} f(x) &= P_{0,1}(x) + \frac{(x-x_0)(x-x_1)}{2!} f''(\xi(x))\\ &= f(x_0) \frac{x - (x_0+h)}{-h} + f(x_0+h) \frac{x-x_0}{h} + \frac{(x-x_0)(x-x_1)}{2} f''(\xi(x)), \end{align*}

for some \(\xi(x)\) between \(x_0\) and \(x_1\).

Differentiating both sides gives

\begin{align*} f'(x) &= \frac{f(x_0+h) - f(x_0)}{h} + \frac{d}{d x} \frac{(x-x_0)(x-x_1)}{2!} f''(\xi(x))\\ &= \frac{f(x_0+h) - f(x_0)}{h} + \frac{2(x-x_0)-h}{2} f''(\xi(x)) + \frac{(x-x_0)(x-x_1)}{2} \frac{d}{d x} f''(\xi(x)). \end{align*}

When \(x = x_0\), the coefficient of \(\frac{d}{d x} f''(\xi(x))\) is zero, and the formula simplifies to:

Let \(f \in C ^2\) and \(h\) be a small value. Then \[ f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f ''(\xi), \] for some \(\xi\) between \(x_0\) and \(x_0 + h\).

When \(h > 0\), the approximation \(f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h}\) is called the forward-difference formula. When \(h < 0\), it is called the backward-difference formula.

4.2. General Derivative Approximation

The above approximation can be generalized to multiple points.

Suppose \(I\) is an interval, \(f \in C ^{n+1}(I)\), and \(\{ x_0, \dots , x_{n} \}\) are \((n+1)\) distinct numbers in \(I\). Using the Lagrange polynomial determined by these points, we have \[ f(x) = \sum_{k=0}^{n} f(x_{k}) L_{k}(x) + \frac{(x-x_0) \cdots (x-x_{n})}{(n+1)!} f ^{(n+1)}(\xi(x)) \] for some \(\xi(x)\) in \(I\). Differentiating both sides gives \[ f'(x) = \sum_{k=0}^{n} f(x_{k}) L_{k}'(x) + \frac{d}{dx} \left[ \frac{(x-x_0) \cdots (x-x_{n})}{(n+1)!} \right] f ^{(n+1)}(\xi(x)) + \frac{(x-x_0) \cdots (x-x_{n})}{(n+1)!} \frac{d}{dx} f ^{(n+1)}(\xi(x)). \] For \(x = x_{j}\), \[ f'(x_{j}) = \sum_{k=0}^{n} f(x_{k}) L_{k}'(x_{j}) + \frac{d}{dx} \left[ \frac{(x-x_0) \cdots (x-x_{n})}{(n+1)!} \right]\Big|_{x = x_{j}} f ^{(n+1)}(\xi(x_{j})), \] which simplifies to:

Let \(f \in C ^{n+1}(I)\) and \(\{ x_0, \dots , x_{n} \}\) are \((n+1)\) distinct numbers in \(I\). Then \[ f'(x_{j}) = \sum_{k=0}^{n} f(x_{k}) L_{k}'(x_{j}) + \frac{f ^{(n+1)}(\xi(x_{j}))}{(n+1)!} \prod_{k=0, k \neq j}^{n} (x_{j} - x_{k}), \] where \(L_{k}(x)\) is the \(k\)-th Lagrange coefficient polynomial at \(x_0, \dots , x_{n}\).

4.3. Useful Three-Points Formulas

Then the 3-point (\(n=2\)) formula is

\begin{align*} f'(x_{j}) &= f(x_0) L_0'(x_{j}) + f(x_1) L_1'(x_{j}) + f(x_2) L_2'(x_{j})\\ & + \frac{1}{6} f ^{(3)} (\xi_{j}) \prod_{k=0, k \neq j}^{2} (x_{j} - x_{k}), \end{align*}

where \[ L_0'(x) = \frac{2x - x_1 - x_2}{(x_0 - x_1)(x_0 - x_2)}, L_1'(x) = \frac{2x - x_0 - x_2}{(x_1 - x_0)(x_1 - x_2)}, L_2'(x) = \frac{2x - x_0 - x_1}{(x_2 - x_0)(x_2 - x_1)}, \] and \(\xi_{j}\) indicate that the values of \(\xi\) may be different for different \(j\).

Consider three equally spaced points:

Endpoint formula
With \(x_0, x_0 + h, x_0 + 2h\), the formula for \(f'(x_0)\) simplifies to \[ f'(x_0) = \frac{1}{2h} [-3 f(x_0) + 4 f(x_0 + h) - f(x_0 + 2h)] + \frac{h ^2}{3} f ^{(3)} (\xi_{0}), \quad \xi_{0} \in (x_0, x_0 + 2h). \]
  • With \(x_0 - 2h, x_0 - h, x_0\), the formula for \(f'(x_0)\) is the same as this one except that \(h\) is replaced by \(-h\).
Midpoint formula
With \(x_0 - h, x_0, x_0 + h\), the formula for \(f'(x_0)\) simplifies to \[ f'(x_0) = \frac{1}{2h} [f(x_0 + h) - f(x_0 - h)] - \frac{h ^2}{6} f ^{(3)}(\xi_{0}), \quad \xi_{0} \in (x_0 - h, x_0 + h). \]
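
A minimal Python sketch of the two equally spaced three-point formulas (h is assumed small, but not so small that round-off dominates):

  def endpoint_3pt(f, x0, h):
      """Three-point endpoint formula for f'(x0); error O(h^2)."""
      return (-3*f(x0) + 4*f(x0 + h) - f(x0 + 2*h)) / (2*h)

  def midpoint_3pt(f, x0, h):
      """Three-point midpoint formula for f'(x0); error O(h^2)."""
      return (f(x0 + h) - f(x0 - h)) / (2*h)

  # import math
  # midpoint_3pt(math.exp, 0.0, 1e-3)  ->  ~1.00000017 (exact value f'(0) = 1)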

4.4. Approximations to Higher Derivatives

We want to derive the approximation to \(f''(x_0)\). Consider the Taylor expansion about \(x_0\) evaluated at \(x_0 \pm h\): \[ f(x_0 + h) = f(x_0) + f'(x_0) h + \frac{1}{2} f''(x_0) h ^2 + \frac{1}{6} f'''(x_0) h ^3 + \frac{1}{24} f ^{(4)}(\xi_{1}) h ^{4} \] and \[ f(x_0 - h) = f(x_0) - f'(x_0) h + \frac{1}{2} f''(x_0) h ^2 - \frac{1}{6} f'''(x_0) h ^3 + \frac{1}{24} f ^{(4)}(\xi_{-1}) h ^{4} \] for some \(\xi_{1} \in (x_0, x_0 + h)\) and \(\xi_{-1} \in (x_0 - h, x_0)\). Adding them up, we get: \[ f''(x_0) = \frac{1}{h ^2} [f(x_0-h) - 2 f(x_0) + f(x_0 + h)] - \frac{h ^2}{24} [ f ^{(4)} (\xi_{1}) + f ^{(4)}(\xi_{-1})]. \] Suppose \(f ^{(4)} \in C[x_0 - h, x_0 + h]\). Then, by the IVT, \([ f ^{(4)} (\xi_{1}) + f ^{(4)}(\xi_{-1})] = 2 f ^{(4)}(\xi)\) for some \(\xi \in (x_0 - h, x_0 + h)\). Thus:

Suppose \(f ^{(4)} \in C[x_0 - h, x_0 + h]\). Then \[ f''(x_0) = \frac{1}{h ^2} [f(x_0-h) - 2 f(x_0) + f(x_0 + h)] - \frac{h ^2}{12} f ^{(4)}(\xi) \] for some \(\xi \in (x_0 - h, x_0 + h)\).
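
A corresponding one-line sketch of the second-derivative midpoint formula:

  def second_derivative_midpoint(f, x0, h):
      """Second-derivative midpoint formula for f''(x0); error O(h^2)."""
      return (f(x0 - h) - 2*f(x0) + f(x0 + h)) / h**2

  # import math
  # second_derivative_midpoint(math.exp, 0.0, 1e-3)  ->  close to the exact value 1.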

5. Numerical Integration

5.1. Quadrature Rule Framework

Numerical quadrature: The basic method of approximating \(\int_{a}^{b} f(x) dx\) by formulas of the form: \[ \sum_{i=1}^{n} a_{i} f(x_{i}). \]

E.g., approximate by Lagrange polynomial \(P_{n}(x) = \sum_{i=0}^{n} f(x_{i}) L_{i}(x)\):

\begin{align*} \int_{a}^{b} f(x) dx &= \int_{a}^{b} \left[ P_{n}(x) + \frac{f ^{(n+1)}(\xi(x))}{(n+1)!} \prod_{i=0}^{n} (x-x_{i}) \right] dx\\ &= \int_{a}^{b} \sum_{i=0}^{n} f(x_{i}) L_{i}(x) dx + \frac{1}{(n+1)!} \int_{a}^{b} f ^{(n+1)}(\xi(x)) \prod_{i=0}^{n} (x-x_{i}) dx\\ &= \sum_{i=0}^{n} f(x_{i}) \int_{a}^{b} L_{i}(x) dx + \text{Error}. \end{align*}

In this case, \(a_{i} = \int_{a}^{b} L_{i}(x) dx\).

5.1.1. Trapezoidal Rule

Suppose:

  • \(f \in C[a, b]\)
  • \(g\) is Riemann integrable on \([a, b]\) and \(g(x)\) does not change sign on \([a, b]\).

Then, there exists a \(c \in (a, b)\) such that \[ \int_{a}^{b} f(x) g(x) d x = f(c) \int_{a}^{b} g(x) dx. \]

2 points + \(P_1\).

Choose \(x_0 = a, x_1 = b\). Let \(h = b - a\). Let \(P_1(x)\) be the Lagrange polynomial determined by \(x_0, x_1\); then the area under \(P_1(x)\) forms a trapezoid:

trapezoidal_rule.svg
Figure 4: From wiki

Thus, the approximation is \(\int_{a}^{b} P_1(x) dx = \frac{h}{2} [f(x_0) + f(x_1)]\).

Error:

\begin{align*} \int_{a}^{b} f(x) - P_1(x) dx &= \frac{1}{2} \int_{a}^{b} f''(\xi(x)) (x-x_0)(x-x_1) dx\\ &= \frac{1}{2} f''(\xi) \int_{a}^{b} (x-x_0)(x-x_1) dx, \xi \in (a, b) \tag{WMVT}\\ &= \frac{1}{2} f''(\xi) \left[ \frac{x ^3}{3} - \frac{x_0+x_1}{2} x ^2 + x_0 x_1 x \right] \Big|_{x_0} ^{x_1}\\ &= - \frac{h ^3}{12} f''(\xi). \end{align*}

Suppose \(f \in C ^2[a, b]\), letting \(h = b-a\), then \[ \int_{a}^{b} f(x) dx = \frac{h}{2} [f(a) + f(b)] - \frac{h ^3}{12} f''(\xi), \xi \in (a, b). \]

5.1.2. Simpson’s Rule

Choose \(x_0 = a, x_2 = b, x_1 = \frac{a+b}{2}\). Let \(h = \frac{b-a}{2}\).

Approximation by the 3rd order Taylor expansion about the midpoint \(x_1\):

\begin{align*} \int_{a}^{b} T_3(x) dx &= \int_{a}^{b} f(x_1) + f'(x_1)(x - x_1) + \frac{f''(x_1)}{2} (x-x_1) ^2 + \frac{f ^{(3)}(x_1)}{6} (x - x_1) ^3 dx\\ &= \left[ f(x_1) x + \frac{f'(x_1)}{2} (x-x_1) ^2 + \frac{f''(x_1)}{6} (x-x_1) ^3 + \frac{f ^{(3)}(x_1)}{24} (x-x_1) ^{4} \right] \Big|_{a} ^{b}\\ &= 2 h f(x_1) + \frac{h ^3}{3} f''(x_1). \end{align*}

Further replacing \(f''(x_1)\) using the 2nd derivative midpoint formula, we have

\begin{align*} \int_{a}^{b} T_3(x) dx &= 2 h f(x_1) + \frac{h ^3}{3} \left[ \frac{1}{h ^2} [f(x_0) - 2 f(x_1) + f(x_2)] - \frac{h ^2}{12} f ^{(4)}(\xi_{1}) \right], \quad \xi_{1} \in (a, b)\\ &= \frac{h}{3} [f(x_0) + 4 f(x_1) + f(x_2)] - \frac{h ^5}{36} f ^{(4)}(\xi_{1}). \end{align*}

Error:

\begin{align*} \int_{a}^{b} f(x) - T_3(x) dx &= \int_{a}^{b} \frac{f ^{(4)}(\xi(x))}{24} (x-x_1) ^{4} dx\\ &= \frac{1}{24} f ^{(4)}(\xi_{2}) \int_{a}^{b} (x - x_1) ^{4} dx, \quad \xi_{2} \in (a, b) \tag{WMVT}\\ &= \frac{h ^{5}}{60} f ^{(4)}(\xi_{2}). \end{align*}

It can be shown that the sum of the two error terms equals \(- \frac{h ^{5}}{90} f ^{(4)}(\xi)\) for some \(\xi \in (a, b)\).

Suppose \(f \in C ^{4}[a, b]\), then \[ \int_{a}^{b} f(x) d x = \frac{h}{3} [f(x_0) + 4 f(x_1) + f(x_2)] - \frac{h ^{5}}{90} f ^{(4)}(\xi), \xi \in (a, b). \]

5.1.3. Midpoint Rule

Choose \(x_0 = a, x_2 = b, x_1 = \frac{a+b}{2}\). Let \(h = \frac{b-a}{2}\).

Approximate by the 1st order Taylor expansion about the midpoint \(x_1\):

\begin{align*} \int_{a}^{b} f(x) dx &= \int_{a}^{b} f(x_1) + f'(x_1) (x - x_1) + \frac{f''(\xi(x))}{2} (x-x_1) ^2 dx\\ &= \left[ f(x_1) x + \frac{f'(x_1)}{2} (x-x_1) ^2 \right] \Big|_{a} ^{b} + \frac{1}{2} \int_{a}^{b} f''(\xi(x)) (x-x_1) ^2 dx\\ &= 2 h f(x_1) + \frac{h ^3}{3} f''(\xi), \xi \in (a, b) \tag{WMVT}. \end{align*}

Suppose \(f \in C ^2[a, b]\), then \[ \int_{a}^{b} f(x) dx = 2 h f(x_1) + \frac{h ^3}{3} f''(\xi), \xi \in (a, b). \]

5.2. Composite Rules

Partition \([a, b]\) into subintervals and apply a quadrature rule on each subinterval.

5.2.1. Composite Simpson’s Rule

Let \(n\) be an even integer, \(h = \frac{b - a}{n}\), \(x_{j} = a + jh\) for \(j = 0, \dots , n\).

Every two subintervals are considered as one subinterval.

\begin{align*} \int_{a}^{b} f dx &= \sum_{j=1}^{n /2} \int_{x_{2(j-1)}}^{x_{2j}} f dx\\ &= \sum_{j=1}^{n /2} \left[ \frac{h}{3} [f(x_{2j-2}) + 4f(x_{2j-1}) + f(x_{2j})] - \frac{h ^{5}}{90} f ^{(4)}(\xi_{j}) \right]\\ &= \frac{h}{3} \left[ f(x_0) + f(x_{n}) + 2 \sum_{j=1}^{n /2 - 1} f(x_{2j}) + 4 \sum_{j=1}^{n /2} f(x_{2j-1}) \right] - \frac{h ^{5}}{90} \sum_{j=1}^{n / 2} f ^{(4)} (\xi_{j}). \end{align*}

\(f ^{(4)}\) reaches its min & max on \([a, b]\) by \(f \in C ^{4}[a, b]\) (Extreme Value Theorem). Since \(\sum_{j=1}^{n / 2} f ^{(4)} (\xi_{j}) \in [\frac{n}{2} \min, \frac{n}{2} \max]\), by IVT, \(\exists \mu \in (a, b)\) such that it \(= \frac{n}{2} f ^{(4)}(\mu)\). Thus, the error is \[ -\frac{h ^5}{90} \frac{1}{2} \frac{b-a}{h} f ^{(4)}(\mu) = - \frac{b-a}{180} h ^{4} f ^{(4)}(\mu). \]

Suppose \(f \in C ^{4}[a, b]\), then \[ \int_{a}^{b} f(x) dx = \frac{h}{3} \left[ f(x_0) + f(x_{n}) + 2 \sum_{j=1}^{n /2 - 1} f(x_{2j}) + 4 \sum_{j=1}^{n /2} f(x_{2j-1}) \right] - \frac{b-a}{180} h ^{4} f ^{(4)}(\mu), \mu \in (a, b). \]

5.2.2. Composite Trapezoidal Rule

Let \(n\) be an integer, \(h = \frac{b - a}{n}\), \(x_{j} = a + jh\) for \(j = 0, \dots , n\).

Similarly,

\begin{align*} \int_{a}^{b} f dx &= \sum_{j=1}^{n} \int_{x_{j-1}}^{x_{j}} f dx\\ &= \sum_{j=1}^{n} \left[ \frac{h}{2} [f(x_{j-1}) + f(x_{j})] - \frac{h ^3}{12} f ''(\xi_{j}) \right]\\ &= \frac{h}{2} \left[ f(x_0) + f(x_{n}) + 2 \sum_{j=1}^{n-1} f(x_{j}) \right] - \frac{h ^3}{12} \sum_{j=1}^{n} f''(\xi_{j})\\ &= \frac{h}{2} \left[ f(x_0) + f(x_{n}) + 2 \sum_{j=1}^{n-1} f(x_{j}) \right] - \frac{b-a}{12} h ^2 f''(\mu), \mu \in (a, b). \end{align*}

Let \(f \in C ^{2}[a, b]\), then \[ \int_{a}^{b} f(x) dx = \frac{h}{2} \left[ f(x_0) + f(x_{n}) + 2 \sum_{j=1}^{n-1} f(x_{j}) \right] - \frac{b-a}{12} h ^2 f''(\mu), \mu \in (a, b). \]
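
A minimal Python sketch of the composite Trapezoidal and Simpson's rules on the uniform mesh \(x_j = a + jh\), \(h = (b-a)/n\) (n must be even for Simpson's rule):

  def composite_trapezoid(f, a, b, n):
      h = (b - a) / n
      s = f(a) + f(b) + 2 * sum(f(a + j*h) for j in range(1, n))
      return h / 2 * s

  def composite_simpson(f, a, b, n):
      if n % 2:
          raise ValueError("n must be even")
      h = (b - a) / n
      s = f(a) + f(b)
      s += 4 * sum(f(a + j*h) for j in range(1, n, 2))   # odd-indexed (midpoint) nodes
      s += 2 * sum(f(a + j*h) for j in range(2, n, 2))   # interior even-indexed nodes
      return h / 3 * s

  # import math
  # composite_trapezoid(math.sin, 0, math.pi, 10)  ->  ~1.9835   (exact value 2)
  # composite_simpson(math.sin, 0, math.pi, 10)    ->  ~2.0001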

5.2.3. Composite Midpoint Rule

Let \(n\) be an even integer,

\begin{align*} \int_{a}^{b} f dx &= \sum_{j=1}^{n /2} \int_{x_{2j-2}}^{x_{2j}} f(x) dx\\ &= \sum_{j=1}^{n /2} \left[ 2h f(x_{2j-1}) + \frac{h ^3}{3} f''(\xi_{j}) \right]. \end{align*}

Let \(f \in C ^2[a, b]\), \(n\) be an even integer, then \[ \int_{a}^{b} f(x) dx = 2h \sum_{j=1}^{n /2} f(x_{2j-1}) + \frac{b-a}{6} h ^2 f''(\mu), \mu \in (a, b) \]

6. Initial-Value Problems for ODEs

Let \(y\) be a function in \(t\). An initial value problem is a differential equation \[ y'(t) = f(t, y(t)), a \leq t \leq b \] together with an initial condition \(y(a) = y_0\). A solution to this problem is a function \(y\) that satisfies both conditions.

Here, we just want to approximate \(y\) at some values of \(t\), called mesh points, in \([a, b]\), instead of finding a continuous approximation to the function.

6.1. Euler’s Method

We stipulate that the mesh points are equally spaced throughout the interval \([a, b]\): \(t_{i} = a + ih\) for \(i = 0, \dots , N\), where \(h = (b - a) / N\).

Suppose \(y(t) \in C ^2[a, b]\). For each \(i = 0, \dots , N-1\), we approximate \(y(t_{i+1})\) with the Taylor expansion at \(t_{i}\):

\begin{align*} y(t_{i+1}) &= y(t_{i}) + (t_{i+1} - t_{i}) y'(t_{i}) + \frac{(t_{i+1} - t_{i}) ^2}{2} y ''(\xi_{i})\\ &= y(t_{i}) + h y'(t_{i}) + \frac{h ^2}{2} y ''(\xi_{i})\\ &= y(t_{i}) + h f(t_{i}, y(t_{i})) + \frac{h ^2}{2} y ''(\xi_{i}) \end{align*}

for some \(\xi_{i} \in (t_{i}, t_{i+1})\). Ignoring the remainder term, we have:

Suppose \(y \in C ^2[a, b]\). Given the initial value problem and a step size \(h = (b-a) / N\), Euler’s method constructs \(w_{i} \approx y(t_{i}), t_{i} = a + i h\) for \(i = 0, 1, \dots , N\) as follows:

\begin{align*} w_0 &= y(t_0) = y_0,\\ w_{i+1} &= w_{i} + h f(t_{i}, w_{i}), i = 0, \dots , N-1. \end{align*}

This equation is called the difference equation associated with Euler’s method.

euler.svg
Figure 5: Euler’s Method (from wiki)
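
A minimal Python sketch of the difference equation above (the test problem in the comment is a standard one with exact solution \(y(t) = (t+1)^2 - 0.5 e^{t}\)):

  def euler(f, a, b, y0, N):
      """Euler's method for y' = f(t, y), y(a) = y0, using N equal steps."""
      h = (b - a) / N
      t, w = a, y0
      ts, ws = [t], [w]
      for _ in range(N):
          w = w + h * f(t, w)      # w_{i+1} = w_i + h f(t_i, w_i)
          t = t + h
          ts.append(t)
          ws.append(w)
      return ts, ws

  # Example: y' = y - t^2 + 1, y(0) = 0.5 on [0, 2] with N = 10:
  # euler(lambda t, y: y - t**2 + 1, 0, 2, 0.5, 10)
  # gives w_10 ~ 4.8658, while the exact value is y(2) ~ 5.3055.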

Suppose \(s, t > 0\), \(a_0 \geq - \frac{t}{s}\), and \(a_{i+1} \leq (1 +s)a_{i} + t\), then \[ a_{i+1} \leq e ^{(i+1)s} \left( a_0 + \frac{t}{s} \right) - \frac{t}{s}. \]

We have

  1. \(y(t_{i+1}) = y(t_{i}) + h y'(t_{i}) + \frac{h ^2}{2} y''(\xi_{i})\)
  2. \(w_{i+1} = w_{i} + h f(t_{i}, w_{i})\).

(1) - (2):

\begin{align*} y(t_{i+1}) - w_{i+1} &= y(t_{i}) - w_{i} + h [y'(t_{i}) - f(t_{i}, w_{i})] + \frac{h ^2}{2} y''(\xi_{i})\\ &= y(t_{i}) - w_{i} + h [f(t_{i}, y(t_{i})) - f(t_{i}, w_{i})] + \frac{h ^2}{2} y''(\xi_{i}). \end{align*}

By the Lipschitz condition and the bound on \(y''\), we get \[|y(t_{i+1}) - w_{i+1}| \leq (1 + hL) |y(t_{i}) - w_{i}| + \frac{h ^2M}{2}.\]

Apply the lemma with \(s = hL\), \(t = \frac{h ^2M}{2}\), \(a_{i} = |y(t_{i}) - w_{i}|\). Notice that \(a_0 = 0\).

\begin{align*}|y(t_{i+1}) - w_{i+1}| &\leq e ^{(i+1) hL} \left( 0 + \frac{hM}{2L} \right) - \frac{hM}{2L}\\ &= \left[ e ^{(t_{i+1} - a)L} - 1 \right] \frac{hM}{2L}. \end{align*}

Suppose

  • \(f\) is continuous and \(L\)-Lipschitz in \(y\) on \(D = \{ (t, y) | t \in [a, b], y \in (-\infty, + \infty) \}\),
  • and \(|y''(t)| \leq M, \forall t \in [a, b]\) for some constant \(M\).

Let \(w_0, \dots , w_{n}\) be the approximations generated by Euler’s method. Then, for each \(i\), \[|y(t_{i}) - w_{i}| \leq \frac{hM}{2L} (e ^{L(t_{i} - a)} - 1) \]

Remark: In practice, one can take \(L = \max_{(t,y) \in D} \left| \frac{\partial}{\partial y} f(t, y) \right|\) (when this maximum exists).

7. Linear System of Equations

For a linear system

\begin{align*} a_{11} x_1 + \dots + a_{1n} x_{n} &= b_1,\\ \vdots & \vdots \\ a_{n1} x_1 + \dots + a_{nn} x_{n} &= b_{n}, \end{align*}

we define \(\mat{A} = [a_{ij}]\) and \(\vec{b} = (b_1, b_2, \dots , b_{n}) ^{T}\) (a column vector). The augmented matrix \(\tilde{\mat{A}} = [\mat{A}, \vec{b}]\) is an \(n \times (n+1)\) matrix representing the linear system.

Let \(E_{i}\) be the \(i\)-th equation of the system. The following transformations do not affect the solutions:

  • Multiply \(E_{i}\) by any nonzero constant.
  • Add a multiple of \(E_{i}\) to \(E_{j}\).
  • Swap \(E_{i}\) and \(E_{j}\).

7.1. Gaussian Elimination

  1. At the \(k\)-th step, provided \(a_{kk} \neq 0\), we eliminate the coefficient of \(x_{k}\) in the rows below the \(k\)-th row: \[ E_{j} \gets E_{j} - \frac{a_{j,k}}{a_{k,k}} E_{k}, \forall j = k + 1, k + 2, \dots , n. \]
  2. Use backward substitution to solve the \(k\)-th equation for \(x_{k}\), for \(k = n, n-1, n-2, \dots , 1\). This requires that \(a_{nn} \neq 0\) (after elimination).

(P.S.: the coefficients / equations always refer to the updated ones.)

The procedure can be described as a sequence of augmented matrices \(\tilde{A} ^{(0)}, \dots , \tilde{A} ^{(n-1)}\), where \(\tilde{A} ^{(0)}\) is the original matrix and \(\tilde{A} ^{(k)}\) is the matrix after the \(k\)-th elimination step: Let \(a_{ij} ^{(k)}\) be the \((i,j)\)-th entry of \(\tilde{A} ^{(k)}\) (with \(a_{i,n+1} = b_{i}\)).

\begin{align*} a_{ij} ^{(k)} = \begin{cases} a_{ij} ^{(k-1)}, & i \leq k \text{ (rows already processed do not change)}\\ 0, & i > k, j \leq k \text{ (eliminated in this or an earlier step)}\\ a_{ij} ^{(k-1)} - \frac{a_{i, k}^{(k-1)}}{a_{k, k} ^{(k-1)}} a_{k, j} ^{(k-1)}, & i > k, j > k \end{cases} \end{align*}

And the backward substitution can be represented as: \[ x_{k} = \frac{a_{k,n+1} - \sum_{j=k+1} ^{n} a_{k,j} x_{j}}{a_{k,k}} \]

Gaussian Elimination with pivoting: If at the \(k\)-th step the pivot element \(a_{k, k} ^{(k-1)}\) is 0, then a search is made of \(a_{k+1, k} ^{(k-1)}, \dots , a_{n, k} ^{(k-1)}\) (on the \(k\)-th column after the \(k\)-th row) for the first nonzero element. Suppose the nonzero element is \(a_{p, k} ^{(k-1)}\), then swap \(E_{k}\) and \(E_{p}\).

\(\mat{A} \vec{x} = \vec{b}\) can be solved using GE with pivoting iff \(\mat{A}\) is invertible.

Computational cost: \(O(n ^3)\).

Round-off error analysis:

  • When \(\frac{a_{i, k}^{(k-1)}}{a_{k, k} ^{(k-1)}}\) is much larger than \(1\), the original error is magnified.
  • When \(a_{kk}\) is small, the error in the numerator of \(\frac{a_{k,n+1} - \sum_{j=k+1} ^{n} a_{k,j} x_{j}}{a_{k,k}}\) is magnified.

To avoid the pivot \(a_{kk}\) being too small, we may select an element with a larger magnitude by swapping rows and columns. The following are several strategies for selecting the pivot element.


Partial pivoting
At the \(k\)-th step, select the element with the largest absolute value in the \(k\)-th column on or below the \(k\)-th row as the pivot: \[ E_{k} \leftrightarrow E_{p}, p = \argmax _{k \leq i \leq n} |a_{i,k} ^{(k-1)}|. \]


Scaled partial pivoting
Compute the scale factor (only once at the start) \(s_{i} = \max_{1 \leq j \leq n} |a_{ij}|\) for each row. Assume \(s_{i} \neq 0\) for all rows. Then, at the \(k\)-th step, select the element with the largest scaled absolute value in the \(k\)-th column on or below the \(k\)-th row as the pivot: \[ E_{k} \leftrightarrow E_{p}, p = \argmax_{k \leq i \leq n} \frac{|a_{i,k} ^{(k-1)}|}{s_{i}}. \] P.S.: Although the scale factors are computed only once, they must be interchanged together with the rows whenever rows are swapped.
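
A minimal Python sketch of Gaussian elimination with partial pivoting followed by backward substitution, working on the augmented matrix as plain lists of lists (no external libraries; the function name is my own):

  def gaussian_elimination(A, b):
      """Solve Ax = b by GE with partial pivoting; A is a list of rows."""
      n = len(b)
      M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
      for k in range(n - 1):
          p = max(range(k, n), key=lambda i: abs(M[i][k]))   # partial pivoting
          if M[p][k] == 0:
              raise ValueError("matrix is singular")
          M[k], M[p] = M[p], M[k]
          for i in range(k + 1, n):
              m = M[i][k] / M[k][k]                  # multiplier a_{i,k} / a_{k,k}
              for j in range(k, n + 1):
                  M[i][j] -= m * M[k][j]
      x = [0.0] * n
      for k in range(n - 1, -1, -1):                 # backward substitution
          x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
      return x

  # gaussian_elimination([[1, 1], [2, 1]], [3, 4])  ->  [1.0, 2.0]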

7.2. Matrix Factorization

\(\mat{A} = \mat{L} \mat{U}\), where \(\mat{L}\) is lower triangular and \(\mat{U}\) is upper triangular, is called the LU factorization.

Suppose that Gaussian elimination can be performed on the system \(\mat{A} \vec{x} = ?\) without row interchanges, i.e., \(a_{kk} ^{(k-1)} \neq 0\) at each step. (The r.h.s. is not our interest and is independent of this condition)

At the \(k\)-th step, for each \(i = k+1, \dots , n\), the operation \[ E_{i} \gets E_{i} - m_{i,k} E_{k}, m_{i,k} = \frac{a_{i,k}}{a_{k,k}} \] can be represented as

\begin{align*} \begin{bmatrix} \text{---} E_1 \text{---}\\ \vdots \\ \text{---} E_{n} \text{---} \end{bmatrix} \gets \begin{bmatrix} \text{---} E_1 \text{---}\\ \vdots \\ \text{---} E_{n} \text{---} \end{bmatrix} - \begin{bmatrix} 0_{1}\\ \vdots \\ 0_{k} \\ m_{k+1,k}\\ \vdots \\ m_{n,k} \end{bmatrix} \begin{bmatrix} \text{---} E_{k} \text{---} \end{bmatrix}. \end{align*}

Here, we are only interested in the coefficient part \(\mat{A}\). And the transformation at this step can be represented as

\begin{align*} \mat{A} ^{(k)} = \mat{M} ^{(k)} \mat{A} ^{(k-1)}, \mat{M} ^{(k)} = \mat{I} + \begin{bmatrix} 0_{1} & & & & O\\ & \ddots & & & \\ & & 0_{k} & & \\ & & -m_{k+1, k} & & \\ & & \vdots & \ddots & \\ 0& & - m_{n,k} & & 0_{n} \end{bmatrix}. \end{align*}

The matrix \(\mat{M} ^{(k)}\) is called the \(k\)-th Gaussian transformation matrix.

Then, \(\mat{A} ^{(n-1)} = \mat{M} ^{(n-1)} \dots \mat{M} ^{(1)} \mat{A}\). We construct:

  • \(\mat{U} := \mat{A} ^{(n-1)}\).
  • We want \(\mat{L} = [\mat{M} ^{(n-1)} \dots \mat{M} ^{(1)}] ^{-1}\), i.e., the inverse of the overall transformation. Fortunately, we can obtain a closed form for the inverse of each single transformation: \(E_{i} \gets E_{i} + m_{i,k} E_{k}, \forall i = k+1, \dots , n\) (this undoes the elimination step because \(E_{k}\) itself is left unchanged by it). This can be represented as \(\mat{L} ^{(k)} := [\mat{M} ^{(k)}] ^{-1}\), which is simply \(\mat{M} ^{(k)}\) with the signs of the multipliers flipped:

    \begin{align*} \mat{L} ^{(k)} = \mat{I} + \begin{bmatrix} 0_{1} & & & & O\\ & \ddots & & & \\ & & 0_{k} & & \\ & & m_{k+1, k} & & \\ & & \vdots & \ddots & \\ 0& & m_{n,k} & & 0_{n} \end{bmatrix}. \end{align*}

    Thus, \(\mat{L} :=\mat{L} ^{(1)} \dots \mat{L} ^{(n-1)}\). (See the formula below)

Then, \(\mat{A} = \mat{L} \mat{U}\).

Suppose Gaussian elimination can be performed on \(\mat{A} \vec{x} = ?\) without row interchanges. Then \(\mat{A}\) has a LU factorization \(\mat{L} \mat{U}\), with the two given by:

\begin{align*} \mat{U} = \mat{A} ^{(n-1)} = \begin{bmatrix} a_{1,1} ^{(0)} & \dots & a_{1,n} ^{(0)}\\ \vdots & \ddots & \vdots \\ 0 & \dots & a_{n,n}^{(n-1)} \end{bmatrix} \end{align*}

and

\begin{align*} \mat{L} = \begin{bmatrix} 1_{1} & & & & O \\ m_{2,1} & \ddots & & & \\ & & 1_{k} & & \\ \vdots & & m_{k+1, k} & & \\ & & \vdots & \ddots & \\ m_{n,1}& & m_{n,k} & & 1_{n} \end{bmatrix}. \end{align*}

Computational cost: computing the factorization itself costs \(O(n ^3)\) (the same as the elimination phase of GE); once \(\mat{L}\) and \(\mat{U}\) are available, each new right-hand side can be solved in \(O(n ^2)\) (see below).


When \(\mat{A}\) has a LU factorization, the linear system \(\mat{A} \vec{x} = \vec{b}\) can be solved in this way:

  1. Let \(\vec{y} = \mat{U} \vec{x}\). Then solve the lower triangular system \(\mat{L} \mat{U} \vec{x} = \mat{L} \vec{y} = \vec{b}\) for \(\vec{y}\): \[ y_{i} = \frac{b_{i} - \sum_{j=1}^{i-1} l_{i,j} y_{j}}{l_{i,i}}. \]
  2. Then solve the upper triangular system \(\mat{U} \vec{x} = \vec{y}\) for \(\vec{x}\): \[ x_{i} = \frac{y_{i} - \sum_{j=i+1}^{n} u_{i,j} x_{j}}{u_{i,i}}. \]

The overall time is \(O(n ^2)\), which is faster than \(O(n ^3)\) of GE.
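
A minimal Python sketch of the factor-then-solve procedure (Doolittle form with \(l_{ii} = 1\) and no row interchanges; the names are my own):

  def lu_factor(A):
      """Return L (unit lower triangular) and U with A = L U, assuming no pivoting is needed."""
      n = len(A)
      L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
      U = [row[:] for row in A]
      for k in range(n - 1):
          for i in range(k + 1, n):
              m = U[i][k] / U[k][k]          # multiplier m_{i,k}
              L[i][k] = m
              for j in range(k, n):
                  U[i][j] -= m * U[k][j]
      return L, U

  def lu_solve(L, U, b):
      n = len(b)
      y = [0.0] * n
      for i in range(n):                      # forward substitution: L y = b
          y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
      x = [0.0] * n
      for i in range(n - 1, -1, -1):          # backward substitution: U x = y
          x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
      return x

  # L, U = lu_factor([[2, 1], [4, 3]])   ->  L = [[1, 0], [2, 1]], U = [[2, 1], [0, 1]]
  # lu_solve(L, U, [3, 7])               ->  [1.0, 1.0]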


Let \(\pi_{1}, \dots \pi_{n}\) be a permutation of the numbers \(1, \dots , n\). Then the corresponding permutation matrix is obtained by rearranging the rows of the identity matrix according to \(\pi\), i.e., \(\mat{P} = [\vec{e}_{\pi_{1}} \dots \vec{e}_{\pi_{n}}] ^{T}\). \(\mat{P} \mat{A}\) has the effect of rearranging the rows of \(\mat{A}\) according to \(\pi\).

Property: \(\mat{P} ^{-1} = \mat{P} ^{T}\).

Recall GE with pivoting: it is equivalent to performing all the row exchanges at the beginning and then running the normal GE without row interchanges. That is, for any invertible matrix \(\mat{A}\), there exists a permutation matrix \(\mat{P}\) such that \(\mat{P} \mat{A} \vec{x} = \mat{P} \vec{b}\) can be solved without row interchanges.

Then, we can apply the LU factorization method to solve for \(\vec{x}\).

We can derive a factorization (not LU): \(\mat{A} = \mat{P} ^{-1} \mat{L} \mat{U} = \mat{P} ^{T} \mat{L} \mat{U}\).

7.3. Iterative Techniques

7.3.1. General Iterative Method

An iterative technique to solve the linear system \(\mat{A} \vec{x} = \vec{b}\) starts with an initial approximation \(\vec{x} ^{(0)}\) and generates a sequence \(\{ \vec{x} ^{(k)} \}_{k=0}^{\infty}\) that converges to the solution \(\vec{x}\).

In general,

  • The linear system \(\mat{A} \vec{x} = \vec{b}\) is analogous to the zero point problem (for linear maps) and it can be reorganized into another system of the form that is analogous to the fixed point problem: \[ \vec{x} = \mat{T} \vec{x} + \vec{c} \] for some (properly chosen) constant matrix \(\mat{T}\) and vector \(\vec{c}\).
  • And the iterative techniques associated with this system can be represented by \[ \vec{x} ^{(k)} = \mat{T} \vec{x} ^{(k-1)} + \vec{c}. \]

Intuitively, the iterative equation should hold when plugging the actual solution \(\vec{x}\) into it. The iteration fixes \(\vec{x}\) on the r.h.s. to be the approximation given by the last iteration and solves for \(\vec{x}\) on the l.h.s.

Stopping criteria: \(\frac{\| \vec{x} ^{(k)} - \vec{x} ^{(k-1)}\|_{\infty}}{\| \vec{x} ^{(k)} \|_{\infty}} < \epsilon\).

7.3.2. Jacobi’s Iterative Method

For each \(k \geq 1\), generate \(\vec{x} ^{(k)}\) by solving the \(i\)-th equation for \(x_{i} ^{(k)}\), assuming other variables are given by \(\vec{x} ^{(k-1)}\): \[ x_{i} ^{(k)} = \frac{1}{a_{i,i}} \left[ b_{i} - \sum_{j=1, j \neq i}^{n} a_{i,j} x_{j} ^{(k-1)} \right], \forall i = 1, \dots , n. \]

  • Equivalent system: Let \(\mat{D}\) be the diagonal matrix whose diagonal entries are those of \(\mat{A}\), \(\mat{L}\) be the negation of the strictly (diagonal not included) lower-triangular part of \(\mat{A}\), and \(\mat{U}\) be the negation of the strictly upper-triangular part of \(\mat{A}\). Then, \(\mat{A} = \mat{D} - \mat{L} - \mat{U}\) and \(\mat{A} \vec{x} = \vec{b}\) is transformed into \[ \mat{D} \vec{x} = (\mat{L} + \mat{U}) \vec{x} + \vec{b}. \] If \(\mat{D} ^{-1}\) exists, i.e., the diagonal entries are non-zero, then \[ \vec{x} = \mat{D} ^{-1}(\mat{L} + \mat{U}) \vec{x} + \mat{D} ^{-1} \vec{b}. \]
  • Matrix representation of the Jacobi’s iterative method: \[ \vec{x} ^{(k)} = \mat{D} ^{-1}(\mat{L} + \mat{U}) \vec{x} ^{(k-1)} + \mat{D} ^{-1} \vec{b}. \]

To speed convergence, the equations should be arranged (by reordering the equations and/or the unknowns) so that \(a_{i,i}\) is as large as possible.
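
A minimal Python sketch of Jacobi's method with the relative \(\ell_\infty\) stopping criterion from above (the test system in the comment is strictly diagonally dominant, so convergence is guaranteed; see the theorem below):

  def jacobi(A, b, x0, eps=1e-8, max_iter=200):
      n = len(b)
      x = list(x0)
      for _ in range(max_iter):
          x_new = [
              (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
              for i in range(n)
          ]
          # relative infinity-norm stopping criterion
          if max(abs(x_new[i] - x[i]) for i in range(n)) < eps * max(abs(v) for v in x_new):
              return x_new
          x = x_new
      return x

  # jacobi([[4, 1], [1, 3]], [1, 2], [0, 0])  ->  ~[1/11, 7/11] = [0.0909..., 0.6363...]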

7.3.3. Gauss-Seidel Iterative Method

Use the most recently calculated values to compute \(x_{i} ^{(k)}\): \[ x_{i} ^{(k)} = \frac{1}{a_{ii}} \left[ b_{i} - \sum_{j=1}^{i-1} a_{i,j} x_{j} ^{(k)} - \sum_{j=i+1}^{n} a_{i,j} x_{j} ^{(k-1)} \right]. \]

The GS method can also be represented in matrix form, through a slightly more complicated derivation: Collect all the \(k\)-th iteration terms to one side, \[ a_{i,1} x_1 ^{(k)} + \dots + a_{i,i} x_{i} ^{(k)} = - a_{i,i+1} x_{i+1} ^{(k-1)} - \dots - a_{i,n} x_{n} ^{(k-1)} + b_{i}, \forall i = 1, \dots , n. \]

  • Equivalent system: The equation above corresponds to \((\mat{D} - \mat{L}) \vec{x} = \mat{U} \vec{x} + \vec{b}\). Reorganizing it into (if \((\mat{D} - \mat{L}) ^{-1}\) exists, i.e., all the diagonal entries are non-zero), \[ \vec{x} = (\mat{D} - \mat{L}) ^{-1} \mat{U} \vec{x} + (\mat{D} - \mat{L}) ^{-1} \vec{b}. \]
  • GS method: \[ \vec{x} ^{(k)} = (\mat{D} - \mat{L}) ^{-1} \mat{U} \vec{x} ^{(k-1)} + (\mat{D} - \mat{L}) ^{-1} \vec{b}. \]
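
The same sketch adapted to Gauss-Seidel: each component is overwritten as soon as it is computed, so the newest values are used immediately:

  def gauss_seidel(A, b, x0, eps=1e-8, max_iter=200):
      n = len(b)
      x = list(x0)
      for _ in range(max_iter):
          x_old = list(x)
          for i in range(n):
              s1 = sum(A[i][j] * x[j] for j in range(i))         # uses x_j^{(k)} (already updated)
              s2 = sum(A[i][j] * x[j] for j in range(i + 1, n))  # uses x_j^{(k-1)}
              x[i] = (b[i] - s1 - s2) / A[i][i]
          if max(abs(x[i] - x_old[i]) for i in range(n)) < eps * max(abs(v) for v in x):
              return x
      return x

  # gauss_seidel([[4, 1], [1, 3]], [1, 2], [0, 0])  ->  same limit as Jacobi, in fewer iterations.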

7.4. Convergence Results for General Iteration Methods

Consider the general iteration \(\vec{x} ^{(k)} = \mat{T} \vec{x} ^{(k-1)} + \vec{c}\), where \(\vec{x} ^{(0)}\) is arbitrary.

A square matrix \(\mat{A}\) is convergent if \(\lim_{k \to \infty} (\mat{A} ^{k})_{i,j} = 0\) for all \(i, j\).

The following are equivalent conditions for \(\mat{A}\) to be convergent:

  1. The spectral radius (the maximum of the absolute values of its eigenvalues) \(\rho(\mat{A}) < 1\).
  2. \(\lim_{k \to \infty} \| \mat{A} ^{k} \| = 0\) for any matrix norm (https://mathworld.wolfram.com/NaturalNorm.html).
  3. \(\lim_{k \to \infty} \mat{A} ^{k} \vec{x} = \vec{0}\) for every \(\vec{x}\).

If the spectral radius of \(\mat{T}\) satisfies \(\rho(\mat{T}) < 1\), then \((\mat{I} - \mat{T}) ^{-1}\) exists and \[ (\mat{I} - \mat{T}) ^{-1} = \mat{I} + \mat{T} + \mat{T} ^2 + \dots . \]

  1. \(\mat{T} \vec{x} = \lambda \vec{x}\) iff \((\mat{I} - \mat{T}) \vec{x} = (1 - \lambda) \vec{x}\). Since \(|\lambda| \leq \rho(\mat{T}) < 1\), \(0\) is not an eigenvalue of \(\mat{I} - \mat{T}\), so \((\mat{I} - \mat{T}) ^{-1}\) exists.
  2. By the last lemma, \(\mat{T}\) is convergent. Hint: Consider the partial sum \(\mat{S}_{m} := \sum_{k=0}^{m} \mat{T} ^{k}\) and \((\mat{I} - \mat{T}) \mat{S}_{m}\).

For any \(\vec{x} ^{(0)}\), the sequence \(\{ \vec{x} ^{(k)} \}\) defined by \[ \vec{x} ^{(k)} = \mat{T} \vec{x} ^{(k-1)} + \vec{c} \] converges to the unique solution of \[ \vec{x} = \mat{T} \vec{x} + \vec{c} \] iff \(\rho(T) < 1\).

  1. If \(\rho(T) < 1\), then \((\mat{I} - \mat{T}) ^{-1}\) exists by the last lemma, so \(\vec{x} = (\mat{I} - \mat{T}) ^{-1} \vec{c}\) is the unique solution of \((\mat{I} - \mat{T}) \vec{x} = \vec{c}\).
  2. Assume \(\rho(T) < 1\). Expanding the iterative equation, we have \[ \vec{x} ^{(k)} = \mat{T} ^{k} \vec{x} ^{(0)} + (\mat{T} ^{k-1} + \dots + \mat{I}) \vec{c}. \] By the convergence of \(\mat{T}\) and the last lemma, \(\lim_{k \to \infty} \vec{x} ^{(k)} = (\mat{I} - \mat{T}) ^{-1} \vec{c}\).
  3. Conversely, assume \(\vec{x} ^{(k)}\) converges to the unique solution \(\vec{x}\) for any \(\vec{x} ^{(0)}\). Expanding both sides, we have \[ \lim_{k \to \infty} \mat{T} ^{k} \vec{x} + (\mat{T} ^{k-1} + \dots + \mat{I}) \vec{c} = \lim_{k \to \infty} \mat{T} ^{k} \vec{x} ^{(0)} + (\mat{T} ^{k-1} + \dots + \mat{I}) \vec{c}. \] Thus, \(\vec{0} = \lim_{k \to \infty} \mat{T} ^{k} (\vec{x} - \vec{x} ^{(0)})\). This means that \(\vec{0} = \lim_{k \to \infty} \mat{T} ^{k} \vec{z}\) for any \(\vec{z}\). By the convergence conditions, this implies that \(\rho(\mat{T}) < 1\).

If \(\| \mat{T} \| < 1\) (any norm), then

  • the sequence \(\{ \vec{x} ^{(k)} \}\) defined by \(\vec{x} ^{(k)} = \mat{T} \vec{x} ^{(k-1)} + \vec{c}\) converges to a vector \(\vec{x}\) with \(\vec{x} = \mat{T} \vec{x} + \vec{c}\) for any \(\vec{x} ^{(0)}\),
  • and
    • \(\| \vec{x} - \vec{x} ^{(k)} \| \leq \| \mat{T} \| ^{k} \| \vec{x} ^{(0)} - \vec{x} \|\).
    • \(\| \vec{x} - \vec{x} ^{(k)} \| \leq \frac{\| \mat{T} \| ^{k}}{1 - \| \mat{T} \|} \| \vec{x} ^{(1)} - \vec{x} ^{(0)}\|\).
  1. \(\| \mat{T} ^{k} \| \leq \| \mat{T} \| ^{k}\). Thus, the sequence converges to a solution by the theorem before.
  2. Hint: Use \(\vec{x} - \vec{x} ^{(k)} = \mat{T} (\vec{x} - \vec{x} ^{(k-1)})\), \(\vec{x} ^{(k)} - \vec{x} ^{(k-1)} = \mat{T} (\vec{x} ^{(k-1)} - \vec{x} ^{(k-2)})\), and \(\vec{x} - \vec{x} ^{(k)} = \lim_{m \to \infty} \vec{x} ^{(m)} - \vec{x} ^{(k)}\).

7.5. Application to the Jacobi & Gauss-Seidel Methods

A square matrix \(\mat{A}\) is strictly diagonally dominant if \[|a_{ii}| > \sum_{j \neq i} |a_{i,j}|, \quad \forall i.\]

TODO: strictly dominant => rho < 1.

If \(\mat{A}\) is strictly diagonally dominant, then for any \(\vec{x} ^{(0)}\), both the Jacobi and Gauss-Seidel methods generate sequences \(\{ \vec{x} ^{(k)} \}\) that converge to the unique solution of \(\mat{A} \vec{x} = \vec{b}\).

If \(a_{ij} \leq 0\) for each \(i \neq j\) and \(a_{ii} > 0\) for each \(i\), then one and only one of the following statements holds:

  1. \(0 \leq \rho(\mat{T}_{G}) < \rho(\mat{T}_{J}) < 1\).
  2. \(1 < \rho(\mat{T}_{J}) < \rho(\mat{T}_{G})\).
  3. \(\rho(\mat{T}_{J}) = \rho(\mat{T}_{G}) = 0\).
  4. \(\rho(\mat{T}_{J}) = \rho(\mat{T}_{G}) = 1\).

where \(\mat{T}_{J} = \mat{D} ^{-1}(\mat{L} + \mat{U})\) and \(\mat{T}_{G} = (\mat{D} - \mat{L}) ^{-1} \mat{U}\).

7.6. Successive Over-Relaxation

Introduce parameter \(\omega\). Then

\begin{align*} \omega (\mat{D} - \mat{L} - \mat{U}) \vec{x} &= \omega \vec{b}\\ (\mat{D} - \omega \mat{L}) \vec{x} &= ((1 - \omega) \mat{D} + \omega \mat{U}) \vec{x} + \omega \vec{b} \tag{*}\\ \vec{x} &= (\mat{D} - \omega \mat{L}) ^{-1} [((1 - \omega) \mat{D} + \omega \mat{U}) \vec{x} + \omega \vec{b}]. \end{align*}

\(\mat{T}_{\omega} = (\mat{D} - \omega \mat{L}) ^{-1} [(1 - \omega) \mat{D} + \omega \mat{U}]\), \(\vec{c} _{\omega} = (\mat{D} - \omega \mat{L}) ^{-1} \omega \vec{b}\).

We can extract term-wise formulation from (*):

\begin{align*} a_{i,i} x_{i} ^{(k+1)} + \omega \sum_{j=1}^{i-1} a_{i,j} x_{j} ^{(k+1)} &= (1 - \omega) a_{i,i} x_{i} ^{(k)} - \omega \sum_{j=i+1}^{n} a_{i,j} x_{j} ^{(k)} + \omega b_{i}\\ x_{i} ^{(k+1)} &= (1 - \omega) x_{i} ^{(k)} + \frac{\omega}{a_{i,i}} \left[b_{i} - \sum_{j=1}^{i-1} a_{i,j} x_{j} ^{(k+1)} - \sum_{j=i+1}^{n} a_{i,j} x_{j} ^{(k)} \right]. \end{align*}
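
A minimal Python sketch of the term-wise SOR update above; \(\omega = 1\) reduces it to Gauss-Seidel (the choice of \(\omega\) and the stopping rule are my own):

  def sor(A, b, x0, omega, eps=1e-8, max_iter=500):
      n = len(b)
      x = list(x0)
      for _ in range(max_iter):
          x_old = list(x)
          for i in range(n):
              s1 = sum(A[i][j] * x[j] for j in range(i))          # newest values x_j^{(k+1)}
              s2 = sum(A[i][j] * x[j] for j in range(i + 1, n))   # previous values x_j^{(k)}
              x[i] = (1 - omega) * x[i] + omega * (b[i] - s1 - s2) / A[i][i]
          if max(abs(x[i] - x_old[i]) for i in range(n)) < eps:
              return x
      return x

  # sor([[4, 1], [1, 3]], [1, 2], [0, 0], omega=1.1)  ->  ~[0.0909, 0.6364]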