Numerical representation 
The current positional number notation was introduced in
the Middle Ages under the influence of the works of
Al-Khwarizmi
(IX century), and allowed a considerable development
of Calculus in Western Europe. The Hindu notion of zero introduced a
concept of representation that was absent in earlier millennia.
The ease of calculation with this new representation meant that
the Greco-Roman cumulative notation was gradually abandoned ...
nevertheless at a medieval rhythm.
The choice of a base still posed some problems for
axiomatic medieval scholars.
For example, the representation 1/3 = 0.3333… could never be
exact in decimal notation.
While fractions (rational numbers) have a periodic decimal expansion,
real numbers like √2
present an unpredictable sequence of digits. Their
decimal representation is always limited to a finite number of
digits, and carries a consequent error.
It was only at the end of the XIX century, with
G. Cantor,
that real numbers were defined as (equivalence classes of)
Cauchy sequences of rational numbers, where the sequence:
x_{0}=1, x_{1}=1.4, x_{2}=1.41,
x_{3}=1.414, x_{4}=1.4142, ... 
is just one representative of all the rational sequences that
converge to the real number √2.
This interpretation of Cantor represented a great conceptual advance,
further allowing the distinction between countable infinity and
continuous infinity.
It is clear that the choice of the decimal base for a real number
representative is just the most common choice among an uncountable
number of other possibilities.
Already in the XVII century,
Leibniz
advocated the advantages of binary representation − which
is nowadays adopted in internal computer routines.
In general we can use any natural base β > 1 to write
a real number.
For instance, in Scientific Notation (usually β=10),
x = ± (0. a_{1} a_{2} ... a_{n} ...)_{β} × β^{t}
= ± ∑_{k=1}^{∞} a_{k} β^{t−k}
with the digits a_{k} ranging from 0 to β−1. To have uniqueness
of the representation, we require a_{1} ≠ 0 and represent zero
separately. This excludes multiple representations of the same number, as in
0.1 × β^{1} = 0.01 × β^{2} = 0.001 × β^{3} = 1
by choosing only the first.
Note that, by convention, 0.999… is identified
with the representation 1.000… (both sequences converge to 1).
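As a sketch, the digits a_{1}...a_{n} and the exponent t of this normalized representation can be extracted for any base β > 1 (zero is treated separately, as required above; the function name and digit count are illustrative choices):

```python
# Sketch: extract the first n digits a_1 ... a_n and the exponent t of the
# normalized scientific notation x = ±(0.a_1 a_2 ...)_beta × beta^t,
# with a_1 ≠ 0 (zero is represented separately, as in the text).
def scientific_digits(x, beta=10, n=8):
    if x == 0:
        raise ValueError("zero is represented separately")
    sign = -1 if x < 0 else 1
    m = abs(x)
    t = 0
    # normalize the mantissa into [1/beta, 1), adjusting the exponent t
    while m >= 1:
        m /= beta
        t += 1
    while m < 1 / beta:
        m *= beta
        t -= 1
    digits = []
    for _ in range(n):
        m *= beta       # shift the next digit before the point
        d = int(m)
        digits.append(d)
        m -= d
    return sign, digits, t
```

For instance, `scientific_digits(2 ** 0.5, 10, 5)` recovers the digits 1, 4, 1, 4, 2 with exponent t = 1, i.e. √2 = 0.14142… × 10^{1}.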

Floating Point Representation 
Unless you were in possession of a machine with infinite memory
(... a theoretical Turing machine),
the representation of a number must be finite.
Thus we must consider a finite number of digits in the mantissa and a limitation on the allowed exponents.

Definition: Floating Point System
The FP(β, n, t^{−}, t^{+}) system is the set of rational numbers
FP(β, n, t^{−}, t^{+}) =
{0} ∪
{ ± 0. a_{1} a_{2} ... a_{n} × β^{t}
: a_{k} ∈ {0, ..., β−1}, a_{1} ≠ 0, t ∈ {t^{−}, ..., t^{+}} }

Other Remarks:
(i) Note that as a_{1} ≠ 0, zero must be represented
separately (it might be a bit − a zero flag − just as there is a bit
that denotes the sign − the sign flag).
(ii) Usually β=10, but the binary representation β=2 is
preferred in computers (the digits a_{i} may be 0 or 1).
Details on computer implementation may be found in the link
IEEE 754-1985.
(iii) In programming languages (such as C, Fortran, Matlab) these FP systems are
set when we declare the variables. Usually words like single, float, real*4, ...
stand for a single precision FP system with 4 bytes (24 binary digits in the mantissa,
23 stored ~ 7 decimal digits). More recently it is standard to use double precision
FP systems with 8 bytes (53 binary digits in the mantissa ~ up to 16 decimal digits),
and the variables are declared as double, real*8, ...
Thus, we must map the real numbers to the rational number set FP,
fl: R → FP(β, n, t^{−}, t^{+}) ⊂ Q
and two types of limitation occur, leading to errors:
− Finite exponent limitations (errors that may abort execution):
− Overflow: when the exponent t is greater than t^{+}.
− Underflow: when the exponent t is less than t^{−}.
− Finite mantissa limitations (errors that we will study):
− Rounding errors: truncation and rounding.

Rounding 
Given a real number, represented in scientific notation
x = ± 0. a_{1} a_{2} ... a_{n} a_{n+1} ... × β^{t}
when storing it in a system FP(β,n) we may be forced to
suppress further digits in the mantissa.
To proceed with that conversion we consider two types of rounding:
− Truncation
fl_{c}(x) = ± 0. a_{1} a_{2} ... a_{n} × β^{t}
− Rounding (half-up) (assuming β even)
fl_{s}(x) = ± 0. a_{1}' a_{2}' ... a_{n}' × β^{t'}
The digits a_{k}' coincide with a_{k},
and the exponent t' coincides with t, if a_{n+1} < β/2.
If a_{n+1} ≥ β/2 then the digits a_{k}' and the exponent t'
result from the representation of the sum
0. a_{1} a_{2} ... a_{n} × β^{t} + β^{t−n}.
Remarks:
(i) We may also consider (for x > 0; for x < 0 the half is subtracted)
fl_{s}(x) = fl_{c}(x + ½ β^{t−n})
(ii) It should be noted that for statistical purposes there is also a
round-to-even method that ensures a better distribution
(in this method, ties are rounded so that the final digit of the
mantissa is even − for instance, both 11.5 and 12.5 are rounded to the
nearest even number, 12).
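These two rounding modes can be sketched for β = 10 using Python's Decimal type (an illustrative implementation working on exact decimal digits, not how FP hardware operates):

```python
# Sketch of the two rounding modes of the text in a decimal system FP(10, n):
# fl_c truncates the mantissa to n digits, fl_s rounds half-up.
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl_c(x, n):
    # truncation: keep only the first n significant decimal digits
    d = Decimal(str(x))
    t = d.adjusted() + 1                 # exponent t in x = 0.a1...an × 10^t
    q = Decimal(1).scaleb(t - n)         # quantum 10^(t - n)
    return float(d.quantize(q, rounding=ROUND_DOWN))

def fl_s(x, n):
    # half-up rounding decided by the (n+1)-th digit
    d = Decimal(str(x))
    t = d.adjusted() + 1
    q = Decimal(1).scaleb(t - n)
    return float(d.quantize(q, rounding=ROUND_HALF_UP))
```

For instance, with n = 2 digits, fl_c(0.6666) = 0.66 while fl_s(0.6666) = 0.67.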

Definition:
Given x an exact value and x^{~} an approximate value of x, we define:
− Error: e_{x}(x^{~}) = x − x^{~}
− Absolute error: |e_{x}(x^{~})| = |x − x^{~}|
− Relative error (x ≠ 0): δ_{x}(x^{~}) with
δ_{x}(x^{~}) = e_{x}(x^{~}) / x
(it is often its absolute value that is bounded).
− If there is no confusion about which approximation of x is meant,
we simply write e_{x} to denote e_{x}(x^{~}), and likewise δ_{x} for the relative error.
− We also use δ_{x~} when it is more important to emphasize which approximation is used.

As rounding errors are inherent to a FP system, we bound the
rounding relative error δ_{arr(x)}, meaning
δ_{arr(x)} = δ_{x}(fl(x)) = (x − fl(x)) / x

Theorem:
For any x such that fl(x) ∈ FP(β, n),
we have the relative error estimates:
− Truncation:
|δ_{arr(x)}| < u_{c} = β^{1−n}
− Rounding:
|δ_{arr(x)}| < u_{s} = ½ β^{1−n}
The bounds u are sometimes called rounding units.


proof:
We prove the truncation case. Consider any real number x,
x = ± 0.a_{1} ... a_{n} a_{n+1} ... × β^{t}
As a_{1} ≠ 0, then a_{1} ≥ 1 and
|x| ≥ 0.1 × β^{t} = β^{t−1}
On the other hand,
|e_{arr}| = |x − fl_{c}(x)|
= |0.a_{1} ... a_{n} a_{n+1} ... − 0.a_{1} ... a_{n}| × β^{t}
= 0.0 ... 0 a_{n+1} ... × β^{t}
≤ 0.0 ... 1 0 ... × β^{t} = β^{t−n}
which implies
|δ_{arr}| = |e_{arr}| / |x| ≤ β^{t−n} / β^{t−1} = β^{1−n}
(The proof for rounding is similar, and these
estimates may be slightly improved...)

Remark
Note that in a FP system we never round to zero.
In some calculator machines, or in some ill implemented FP systems, it is common
to see an overflow error, but not an underflow error, meaning
that small absolute values are improperly rounded to zero.
This leads to odd situations, like null values for exponentials... for instance,
exp(−exp(8)) = 0.243… × 10^{−1294}
may be zero in your calculator, without an "underflow warning".
− How to present the result, using a conventional machine?
In these cases you should separate the calculations − what you need to compute
from what is not representable. Thus, we may reduce to powers of 10,
using logarithms, and present the result afterwards:
e^{−exp(8)} = 10^{−exp(8)/log(10)} = 10^{−1294.6136}
(log denoting the natural logarithm). Note that the operation −exp(8)/log(10)
presents no problems for the calculator, and separating the decimal part
of the exponent, 10^{−0.6136} = 0.2434…, we get the mantissa.
These types of error may get worse as further calculations proceed...
just note that if the result were 0 we could not get back
the logarithm, which should be −exp(8).
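The same log-separation trick, sketched in Python (double precision underflows near 10^{−308}, so exp(−exp(8)) itself would round to zero; the helper function name is an illustrative choice):

```python
# The log-separation trick: exp(-exp(8)) underflows double precision, but its
# base-10 mantissa and exponent are easy to recover from the logarithm.
import math

def mantissa_exponent(log10_value):
    # write 10^log10_value as m × 10^k with the mantissa m in [0.1, 1)
    k = math.ceil(log10_value)
    m = 10.0 ** (log10_value - k)
    return m, k

log10_x = -math.exp(8) / math.log(10)   # log10 of exp(-exp(8))
m, k = mantissa_exponent(log10_x)
print(f"exp(-exp(8)) = {m:.4f} x 10^{k}")   # 0.2434 x 10^-1294
```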

Error propagation in functions 
Consider now the study of the error propagation that results from applying
a function or an operation to real numbers. Generally, we want to know
whether a function f significantly affects the proximity of the
original values:
x^{~} ≈ x ⇒ f(x^{~}) ≈ f(x),
in terms of the relative error. Applying Lagrange's theorem,
e_{f(x)} = f(x) − f(x^{~}) = f '(ξ)(x − x^{~}) = f '(ξ) e_{x}
with ξ ∈ [x; x^{~}], supposing that f ∈ C^{1} (in the interval).
It results (for x ≠ 0, f(x) ≠ 0)
δ_{f(x)} = e_{f(x)}/f(x) =
(f '(ξ)/f(x)) e_{x} = (x f '(ξ)/f(x)) δ_{x}
and we may get the estimate for the relative error:
|δ_{f(x)}| ≤
(|x| max_{ξ∈[x; x^{~}]} |f '(ξ)| / |f(x)|) |δ_{x}|.
This estimate is not very useful, and it is simpler to consider the approximation
ξ ≈ x, which is a 1^{st} order (or linear) approximation, in terms of
the Taylor expansion.
In fact, neglecting the term o(e_{x}) in
f(x^{~}) = f(x) − f '(x) e_{x} + o(e_{x})
we get e_{f(x)} ≈ f '(x) e_{x}, and dividing by f(x)
we again obtain the approximation
δ_{f(x)} ≈ (x f '(x)/f(x)) δ_{x}
which allows us to evaluate the error propagation through a function.

Definition: The value P_{f}(x) = x f '(x)/f(x)
is called the condition number of f at x, and we have
δ_{f(x)} ≈ P_{f}(x) δ_{x}

If |P_{f}(x)| is large, this means that the relative error is
considerably increased by the application of the function f.
On the other hand, if it is small, this means that we may get
a relatively good approximation of f(x) even with considerable
errors in x.

Examples:
− If f(x) = x^{p}, we get P_{f}(x) = p, and therefore
δ_{x^p} ≈ p δ_{x}.
We conclude that for larger powers the initial relative error
will be considerably increased.
− If f(x) = e^{x}, we get P_{f}(x) = x, thus
δ_{exp(x)} ≈ x δ_{x}.
We conclude that for large x
the initial relative error will be considerably increased.
− If f(x) = x − a (a ≠ 0 constant), we get P_{f}(x) = x/(x−a),
and therefore δ_{x−a} ≈ x/(x−a) δ_{x}.
We conclude that for x ≈ a
the initial relative error will be greatly increased.
This example shows that sums/subtractions may lead to
significant perturbations in the calculations, due to
"subtractive cancellation".
− If f(x) = ax (a ≠ 0 constant), we get P_{f}(x) = 1, and therefore
δ_{ax} ≈ δ_{x}.
In this case the original error is not affected.
As a counterpart to some of the previous examples,
the propagation may lead to advantageous situations
... while evaluating x^{p} with p ≈ 0
(root extraction), the relative error decreases, and the
same happens for e^{x} or x − a when x ≈ 0.
In these cases, a reasonable approximation of x may lead
to an excellent result in terms of f(x).
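These approximations can be checked numerically; a sketch follows, using the signed convention δ_{x} = e_{x}/x and hypothetical sample points:

```python
# Numerical check of delta_f(x) ≈ P_f(x) * delta_x for two of the examples:
# perturb x by a known relative error and compare observed vs. predicted.
def relative_error(exact, approx):
    return (exact - approx) / exact

def check(f, P, x, delta_x=1e-6):
    x_approx = x * (1 - delta_x)        # e_x = x*delta_x, so delta is delta_x
    observed = relative_error(f(x), f(x_approx))
    predicted = P(x) * delta_x
    return observed, predicted

# f(x) = x^5: condition number P_f(x) = 5 (well conditioned)
obs, pred = check(lambda x: x ** 5, lambda x: 5.0, x=2.0)
# f(x) = x - 1 near x = 1: P_f(x) = x/(x-1) is huge (cancellation)
obs2, pred2 = check(lambda x: x - 1.0, lambda x: x / (x - 1.0), x=1.001)
```

In the second case the predicted amplification factor is about 10^{3}, so a relative error of 10^{−6} in the data becomes roughly 10^{−3} in the result.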


Definition: We say that a function f is ill conditioned
near a point z if, in the limit, P_{f}(z) = ∞.

Operations (functions with two or more variables)
In the case of two (or several) variables, we must consider simultaneous errors in both variables.
Considering the Taylor expansion in two variables (∂_{m} represents the
partial derivative in the m-th variable):
f(x^{~}, y^{~}) = f(x, y) −
∂_{1}f(x, y) e_{x} − ∂_{2}f(x, y) e_{y} + o(e_{x}, e_{y})
we get
e_{f(x, y)} ≈ ∂_{1}f(x, y) e_{x} + ∂_{2}f(x, y) e_{y}
which is a linear approximation neglecting the term o(e_{x}, e_{y}),
and we get
δ_{f(x,y)}
≈ (x ∂_{1}f(x,y)/f(x,y)) δ_{x} +
(y ∂_{2}f(x,y)/f(x,y)) δ_{y}
In a similar way we get the condition numbers
P_{f}^{x}(x,y) and P_{f}^{y}(x,y)
as the coefficients of the terms
δ_{x} and δ_{y} (respectively), and
δ_{f(x,y)} ≈ P_{f}^{x}(x,y) δ_{x} + P_{f}^{y}(x,y) δ_{y}

When any of these condition numbers tends to infinity at a value (x,y),
we say that there is ill conditioning of f near (x,y),
as previously presented for a single variable.

Likewise, these notions extend to functions with more than two variables.
The most interesting cases are simple operations like sums or products,
which may be seen as functions of two variables.
− For the sum (subtraction) we get
δ_{x+y} = (x/(x+y)) δ_{x} + (y/(x+y)) δ_{y} ;
δ_{x−y} = (x/(x−y)) δ_{x} − (y/(x−y)) δ_{y}
note that these are equalities and not just approximations!
This formula immediately shows that there will be problems when
subtracting similar quantities, i.e. x ≈ y,
which is the case of subtractive cancellation, associated with
significant rounding error propagation.
− In the case of products (or divisions), the application of the formula gives
δ_{xy} ≈ δ_{x} + δ_{y} ;
δ_{x/y} ≈ δ_{x} − δ_{y}
Note that this is no longer an equality, just an approximation − the exact
formula holds a nonlinear term δ_{x}δ_{y} that is neglected:
δ_{xy} = δ_{x} + δ_{y} − δ_{x}δ_{y}
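A small sketch of subtractive cancellation: two data values with tiny relative errors of opposite sign (an assumed 10^{−10}) produce a far larger relative error in their difference, exactly as the formula δ_{x−y} = x/(x−y) δ_{x} − y/(x−y) δ_{y} predicts:

```python
# Subtractive cancellation: x ≈ y makes the amplification factor x/(x-y) huge.
x, y = 1.000001, 1.000000
delta_x, delta_y = 1e-10, -1e-10    # small relative errors of opposite sign

x_approx = x * (1 - delta_x)        # e_x = x*delta_x
y_approx = y * (1 - delta_y)

exact = x - y                       # ≈ 1e-6
approx = x_approx - y_approx
delta_diff = (exact - approx) / exact

# amplification factor x/(x-y) ≈ 1e6, so the 1e-10 data errors become ≈ 2e-4
print(delta_diff)
```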

Error propagation in algorithms
The previous study of conditioning does not depend on the way the
calculations are made; it depends only on the function.
However, when we calculate an expression it is usually not possible to
perform it in a single step − we are led to intermediate calculations
and to the notion of algorithm.
Algorithm: To calculate z = f(x), we consider an
algorithm as a mathematically equivalent form
(but not numerically equivalent) defined by the finite composition
of elementary functions F_{k}:
z_{1} = F_{1}(x); z_{2} = F_{2}(x, z_{1});
... z_{n} = F_{n}(x, ..., z_{n−1});

such that z = z_{n}.
(generally, the variables or functions might be multidimensional).
For instance, to calculate z = sin(x^{2}) + x^{2},
we may consider the algorithm
z_{1} = x^{2}; z_{2} = sin(z_{1}); z_{3} = z_{2} + z_{1}.
where at each step a single calculation is performed,
using the previously stored results.
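The three steps of this algorithm, written as a sketch:

```python
# The example algorithm for z = sin(x^2) + x^2: one elementary
# function per step, each using previously stored results.
import math

def f_algorithm(x):
    z1 = x * x          # z1 = x^2
    z2 = math.sin(z1)   # z2 = sin(z1)
    z3 = z2 + z1        # z3 = z2 + z1 = sin(x^2) + x^2
    return z3
```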
It is now important to define what we consider as an elementary function.

Definition: In a FP(β,n) system
we say that F is an elementary computational function
associated to a function f, if
F(y) = fl(f(y)), ∀ y∈ D ⊆ FP(β,n) 
where D is a subset where the function is defined.

For instance, the computational function F=LOG is associated
to the function f(x)=log(x),
which is defined for x > 0, and we must consider
D = { x∈ FP(β,n) : x > 0 }. 
The computational function LOG is considered elementary if the result
(for the allowed numbers in the FP system) is exact up
to rounding errors [that is what F(y) = fl(f(y)) means].
− In standard programming languages (C, Fortran, Matlab, ...) the simplest
functions/operations are implemented to be elementary and usually return
the exact value up to a rounding error in the 15^{th} decimal place (in double
precision). For instance, we will consider as elementary:
− the functions: exp, log, sin, cos;
− the operations: +, −, *, /, ^ (sum, subtraction, product, division, power);
− In principle the admissible set D should be restricted to avoid
Overflow or Underflow errors.
For instance, if x = 10^{q}, then in a system FP(10, n, −t, t), when we program
the elementary operation x^p we should have an error when
x^{p} = 0.1 × 10^{1+qp}
has an exponent outside the range admitted by the system; for instance, for q > 0:
− when p > (t−1)/q : an Overflow error message should occur,
− when p < −(t+1)/q : an Underflow error message should occur.

Condition numbers in an algorithm
− Condition numbers without rounding errors
The linear approximation allows us to establish a relation
between the partial results and the final result for the condition number.
We may establish this relation by decomposing the
function into successive compositions. It is then enough to show it
for a single composition to conclude the result for the whole sequence.
Let z = f(g(x)), which we decompose as z_{1} = g(x); z_{2} = f(z_{1}).
Applying the formula at each step, we get
δ_{z2} ≈ (z_{1} f '(z_{1})/f(z_{1})) δ_{z1}
= (g(x) f '(g(x))/f(g(x))) δ_{z1}
≈ (g(x) f '(g(x))/f(g(x))) (x g'(x)/g(x)) δ_{x}
= (x g'(x) f '(g(x))/f(g(x))) δ_{x}
= P_{f ∘ g}(x) δ_{x}
and therefore the final result gives exactly the condition number.
This is summarized in the next proposition:

Proposition:
With the linear approximation, without rounding errors,
the composition of the condition numbers at each step equals the
condition number of the whole composition.

− Condition numbers with rounding errors
Even with well implemented elementary functions, at each
step we must consider a rounding error.
Thus, to calculate the error propagation in the algorithm
we should add a rounding error at each step.
For instance, in the first step (with P_{F1} the condition number
of the first elementary function):
δ_{z1} ≈ P_{F1}(x) δ_{x} + δ_{arr1}
and similarly in the next steps, adding the term
δ_{arrk} at each step k of the algorithm.

Example:
Consider the hyperbolic sine function,
sinh(x) = (e^{x} − e^{−x})/2
defined using the elementary function exp, by the algorithm
− z_{1} = exp(x);
− z_{2} = 1/z_{1};
− z_{3} = z_{1} − z_{2};
− z_{4} = z_{3}/2;
To study the error propagation in this algorithm, we have at
each step:
− δ_{z1} ≈ x δ_{x} + δ_{arr1};
− δ_{z2} ≈ − δ_{z1} + δ_{arr2};
− δ_{z3} ≈ z_{1}/(z_{1} − z_{2}) δ_{z1} − z_{2}/(z_{1} − z_{2}) δ_{z2} + δ_{arr3};
− δ_{z4} ≈ δ_{z3} + δ_{arr4};
the expressions may be substituted, obtaining
− δ_{z2} ≈ − (x δ_{x} + δ_{arr1}) + δ_{arr2};
− δ_{z3} ≈
e^{x}/(e^{x} − e^{−x}) δ_{z1} − e^{−x}/(e^{x} − e^{−x}) δ_{z2} + δ_{arr3}
≈ e^{x}/(e^{x} − e^{−x}) (x δ_{x} + δ_{arr1})
− e^{−x}/(e^{x} − e^{−x}) (− x δ_{x} − δ_{arr1} + δ_{arr2})
+ δ_{arr3}
= x (e^{x} + e^{−x})/(e^{x} − e^{−x}) δ_{x} + (e^{x} + e^{−x})/(e^{x} − e^{−x}) δ_{arr1}
− e^{−x}/(e^{x} − e^{−x}) δ_{arr2} + δ_{arr3};
− δ_{z4} ≈
x (e^{x} + e^{−x})/(e^{x} − e^{−x}) δ_{x}
+ (e^{x} + e^{−x})/(e^{x} − e^{−x}) δ_{arr1} − e^{−x}/(e^{x} − e^{−x}) δ_{arr2}
+ δ_{arr3} + δ_{arr4};
The last expression may be written in a general way
δ_{z4} ≈
P_{f}(x) δ_{x}
+ Q_{1}(x) δ_{arr1}
+ Q_{2}(x) δ_{arr2}
+ Q_{3}(x) δ_{arr3}
+ Q_{4}(x) δ_{arr4}
and we emphasize that the δ_{x} coefficient is
exactly the condition number P_{f}(x), while all the
other coefficients Q_{m}, associated with the rounding errors
δ_{arrm}, depend on the chosen algorithm
(the last one is always 1; here Q_{4} = 1).
Remark: In fact, we may simplify the increasing sequence of
calculations needed to obtain δ_{z4} by considering from the start that
δ_{x} = 0.
This way, we just calculate the terms Q_{k}(x), as the term
P_{f}(x) is given by the explicit formula.
It is an exercise to repeat the previous steps in this simpler fashion:
δ_{z1} ≈ x δ_{x} + δ_{arr1} = δ_{arr1};
δ_{z2} ≈ − δ_{z1} + δ_{arr2} = − δ_{arr1} + δ_{arr2};
δ_{z3} ≈ ((e^{x} + e^{−x}) δ_{arr1} − e^{−x} δ_{arr2})/(e^{x} − e^{−x}) + δ_{arr3};
δ_{z4} ≈ δ_{z3} + δ_{arr4};
and the coefficients are exactly the same!
− Note that sinh is not ill conditioned at
any point x ∈ R, since even for x ≈ 0 we get
P_{f}(x) = x (e^{x} + e^{−x})/(e^{x} − e^{−x}) → 1.
[Remark: it will be ill conditioned for large values, x ≈ ∞, because
lim_{x→∞} P_{f}(x) = ∞.]
− Although it is well conditioned for x ≈ 0, we see that
lim_{x→0} Q_{1}(x) = + ∞;
lim_{x→0} Q_{2}(x) = − ∞
and we conclude that in this algorithm the rounding errors change
the final result significantly when x ≈ 0.
Thus, besides conditioning problems in functions, in an algorithm
we also consider another notion − numerical stability.
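This instability can be observed numerically; the sketch below compares the algorithm above with the library sinh, taken here as the reference value − near zero the printed relative error typically grows by several orders of magnitude:

```python
import math

def sinh_naive(x):
    # the algorithm of the text: z1 = e^x, z2 = 1/z1, z3 = z1 - z2, z4 = z3/2
    z1 = math.exp(x)
    z2 = 1.0 / z1
    z3 = z1 - z2
    return z3 / 2.0

for x in (1.0, 1e-4, 1e-8):
    ref = math.sinh(x)                       # reference value
    rel = abs(ref - sinh_naive(x)) / ref     # observed relative error
    print(f"x = {x:8.0e}   relative error = {rel:.2e}")
```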

Definition:
Consider an algorithm that uses m elementary functions
to calculate the function value z = f(x), in a FP(β,n) system.
The expression for the relative error is linearly approximated by
δ_{zm} ≈ P_{f}(x) δ_{x}
+ Q_{1}(x) δ_{arr1} + ... + Q_{m}(x) δ_{arrm}
If for a certain w we have
lim_{x→w} P_{f}(x) = ∞
or
lim_{x→w} Q_{k}(x) = ∞
(for some k)
we say that the algorithm is numerically unstable for the
values x ≈ w.

We present the definition for one variable, but the generalization
to algorithms depending on a larger number of variables is clear.
For instance, for two-variable functions we have
the condition numbers P_{f}^{x}(x,y) and P_{f}^{y}(x,y).

Remark:
An immediate consequence of the definition is that
ill conditioning implies numerical instability.
Thus, since it is impossible to solve ill conditioning problems
without changing the number n in the FP(β,n) system
(which can be done in more sophisticated languages, such as Mathematica),
we should try to avoid numerical instability problems in the calculation
of well conditioned functions. This can be done by looking for stable
algorithms, or at least by avoiding the introduction of numerical
instabilities in the algorithm.

Examples:
− Consider an algorithm to calculate f(x) = (x−1)(x+1) + 1:
z_{1} = x − 1; z_{2} = x + 1;
z_{3} = z_{1} z_{2};
z_{4} = z_{3} + 1;
which, with simplifications made, reduces to the square function
f(x) = x^{2}. This is clearly a well conditioned function, with
P_{f}(x) = 2. However, the unsimplified expression in the algorithm leads
to numerical instabilities, as the study of the error propagation
shows:
δ_{f(x)} ≈ 2 δ_{x} + (1 − x^{−2})
(δ_{arr1} + δ_{arr2} + δ_{arr3}) + δ_{arr4}
When x → 0 we obtain Q_{1,2,3}(x) → − ∞,
showing the numerical instability problem when x ≈ 0.
In fact, we may check that in any FP(10,n) system with
rounding, if x = 10^{−n−1} we get z_{1} = −1; z_{2} = 1;
and therefore z_{4} = 0, giving 100% relative error, i.e.
δ_{z4} = 1 (even considering that x is exact and
δ_{x} = 0). For that FP system we know that the
rounding unit verifies
|δ_{arr}| < u = 0.5 × 10^{1−n},
and we have δ_{z4} = 1 ≈ (2 × 10^{n−1}) × 0.5 × 10^{1−n}.
− Recovering the hyperbolic sine example: that function presents no
ill conditioning problems (for fixed x ∈ R),
but when x ≈ 0 the algorithm presents numerical instabilities, as
we see in the infinite limits of Q_{1} and Q_{2}.
Based on the Taylor expansion, we may present an approximation that
avoids that problem near zero. Thus,
sinh(x) = (e^{x} − e^{−x})/2 = ((1 + x + x^{2}/2 + O(x^{3})) − (1 − x + x^{2}/2 + O(x^{3})))/2
= x + O(x^{3})
and we may consider sinh(x) ≈ x, with an O(x^{3}) error term.
That error is negligible for |x| < 10^{t}
in a FP(10,n) system when t < −n/3.
Therefore we may propose a stable algorithm to calculate z = sinh(x)
(t being the exponent of x = ± 0.a_{1}... × 10^{t}):
− if t < −n/3 then z = x;
− if t ≥ −n/3 then z = z_{4} (using the sinh algorithm for z_{4});
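A sketch of this stable algorithm in double precision (n ≈ 16); the cutoff 10^{−6} below is an assumed illustrative threshold of the order of 10^{−n/3}:

```python
import math

def sinh_stable(x, threshold=1e-6):
    # near zero, sinh(x) = x + O(x^3): return x directly and avoid
    # the cancellation in e^x - e^{-x}
    if abs(x) < threshold:
        return x
    z1 = math.exp(x)      # the algorithm of the text for larger |x|
    z2 = 1.0 / z1
    z3 = z1 - z2
    return z3 / 2.0
```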


Nonlinear scalar equations 
The resolution of algebraic (polynomial) equations has
been an interesting subject even for the oldest civilizations.
Some ancient records show that some second degree equations
were solved in Babylon (IV century BC), and they also appear in the works of
Euclid
in the III century BC
(the general setting was considered by
Brahmagupta, VII century AD, and by
Al-Khwarizmi, IX century AD).
However, no closed formula for third degree equations was found in the following millennia.
Some advances made by medieval scholars
(e.g. Fibonacci)
were only completed in the XVI century, with the Italian Renaissance.
Afterwards, new difficulties and impossibilities for higher degree equations
led to numerical methods for the resolution of any equation.
These numerical methods also showed a performance that was better
than the use of the closed formulas, and they were not restricted to algebraic equations.
Algebraic Equations
− 1^{st} Degree:
First degree equations have a trivial solution:
a x + b = 0 ⇔ x = − a^{−1} b (when a ≠ 0)
However, we note that even in this trivial case an elementary
division algorithm is needed for b/a, and this is not easy to program
in a FP system (it is probably more complicated than any other
method presented here).
While writing the solution in the form x = − a^{−1} b, we consider
a general case that includes matrices, if the condition a ≠ 0
is understood as the invertibility condition, det(a) ≠ 0.
− Quadratic equation (2^{nd} degree):
Second degree equations have a simple closed formula:
a x^{2} + b x + c = 0 ⇔
x = (2a)^{−1} ( − b ± (b^{2} − 4ac)^{1/2} ) (when a ≠ 0)
These solutions are complex conjugates for negative discriminants, b^{2} − 4ac < 0
(assuming real coefficients).
This expression includes a solution in any field structure:
To solve x^{2} − x − 1 = 0 in Z_{7}:
x = (2)^{−1} (1 ± (1 + 4)^{1/2}) = 4 (1 ± 3i) = 4 ± 2i
(in Z_{7} we have (3i)^{2} = −9 ≡ 5, and 4 × 3i = 12i ≡ 5i ≡ −2i).
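For real coefficients, the closed formula can be sketched directly (using complex square roots, so that the negative-discriminant case yields the conjugate pair; the function name is an illustrative choice):

```python
# The quadratic formula x = (2a)^{-1} (-b ± sqrt(b^2 - 4ac)) for real
# coefficients; cmath.sqrt handles negative discriminants as complex roots.
import cmath

def solve_quadratic(a, b, c):
    if a == 0:
        raise ValueError("not a quadratic equation")
    sqrt_disc = cmath.sqrt(b * b - 4 * a * c)
    return ((-b + sqrt_disc) / (2 * a), (-b - sqrt_disc) / (2 * a))
```

For instance, `solve_quadratic(1, -1, -1)` returns the two real roots (1 ± √5)/2, while `solve_quadratic(1, 0, 1)` returns the conjugate pair ± i.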
− Cubic and Quartic (3^{rd} and 4^{th} degree):
After the work of Khayyam
and other slow medieval advances,
del Ferro and
Tartaglia
found the formula to solve the cubic equation at the beginning of the XVI century.
These works were further explored to solve the quartic equation by
G. Cardan and
L. Ferrari,
and published by Cardan.
Links to these formulas may be found here:
cubic formula,
quartic formula.
These techniques, developed at the University of Bologna, failed during the
next centuries to solve the quintic (5^{th} degree).
− Quintic (5^{th} degree):
One of the most important problems in the following three centuries was to
prove or disprove the existence of a closed formula for the quintic.
After some substantial progress by
Lagrange,
the non-existence was proven independently at the beginning of the XIX century by
Galois and
Abel.
This impossibility result made it clear that there would be no
way to express explicitly the solution of some algebraic equations of degree ≥ 5.
− Degree p: Despite the non-existence of an explicit formula for the quintic
(or higher degree equations), it was analytically established (by
Argand,
Gauss,
early XIX century) that any algebraic equation of degree p has exactly
p complex roots − this is the well known
fundamental theorem of algebra.
Non algebraic equations: For other, non algebraic equations,
defined also with non polynomial functions (sines, exponentials, ...),
from the beginning there was no hope of finding closed explicit formulas,
except in trivial situations. We note that numbers like π or e
may be solutions of simple non algebraic equations, but they cannot
be expressed as a solution of any algebraic equation
(with rational coefficients). These numbers are called
non algebraic, or transcendental.
Numerical Methods for Root Finding
Since explicit formulas were not available, numerical procedures were
attempted even in medieval times.
Fibonacci (XIII century) presents fixed point iterations to solve
some specific algebraic equations, but he does not explore the method.
It was only in the XVII century, with the development of calculus in Europe,
that numerical methods were clearly established, especially after the works of
Newton
(see also
Raphson).
The Newton method that we will study is probably the best known general
method for root finding.
Root-finding − general setting:
Given a real function f ∈ C[a,b],
we want to approximate a solution z ∈ [a,b] verifying f(z) = 0.
The solutions z are also called roots or
zeros of the function f.
− Iterative methods consist in finding a sequence
(x_{n}) that converges to the solution z.
This sequence is constructed in a recursive (or iterative) way,
starting with some initial values and calculating x_{n+1}
from the previous x_{n} (or several previous iterates x_{n}, x_{n−1}, ...).

We note that once we establish the convergence of the sequence (x_{n}),
we have not only an approximation in each term x_{n}, but also an
exact representative of the real number through the convergent
sequence of rational numbers (if the x_{n} are real, we can always consider
a nearby rational number that does not interfere with the convergence
− the usual way is by rounding the decimals).

The methods presented in this chapter are mainly meant to find the roots
of an equation with a single real variable, but they may be used in
more general contexts − systems, functional or differential equations
(bisection is mostly restricted to ordered structures;
the fixed point, Newton and secant methods have been generalized to
equations in complex numbers, matrices, operators, ...).
We start by reviewing some basic Calculus results that ensure
uniqueness and existence of solution in an interval (these are non
constructive results).

Root localization
A simple graph inspection may give us an approximate idea
of the root localization (the points where the x axis is crossed).
However, to rigorously ensure that one and only one root exists
in an interval, we recall some elementary theorems:

Theorem (Bolzano's (intermediate value) theorem : Existence ):
Let f ∈ C[a,b] : f(a) f(b) ≤ 0. Then ∃ z ∈ [a,b] : f(z) = 0.

We will use this theorem with the computational notation:
BolzanoTheorem[f ∈ C[a,b] , f(a) f(b) ≤ 0] :: z ∈ [a,b] : f(z) = 0
The intermediate value theorem only ensures existence;
to ensure uniqueness we may use:

Theorem (Rolle's theorem : Uniqueness) :
Let f ∈ C^{1}[a,b]. Then f(a) = f(b) implies ∃ ξ ∈ [a,b] : f '(ξ) = 0.
Thus, if f ' ≠ 0 then f is injective, and there is
at most one zero of f in the interval.

Lagrange's mean value theorem is a simple consequence of Rolle's theorem
and leads to an a posteriori error estimate for any approximation.

Theorem (Lagrange's mean value theorem):
Let f ∈ C^{1}[a,b]. Then
∃ ξ ∈ [a,b] : f(a) − f(b) = f '(ξ)(a − b)
Corollary (Lagrange error estimate − a posteriori):
Let f ∈ C^{1}[a,b] with f '(x) ≠ 0, ∀ x ∈ [a,b].
If z^{~} is an approximation of the root z ∈ [a,b], then
|z − z^{~}| ≤ |f(z^{~})| / min_{x∈[a,b]} |f '(x)|




proof of the theorem:
Take g(x) = (b − a)f(x) − (f(b) − f(a))x.
We have g(a) = b f(a) − a f(b) = g(b), so by Rolle's theorem there exists
ξ ∈ [a,b] such that g'(ξ) = (b − a)f '(ξ) − (f(b) − f(a)) = 0,
which gives the result.
proof of the corollary:
Since z, z^{~} ∈ [a,b], there exists ξ ∈ [z; z^{~}] ⊆ [a,b]:
f(z^{~}) − f(z) = f '(ξ)(z^{~} − z)
and as f(z) = 0, we have |z^{~} − z| = |f(z^{~})| / |f '(ξ)|.
Thus
|z − z^{~}| ≤ |f(z^{~})| / min_{x∈[a,b]} |f '(x)|.

These theorems do not give any method to approximate the root.
Nevertheless we may use the Intermediate value theorem to develop
a very simple iterative method.

Bisection Method
Assume that in the interval [a, b] the equation f(x) = 0
has one and only one root (which may be verified with the help of
the first two theorems above).
We generate intervals [a_{n}, b_{n}], each half the size of the
previous one [a_{n−1}, b_{n−1}], where we ensure the
existence of the root using the intermediate value theorem.
The method is set by this algorithm:
− Initial interval : [a_{0}, b_{0}] = [a, b]
− Iterate − Step n = 1, 2, ... :
− x_{n} = ½ (a_{n−1} + b_{n−1})
− If f(x_{n}) f(a_{n−1}) < 0 then a_{n} = a_{n−1} ; b_{n} = x_{n} ;
− If f(x_{n}) f(a_{n−1}) > 0 then a_{n} = x_{n} ; b_{n} = b_{n−1} ;
− If f(x_{n}) = 0, or some other stopping criterion holds, the method stops.
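The algorithm above can be sketched as follows (a minimal version, assuming f is continuous with f(a) f(b) < 0; the stopping tolerance eps and iteration cap are illustrative choices):

```python
# Sketch of the bisection method described in the text.
import math

def bisection(f, a, b, eps=1e-12, max_iter=200):
    fa = f(a)
    for _ in range(max_iter):
        x = 0.5 * (a + b)        # x_n: midpoint of the current interval
        fx = f(x)
        if fx == 0 or (b - a) <= eps:
            return x
        if fx * fa < 0:
            b = x                # root in [a, x]
        else:
            a, fa = x, fx        # root in [x, b]
    return 0.5 * (a + b)

# example: the root of x - cos(x) in [0, 1]
root = bisection(lambda x: x - math.cos(x), 0.0, 1.0)
```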
At the end of step n we have the approximation x_{n}
and the corresponding approximation error e_{n} = z − x_{n}.
We can always obtain an a posteriori error estimate using
the previous corollary,
|e_{n}| ≤ |f(x_{n})| / min_{x∈[a_{n−1}, b_{n−1}]} |f '(x)|,
but we may also deduce a priori error estimates.
− A priori estimate in the bisection method.
As x_{n} is the middle point, the new interval [a_{n}, b_{n}]
is half the size of the previous one [a_{n−1}, b_{n−1}], meaning
|a_{n} − b_{n}| = ½ |a_{n−1} − b_{n−1}|
and this leads to
|a_{n} − b_{n}| = (½)^{n} |a_{0} − b_{0}| = 2^{−n} |a − b|.
On the other hand, z ∈ [a_{n}, b_{n}], with x_{n} = a_{n} or x_{n} = b_{n}, so
|e_{n}| = |z − x_{n}| ≤ |a_{n} − b_{n}| = 2^{−n} |a − b|,
which is an a priori estimate (we may bound the absolute error
at step n without performing a single iteration).
− If we want |e_{n}| ≤ ε it is enough to demand
2^{−n} |a − b| ≤ ε
⇔ |a − b| / ε ≤ 2^{n}
⇔ n ≥ log_{2}(|a − b| / ε)
using the a priori estimate.
We conclude, for instance, that for an interval with |a − b| = 1,
double precision is ensured after a little more than fifty iterations,
because to have ε < 2^{−56} ≈ 1.4 × 10^{−17}
it is enough that n ≥ 56.
The bisection method may be seen as the last method to
try, when all the others fail. We will study other methods, like
Newton's method, that converge in fewer iterations
(usually 5 or 6 iterations are enough to attain double precision accuracy).

Fixed point method
The fixed point method starts by considering an equivalence
f(x) = 0 ⇔ x = g(x),
where the zero z of the function f is the fixed point
of another function g
(called the iterative function).
Then, the method reduces to the following iteration:
− Given any initial x_{0} ;
− Iterate x_{n+1} = g(x_{n}).

Examples:

In a calculator machine, starting with 0 (or any other value), if we
repeatedly press the COS key, we obtain a sequence of values (in radians)
0; 1.0 ; 0.54… ; 0.857… ; ... ; 0.737… ; ... ; 0.73907… ; ... ; 0.7390851… ; etc...
After 50 presses, meaning 50 iterations, we see that the values of the sequence
are getting stable and the first digits remain the same.

In this process we are applying the fixed point method, and the iterative
function is the cosine function.
Starting with x_{0} = 0 we repeatedly take x_{n+1} = cos(x_{n}).

When it converges (as in this case), x_{n} → z, and
because the cosine is a continuous function, in the limit
x_{n+1} = cos(x_{n}) → z = cos(z).
We conclude that the limit value z = 0.7390851…
is the solution of the equation z = cos(z).
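The calculator experiment can be reproduced with a short sketch (50 iterations of the cosine, as in the text):

```python
# Fixed point iteration x_{n+1} = cos(x_n), starting from x_0 = 0.
import math

x = 0.0
for _ in range(50):
    x = math.cos(x)
print(x)   # ≈ 0.739085...
```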

Let f(x) = x − cos(x); then
f(x) = 0 ⇔ x = g(x) = cos(x).
Applying the previous theorems, we see that in the interval [0,1] we have
f(0) = 0 − cos(0) = −1 < 0, f(1) = 1 − cos(1) > 0
and we ensure the existence of a solution. The uniqueness results from
f '(x) = 1 + sin(x) ≥ 0, which is only null for x = −π/2 + 2kπ ∉ [0,1].

Furthermore, as
min_{x∈[0,1]} f '(x) = min_{x∈[0,1]} (1 + sin(x)) = 1
the a posteriori estimate allows us to conclude that the absolute error
of the presented value is very small:
|z − 0.7390851…| ≤ |0.7390851… − cos(0.7390851…)| / 1
= 0.55… × 10^{−7}
 Other simple examples may occur by pressing other calculator keys.
Using the SQRT (square root) key, we have g(x)=√ x
and the sequence x_{n+1}=√ x_{n} converges to a solution of
z=√ z, which is also a solution of z=z^{2},
meaning z=1. In this case there is another solution z=0 but the method
never converges to it, unless you start with it.
Conversely, if you have a SQR (square) key you are considering g(x) = x^{2}
and the sequence x_{n+1} = x_{n}^{2} converges to a solution of z = z^{2},
which is z = 0, when you start with |x_{0}| < 1. If we start with |x_{0}| > 1
the sequence goes to infinity. The iteration only reaches the other solution
z = 1 when we start exactly with x_{0} = ±1.
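These two "calculator key" behaviors can be checked numerically (an illustrative sketch; the helper iterate is ours):

```python
import math

def iterate(g, x0, n):
    """Apply x_{n+1} = g(x_n) n times from x0."""
    x = x0
    for _ in range(n):
        x = g(x)
    return x

# SQRT key: x_{n+1} = sqrt(x_n) tends to the fixed point z = 1
# for any x0 > 0 (starting exactly at 0 stays at the other solution z = 0).
x = iterate(math.sqrt, 4.0, 60)

# SQR key: x_{n+1} = x_n^2 tends to z = 0 when |x0| < 1.
y = iterate(lambda t: t * t, 0.9, 20)
```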

From these examples we saw different situations, leading to convergence and
non-convergence, which we will now analyze.
Generally, the "key" that you repeatedly press is the iterative function g.
The value with which you start is the initial iterate, and the sequence
of iterations is obtained by "pressing the key". We may get a rough idea of
convergence by checking that the decimal digits are stabilizing, but this
may be misleading in some unclear situations.
Remark: Whether there is convergence or not, at each iterate
you may evaluate its accuracy using the a posteriori estimate
with f(x) = x − g(x) and z̃ = x_{n}:
|z − x_{n}| ≤ |x_{n} − g(x_{n})| / min_{x∈[a,b]} |1 − g'(x)|
≤ |x_{n+1} − x_{n}| / (1 − max_{x∈[a,b]} |g'(x)|)
with |g'(x)| < 1, a condition that will define the convergence.
Based on general plots for different iterative functions g
that we present in Fig. 1, we follow the iterations for convergence
and non convergence ("divergence") situations.
Figure 1: Fixed Point Method − different cases.
Convergence situations are presented in the plots above
(monotone convergence on the left, alternate convergence on the right).
The fixed point acts like an attractor for the iterations. On the other hand,
in the two plots below the fixed point acts like a repeller:
even if we start near the fixed point, the iterations move away from it.

Definition:
We say that g is a lipschitzian function in [a, b]
if there exists L > 0 such that :
|g(x) − g(y)| ≤ L |x − y|, ∀ x, y ∈ [a, b]
The function is called a Contraction if the Lipschitz constant
is less than one, meaning L < 1.

Note that lipschitzian functions are continuous
on the interval
(x→ y implies immediately g(x)→ g(y)).
When the function has derivatives we relate this notion
to the bounds of the derivative:

Proposition:
Let g ∈ C^{1}[a, b] such that |g '(x)| ≤ L, ∀ x ∈ [a, b],
then the function g is lipschitzian in that interval, and when L < 1
it is a contraction.


proof:
Using Lagrange's theorem, for any x, y in [a,b],
there exists ξ ∈ ]x; y[ ⊂ [a, b] such that
|g(x) − g(y)| = |g'(ξ)| |x − y| ≤ L |x − y|.

With these notions we may now present a fixed point theorem
(here only for an interval).

Theorem (fixed point in a bounded interval).
Let g be an iterative function in [a, b] such that:
 g is contractive in [a, b],
 g is invariant in [a, b], meaning g([a, b]) ⊆ [a, b] ,
then
 g has one and only one fixed point z ∈ [a,b] ;
 for any x_{0} ∈ [a, b],
the fixed point sequence defined by x_{n+1} = g(x_{n})
converges to z;
 we have the following error estimates:
 (a posteriori)
 (i) |e_{n}| ≤ L |e_{n−1}|
 (ii) |e_{n}| ≤ L/(1−L) |x_{n} − x_{n−1}|
 (a priori)
 (iii) |e_{n}| ≤ L^{n} |e_{0}|
 (iv) |e_{n}| ≤ L^{n} (1−L)^{−1} |x_{1} − x_{0}|
 Here L < 1 is the contractivity constant.
When g ∈ C^{1}[a,b], L = max_{x∈[a,b]} |g'(x)| < 1.

proof:
 Existence (fixed point in an interval):
it may be seen as a simple consequence of the intermediate value theorem.
Take f(x) = g(x) − x; the invariance hypothesis implies
a ≤ g(a), g(b) ≤ b,
so f(a) ≥ 0 and f(b) ≤ 0. Therefore f(a)f(b) ≤ 0 and there exists
z ∈ [a,b]: f(z) = 0 ⇔ z = g(z).
 Uniqueness (fixed point in an interval):
Suppose z, w ∈ [a,b] are both fixed points, z = g(z) and w = g(w).
Using the contraction hypothesis
|z − w| = |g(z) − g(w)| ≤ L |z − w| ⇔ (1−L)|z − w| ≤ 0
and as L < 1 we have |z − w| ≤ 0, that is, z = w.
 Convergence and estimates:
Using the contraction hypothesis,
|e_{n}| = |z − x_{n}| = |g(z) − g(x_{n−1})| ≤ L |z − x_{n−1}| = L |e_{n−1}|
we obtain the first estimate (i) which, applied repeatedly, gives (iii).
As L < 1 this estimate implies |e_{n}| → 0
(note that |e_{0}| ≤ b − a),
and we conclude the convergence.
Furthermore, (ii) results from
|e_{n}| ≤ L |e_{n−1}| = L |z − x_{n} + x_{n} − x_{n−1}| ≤ L |e_{n}| + L |x_{n} − x_{n−1}|
thus (1 − L)|e_{n}| ≤ L |x_{n} − x_{n−1}| implies (ii).
Finally, (ii) with n = 1 gives |e_{1}| ≤ L/(1 − L) |x_{1} − x_{0}|,
and (iv) results from
|e_{n}| ≤ L^{n−1} |e_{1}|.


Proposition:
Under the hypothesis of the fixed point theorem, we have:
 (a) If 0 < g'(x) < 1, ∀ x ∈ [a, b] then we have monotone convergence
(all iterates stay on the same side of z: x_{n} > z or x_{n} < z);
 (b) If −1 < g'(x) < 0, ∀ x ∈ [a, b] then we have alternate convergence
(iterates change side at each step: e_{n} e_{n+1} < 0).


proof:
As
e_{n} = g(z) − g(x_{n−1}) = g '(ξ_{n}) e_{n−1}
and as the sign of g' is constant, by hypothesis, we have
sign(e_{n}) = sign(g '(ξ_{n})) sign(e_{n−1})
and obtain
in case (a): sign(e_{n}) = + sign(e_{n−1}) = ... = sign(e_{0}),
in case (b): sign(e_{n}) = − sign(e_{n−1}) = ... = (−1)^{n} sign(e_{0}).

Remark: When alternate convergence is ensured, we have
z ∈ [x_{n−1}; x_{n}]. This allows us to define new intervals containing
the fixed point, and we may update the value of L.
For instance, assuming that −1 < g' < 0 and g' is monotone,
after p iterations we define a new
L_{p} = max { |g '(x_{p−1})|, |g '(x_{p})| }
obtaining the following estimates with these new values.
A possibility for an updated a priori estimate is
|e_{p+k}| ≤ (L_{p})^{k} |x_{p−1} − x_{p}|


Example:
A method to solve quadratic equations x^{2} + bx − a = 0
consists in establishing the equivalence (for x ≠ −b)
x^{2} + bx − a = 0 ⇔ x(x+b) = a ⇔ x = a/(x+b)
where the iterative function is g(x) = a/(x+b).
This leads to a sequence known as a continued fraction
x_{n+1} = a/(b+x_{n}) = a/(b + a/(b + a/(b + ...)))
that also allows to approximate square roots by fractions in
an optimal way (taking into account the size of the values).
For instance, for a = b = 1, starting with x_{0} = 1 we get
x_{1}=1/2, x_{2}=2/3, ..., x_{6}=13/21, ..., x_{13}=377/610, x_{14}=610/987, ...
and the value x_{14} = 610/987 = 0.6180344… is a simple fraction
giving a good approximation of the exact value z = (−1+√5)/2 = 0.6180339887…
We may check that in the interval I = [1/2, 1] the fixed point theorem hypotheses
are verified:
 contractivity: g'(x) = −(1+x)^{−2} is increasing (g'' > 0)
and negative, therefore
L = max_{x∈I} |g'(x)| = max{ |g'(1/2)|, |g'(1)| } = max{ 4/9, 1/4 } = 4/9 < 1
 invariance: g is decreasing (g' < 0) and we have
g(1/2) = 2/3 ∈ [1/2,1], g(1) = 1/2 ∈ [1/2,1]
This justifies the convergence, which is alternate (g' < 0), implying
that z ∈ [377/610, 610/987]; furthermore, using the
a posteriori estimates (i) and (ii) we have
|e_{14}| ≤ L/(1−L) |x_{14} − x_{13}|
= 4/5 × 0.166…×10^{−5} < 0.133×10^{−5}
As g' is increasing and negative, considering the reduced interval,
L_{14} = max { |g'(x_{13})|, |g'(x_{14})| } = 0.381966…
and we have a sharper estimate for the following iterations:
|e_{14+k}| ≤ 0.382^{k} (0.133×10^{−5})
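The continued fraction iterates can be generated exactly with rational arithmetic (a sketch using Python's standard Fraction type):

```python
from fractions import Fraction

# Continued fraction iteration x_{n+1} = a/(b + x_n) with a = b = 1,
# starting from x_0 = 1; the iterates are ratios of consecutive
# Fibonacci numbers, converging alternately to z = (sqrt(5) - 1)/2.
a, b = 1, 1
x = Fraction(1)
iterates = []
for _ in range(14):
    x = Fraction(a) / (b + x)
    iterates.append(x)

# iterates[5] is x_6 = 13/21 and iterates[13] is x_14 = 610/987.
golden_conjugate = (5 ** 0.5 - 1) / 2
```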

Theorem (Non convergence or "divergence"):
Let g∈ C^{1}(V_{z}) where V_{z} is a neighborhood
of the fixed point z.
If |g '(z)| > 1, the fixed point sequence never converges to
z, except if x_{p} = z at some step p.


proof:
We saw, using Lagrange's theorem, that
e_{n+1} = g '(ξ_{n}) e_{n}
with ξ_{n} ∈ [x_{n}; z] ⊂ V_{z}, considering x_{n}
sufficiently close to z. Since |g '(z)| > 1, then
in that neighborhood we also have |g '(ξ)| > 1 (because g ∈ C^{1}(V_{z})).
Since e_{n} ≠ 0 (as x_{n} ≠ z),
|e_{n+1}| = |g '(ξ_{n})| |e_{n}| > |e_{n}|
and the error increases, making convergence impossible.

This is a local result that allows to conclude
non-convergence as long as we show that the derivative
(in modulus) is greater than 1 in an interval that contains
the fixed point.
Conversely, we may establish a local convergence result.
This result is slightly different from the fixed point theorem,
as it does not demand invariance, but we need to have an
a priori location of the fixed point, as the initial iterate must
be sufficiently close to it.

Theorem (Local convergence):
Let g∈ C^{1}(V_{z}) where V_{z} is a neighborhood
of the fixed point z.
If |g '(z)| < 1, the fixed point sequence converges to
z, when x_{0} is sufficiently close to z.


proof:
This result may be reduced to the fixed point theorem considering
with ε small enough to have a contraction,
g'(x) < 1, ∀ x∈ W_{z}⊂ V_{z} , 
as g'(z) < 1 and g∈ C^{1}(V_{z}).
Therefore, if x∈ W_{z} we have
g(x) − z = g '(ξ) x − z  ≤ ε ⇐ g(x)∈ W_{z} 
and invariance is shown. Thus, we conclude the convergence
by the fixed point theorem, as long as x_{0}∈ W_{z}, meaning,
for initial iterates sufficiently close to z.

Within convergence cases, there are sequences that converge
more rapidly than others. Thus, it is appropriate to distinguish
the order of convergence.

Definition:
Let (x_{n}) be a sequence that converges to z.
We say that (x_{n}) has order of convergence p, with
asymptotic coefficient K_{p} ≠ 0, if the following limit exists:
lim_{n→∞} |e_{n+1}| / |e_{n}|^{p} = K_{p} ≠ 0

Note that K_{1} ≤ 1, because if K_{1} > 1 no convergence
occurs: |e_{n+1}| > |e_{n}| (the error increases).
We distinguish 3 main classes of convergence:
 Superlinear (or Exponential): when p>1 (or p=1 and K_{1} = 0).
 Linear: p=1 and 0 < K_{1} < 1
 Sublinear (or Logarithmic): p=1 and K_{1} = 1
There are other names as quadratic convergence
when p=2, cubic convergence when p=3
(these convergence orders are different from the ones
given in the context of approximation, as in quadrature rules).
Note that if the limit is K_{p} = 0 this means that the convergence
is at least of order p, and one should then look for a higher p
to obtain the correct order of convergence.

Examples:
We present some examples that are not necessarily connected to
the fixed point method.
(i) Let x_{n} = α^{n} with 0 < α < 1.
This sequence converges to z = 0, and we have
K_{1} = lim_{n→∞} |e_{n+1}| / |e_{n}| = lim_{n→∞} α^{n+1} / α^{n} = α ≠ 0
meaning a linear convergence with K_{1} = α.
(ii) Let x_{n} = α^{β^n} with 0 < α < 1 < β
[note that x_{n} = α^(β^n) is different from (α^β)^n = α^(β n)].
This sequence also converges to z = 0, but we have
K_{1} = lim_{n→∞} |e_{n+1}| / |e_{n}| =
lim_{n→∞} α^{β^(n+1)} / α^{β^n} =
lim_{n→∞} (α^{β^n})^{β} / α^{β^n} =
lim_{n→∞} (α^{β^n})^{β−1} = 0
and we have superlinear convergence (K_{1} = 0). To identify
the exact order, we check
K_{β} = lim_{n→∞} |e_{n+1}| / |e_{n}|^{β} =
lim_{n→∞} (α^{β^n})^{β−β} = 1 ≠ 0
concluding that the order of convergence is β > 1.
(iii) Finally, a logarithmic convergence example.
Take x_{n} = n^{−b} with b > 0, which converges to z = 0. As
K_{1} = lim_{n→∞} |e_{n+1}| / |e_{n}| =
lim_{n→∞} (n+1)^{−b} / n^{−b} =
lim_{n→∞} (1+1/n)^{−b} = 1
we conclude the sublinear convergence.
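The behaviors above can be checked numerically: for small errors, |e_{n+1}| ≈ K |e_{n}|^{p} suggests the heuristic p ≈ log|e_{n+1}| / log|e_{n}| (a rough sketch; estimated_order is our own helper, and the heuristic ignores the contribution of K):

```python
import math

def estimated_order(errors):
    """Estimate p from two consecutive small errors, using
    |e_{n+1}| ~ K |e_n|^p  =>  p ~ log|e_{n+1}| / log|e_n|."""
    return math.log(errors[-1]) / math.log(errors[-2])

# (i) linear sequence e_n = alpha^n (alpha = 0.5): p -> 1
linear_errors = [0.5 ** n for n in (50, 51)]
# (ii) superlinear sequence e_n = alpha^(beta^n) (beta = 2): p -> 2
super_errors = [0.5 ** (2 ** n) for n in (5, 6)]
```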


Theorem (Fixed point method − order of convergence):
Let g ∈ C^{p}(V_{z}) where V_{z} is a neighborhood of the fixed point z.
If g^{(p)}(z) ≠ 0, and g'(z) = ... = g^{(p−1)}(z) = 0,
then the (local) order of convergence of the fixed point sequence is p,
and the asymptotic coefficient is given by
K_{p} = |g^{(p)}(z)| / p!
(note that for p = 1 this requires K_{1} = |g'(z)| < 1 to have local convergence).


proof:
The local convergence was ensured in the previous theorem, since K_{1} = |g'(z)| < 1.
Using the Taylor expansion with Lagrange remainder,
g(x_{n}) = g(z) + g^{(p)}(ξ_{n})/p! (x_{n} − z)^{p}, with ξ_{n} ∈ [x_{n}; z],
because g'(z) = ... = g^{(p−1)}(z) = 0.
On the other hand, x_{n+1} = g(x_{n}) and z = g(z), so
|e_{n+1}| = |g^{(p)}(ξ_{n})|/p! |e_{n}|^{p}.
Thus, substituting in the limit,
K_{p} = lim_{n→∞} |e_{n+1}| / |e_{n}|^{p} =
lim_{n→∞} |g^{(p)}(ξ_{n})|/p! = |g^{(p)}(z)| / p! ≠ 0
we conclude that the order of convergence is p.

Remark:
(i) It is an immediate consequence of the definition of order of convergence
that if a method has order p > 1, the method grouping two
iterations into one will have order of convergence p^{2}.
Therefore, obtaining superlinear convergence is the qualitative difference,
while reaching higher orders of convergence is a secondary issue
(despite some emphasis in the usual literature).
Furthermore, if the convergence is just linear, a new method grouping two
iterations will just reduce the asymptotic coefficient from K_{1} to K_{1}^{2},
keeping the convergence linear.
Thus, the asymptotic coefficient plays for linearly convergent methods
a role similar to that of the order of convergence for superlinear methods.
(ii) Another possibility to define order of convergence p
consists in having a constant K_{p} > 0 such that
|e_{n+1}| ≤ K_{p} |e_{n}|^{p} (for n sufficiently large)
(when p = 1 we must have K_{1} < 1), and it is not necessary that
the limit exists.
Furthermore, knowing the order of convergence, we may establish
a relation between the difference of iterates and the absolute error:
|e_{n}| = |x_{n+1} − x_{n}| / |1 − e_{n+1}/e_{n}|
≤ |x_{n+1} − x_{n}| / (1 − K_{p} |e_{n}|^{p−1})
as long as K_{p} |e_{n}|^{p−1} < 1. We conclude that
|e_{n}| ≤ K_{p} |e_{n−1}|^{p−1} / (1 − K_{p} |e_{n−1}|^{p−1}) |x_{n} − x_{n−1}|
... this expression was already known for p = 1
(the condition K_{1} < 1 is necessary for linear convergence).
Based on the difference of iterates we may define
a stopping criterion, noticing that
|e_{n}| ≤ L_{n}/(1 − L_{n}) |x_{n} − x_{n−1}|, where L_{n} = K_{p} |e_{n−1}|^{p−1} < 1
(this will happen for some n, as e_{n} → 0).

Newton Method 
The previous theorem allows to conclude superlinear convergence
if the iterative function has a null derivative at the fixed point.
Thus we are led to search for methods with this property.
When f '(x) ≠ 0 we may establish an equivalence
f(x) = 0 ⇔ x = x − f(x)/f '(x)
and define the iterative function
g(x) = x − f(x)/f '(x).
For the root z, as f(z) = 0, we get
g'(x) = f(x) f ''(x)/(f '(x))^{2} ⇒ g'(z) = 0.
Therefore, according to the previous theorem, in this case
we have at least quadratic convergence
(it might be cubic, if f '(z) ≠ 0, f ''(z) = 0).
Newton's Method consists in applying the fixed point
method to this g, meaning:
 Given an initial iterate x_{0};
 Repeat
x_{n+1} = x_{n} − f(x_{n})/f '(x_{n}).
As we saw, excluding the case f '(z) = 0,
we ensure at least quadratic convergence.
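The iteration above can be sketched as follows (an illustrative implementation; the stopping rule on the step size is our choice, and there is no safeguard against f '(x_{n}) = 0):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{n+1} = x_n - f(x_n)/f'(x_n).
    A bare sketch: stops on small steps, no safeguard for df(x) = 0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root of f(x) = x^2 - 2 (that is, sqrt(2)), starting at x0 = 1:
root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

With quadratic convergence, machine precision is reached in about 6 iterations here.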
Remarks:
(1) Newton's Method also has a geometrical interpretation
that is related to its original deduction (the method is also
called the Newton-Raphson Method, or Tangent Method),
as presented in Fig. 2.
Figure 2: Iterations in Newton's Method.
Starting with x_{0}=1, the new point x_{1} results from the
intersection of the tangent line with the axis y=0, and so on.
The geometrical deduction consists in using the tangent line
y = f(x_{n}) + f '(x_{n})(x − x_{n})
as an approximation of f at x_{n} (it is the first order
Taylor approximation).
Solving for y = 0 we obtain the same expression
0 = f(x_{n}) + f '(x_{n})(x − x_{n}) ⇔ x = x_{n} − f(x_{n})/f '(x_{n})
and this solution x is the new approximation x_{n+1}.
(2) We may be led to deduce higher order approximations, increasing the
order of the Taylor expansion. This leads to a higher order of convergence,
but also to more complex expressions that are computationally less
efficient than grouping iterations (see the previous remarks).
In fact, n iterations of Newton's Method are usually more efficient
than a single iteration of some other method of order 2n.

Examples:
We present some examples that show the performance of Newton Method.
(i) Calculate z = ^{p}√a using only arithmetical operations.
Note that here z is the solution of f(x) = x^{p} − a = 0.
According to the local convergence theorem, we take a
sufficiently close x_{0}, and the method reduces to
x_{n+1} = x_{n} − (x_{n}^{p} − a)/(p x_{n}^{p−1})
= x_{n}(1 − 1/p) + (a/p) x_{n}^{1−p}
and it is even simpler to extract square roots (p = 2):
x_{n+1} = x_{n}/2 + a/(2 x_{n})
(this is useful in binary floating point systems, as
divisions/products by 2 are trivial).
(ii) Consider for instance the calculation of z= ^{3}√ 6:
x_{0} = 2, x_{1} = 11/6, x_{2} = 1.817263…,
x_{3} = 1.81712060…,
x_{4} = 1.817120592832139… 
in this 4^{th} iterate all the presented digits are correct.
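This computation can be reproduced with the arithmetic-only iteration above (a sketch; nth_root is our own helper name):

```python
def nth_root(a, p, x0, n_iter=8):
    """Newton for f(x) = x^p - a, written with arithmetic only:
    x_{n+1} = x_n (1 - 1/p) + (a/p) x_n^(1 - p)."""
    x = x0
    for _ in range(n_iter):
        x = x * (1 - 1 / p) + (a / p) * x ** (1 - p)
    return x

# Cube root of 6 from x0 = 2, as in the text: x_1 = 11/6 = 1.8333...
r = nth_root(6.0, 3, 2.0)
```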
(iii) Recovering the continued fraction example,
the application of Newton's Method to f(x) = x^{2} + bx − a = 0
leads to calculations that also involve simple fractions
x_{n+1} = x_{n} − ((x_{n}+b)x_{n} − a)/(2x_{n}+b)
= (x_{n}^{2}+a)/(2x_{n}+b)
obtaining, for x_{0} = 1,
x_{1}=2/3, x_{2}=13/21, x_{3}=610/987, x_{4}=1346269/2178309, ...
Note that these are values that we also obtained in the
continued fraction, but reached much faster
(here the 2^{nd} iterate is the 6^{th} of the continued fraction,
the 3^{rd} is the previous 14^{th}, and the 4^{th} is the previous 30^{th}).
The fraction x_{4} presents an error below 10^{−13}.

Again, we are also interested in establishing sufficient conditions that
ensure a convergence that is not only local (for a "sufficiently close initial
iterate"), but that is valid for a set of initial iterates in an interval.

Theorem (Convergence of Newton's Method − sufficient conditions on an interval):
Let f ∈ C^{2}[a,b] be such that in the interval (x ∈ [a, b]):
 (1) f(a) f(b) ≤ 0;
 (2) f '(x) > 0 or f '(x) < 0;
 (3) f ''(x) ≥ 0 or f ''(x) ≤ 0;
Then Newton's Method has at least quadratic convergence
to the unique solution
z ∈ [a,b], initializing the method:
 (a) with x_{0} ∈ [a,b] : f(x_{0}) f ''(x) ≥ 0;
 (b) with any x_{0} ∈ [a,b], if
|f(a)/f '(a)|, |f(b)/f '(b)| ≤ b − a

proof:
A non-constructive proof is simple but not relevant here; we prefer to
discuss the conditions:
 Note that conditions (1) and (2) imply existence and uniqueness.
 Condition (3) is important so that the convexity does not change,
which could lead to a cyclic, non-convergent situation:
− For instance, if f(x) = x^{3} − 5x, in [−1,1] there is a single solution
(z = 0) and conditions (1) and (2) are verified, but not
condition (3), since f ''(x) = 6x changes sign in [−1,1].
As a consequence we see that
x_{0} = 1, x_{1} = 1 − (1−5)/(3−5) = −1, x_{2} = 1, ..., x_{k} = (−1)^{k},
and the sequence defined by Newton's Method does not converge!
 Finally, note that condition (a) is always verified at one of the interval
endpoints, but to ensure convergence for x_{0} at any point of the interval,
condition (b) guarantees that the iterates do not fall outside the interval...
for instance, if x_{0} = a,
x_{1} = a − f(a)/f '(a) ⇒ |x_{1} − a| = |f(a)/f '(a)| ≤ b − a.
Finally, the convergence is monotone for initial iterates verifying (a).
Condition (b) has only to be verified for the endpoint that does not verify (a);
then x_{1} is a starting point verifying (a), with monotone
convergence for the next iterations.
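The cyclic behavior of this counterexample is easy to reproduce (a sketch; the arithmetic is exact in floating point, since only small integers are involved):

```python
def newton_step(x):
    # One Newton step for f(x) = x^3 - 5x  (f'(x) = 3x^2 - 5).
    return x - (x ** 3 - 5 * x) / (3 * x ** 2 - 5)

# Starting at x0 = 1 the iterates cycle between 1 and -1 forever:
xs = [1.0]
for _ in range(6):
    xs.append(newton_step(xs[-1]))
```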

Theorem (Newton's Method − Error Estimates):
Let f ∈ C^{2}[a,b] and assume the previous quadratic convergence conditions.
We have the following error formula:
∃ ξ_{n} ∈ [x_{n}; z] :
e_{n+1} = − f ''(ξ_{n}) / (2 f '(x_{n})) (e_{n})^{2}
which leads to the asymptotic coefficient
K_{2} = |f ''(z)| / (2 |f '(z)|)
and to the estimates:
 (a posteriori)
|e_{n}| ≤ max_{x∈[a,b]} |f ''(x)| / (2 |f '(x_{n−1})|) |e_{n−1}|^{2}
 (a priori)
M |e_{n}| ≤ (M |e_{0}|)^{2^n}
where
M = max_{x∈[a,b]} |f ''(x)| / (2 min_{x∈[a,b]} |f '(x)|)


proof:
− The Error Formula results from the Taylor expansion
0 = f(z) = f(x_{n}) + f '(x_{n})(z − x_{n}) + ½ f ''(ξ_{n})(e_{n})^{2};
dividing by f '(x_{n}) ≠ 0 we obtain
0 = f(x_{n})/f '(x_{n}) + z − x_{n} + ½ f ''(ξ_{n})/f '(x_{n}) (e_{n})^{2}
and as f(x_{n})/f '(x_{n}) − x_{n} = −x_{n+1} we get
0 = z − x_{n+1} + ½ f ''(ξ_{n})/f '(x_{n}) (e_{n})^{2}
with ξ_{n} ∈ [x_{n}; z] and x_{n} → z. We get K_{2} by
dividing e_{n+1} by e_{n}^{2} and taking the limit. Finally:
− the first estimate is immediate, since ξ_{n} ∈ [x_{n}; z] ⊂ [a,b];
− the second estimate is also immediate, as
M |e_{n}| ≤ (M |e_{n−1}|)^{2},
applying it repeatedly.

Remarks:
(i) When f '(z) = 0 we are in the situation of a double root (at least),
and Newton's Method loses quadratic convergence; a slight modification
recovers it (e.g. [CA]).
(ii) When f ''(z) = 0 (and f '(z) ≠ 0) we obtain cubic convergence.
(iii) As an empirical rule, quadratic convergence means doubling
the number of significant digits at each iterate (... and cubic convergence
means tripling it). This means that, near the root, double precision accuracy
is achieved in less than 4 or 5 iterations; the important issue is to get near
the root in the first place.
(iv)
We stress this point, as it follows from the error estimates that to have
an acceptable a priori estimate the value M|e_{0}| should be less than 1,
and this means that the initial iterate should not be far from the solution
(especially if f ' is small or f '' is large).

Example:
Consider f(x) = e^{−x} − log(x) = 0, which verifies in the interval [1,2]:
 (1) f(1) = e^{−1} > 0, f(2) = e^{−2} − log(2) < 0;
 (2) f '(x) = −e^{−x} − 1/x < 0 (for x ∈ [1,2]);
 (3) f ''(x) = e^{−x} + 1/x^{2} > 0 (for x ∈ [1,2]);
so the sufficient conditions for Newton's Method convergence are verified,
at least for initial iterates such that f(x_{0}) > 0
[conclusion (a) from the Theorem], as is the case for x_{0} = 1.
Since |f(1)/f '(1)| = 0.2689…, |f(2)/f '(2)| = 0.87798… ≤ 2 − 1 = 1,
following conclusion (b), we have convergence for any x_{0} ∈ [1,2].
Starting with x_{0} = 1 we get
x_{1} = 1.2689…, x_{2} = 1.309108…, x_{3} = 1.3097993887…
As |f(x_{3})| ≤ 2.04×10^{−7} and min_{[1,2]} |f '| = |f '(2)| > 0.635,
the general a posteriori estimate gives us
|e_{3}| ≤ 2.04×10^{−7} / 0.635 < 3.21×10^{−7}
and we see that x_{3} already presents 7 correct digits.
Using this information we may improve the next estimates.
The interval is now small enough to consider M ≈ K_{2} ≈ 0.413,
therefore
0.412 |e_{k+3}| ≤ (M|e_{3}|)^{2^k} ≤ (1.325×10^{−7})^{2^k}
meaning that x_{5} = 1.30979958580415… has the error bound
|e_{5}| ≤ 0.412^{−1}(1.325×10^{−7})^{4} < 10^{−27},
making further double precision iterations useless, as only
the rounding error will remain.
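The iterations and the general a posteriori bound of this example can be checked numerically (a sketch; the bound uses min_{[1,2]} |f '| > 0.635 as in the text):

```python
import math

def f(x):
    return math.exp(-x) - math.log(x)

def df(x):
    return -math.exp(-x) - 1 / x

# Five Newton iterations starting from x0 = 1:
x = 1.0
for _ in range(5):
    x = x - f(x) / df(x)

# General a posteriori bound |e_5| <= |f(x_5)| / min_{[1,2]} |f'|:
bound = abs(f(x)) / 0.635
```

After 5 iterations, x already agrees with z = 1.30979958580415… to machine precision, and the bound is at rounding-error level.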


Secant Method 
The Secant Method is a variant of Newton's Method that avoids
the calculation of derivatives, replacing them by a finite difference
approximation (which in the limit coincides with the derivative itself).
Like Newton's Method, the Secant Method may be deduced in a
geometrical way, replacing the tangent line by a secant line.
Given two points x_{n−1} and x_{n}, we define the secant line
through the image points:
y = f(x_{n}) + (f(x_{n}) − f(x_{n−1}))/(x_{n} − x_{n−1}) (x − x_{n})
where the derivative is now replaced by the finite difference.
In a similar way, solving for y = 0 we get
x = x_{n} − (x_{n} − x_{n−1})/(f(x_{n}) − f(x_{n−1})) f(x_{n})
and this will be the new approximation x_{n+1}.
Summing up, the Secant Method consists in:
 Defining two initial iterates x_{−1} and x_{0};
 Repeat
x_{n+1} = x_{n} − (x_{n} − x_{n−1})/(f(x_{n}) − f(x_{n−1})) f(x_{n})

Remark: Note that in each iteration we do not need to evaluate
the function f twice, as the value f(x_{n−1}) is stored from the previous iteration.
This way the method is computationally more efficient than Newton's Method (in most cases),
as it does not demand an updated calculation of the derivative at each step.
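The iteration, with the stored value f(x_{n−1}) reused, can be sketched as:

```python
def secant(f, x_prev, x_cur, tol=1e-12, max_iter=50):
    """Secant iteration: Newton's derivative replaced by the finite
    difference through the last two iterates. Only one new f
    evaluation per step (f(x_prev) is reused). A sketch: no
    safeguard against f(x_cur) == f(x_prev)."""
    f_prev, f_cur = f(x_prev), f(x_cur)
    for _ in range(max_iter):
        x_new = x_cur - (x_cur - x_prev) / (f_cur - f_prev) * f_cur
        if abs(x_new - x_cur) < tol:
            return x_new
        x_prev, f_prev = x_cur, f_cur
        x_cur, f_cur = x_new, f(x_new)
    return x_cur

# Root of x^2 - 2 from the two initial iterates x_{-1} = 1, x_0 = 2:
root = secant(lambda x: x * x - 2, 1.0, 2.0)
```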
Error Formula (Secant Method):
It is possible to establish the following formula (check details in [KA]):
∃ ξ_{n}, η_{n} ∈ [x_{n−1}; x_{n}; z] :
e_{n+1} = − f ''(ξ_{n}) / (2 f '(η_{n})) e_{n} e_{n−1}
allowing the error estimate
|e_{n+1}| ≤ max |f ''(x)| / (2 min |f '(x)|) |e_{n}| |e_{n−1}|

The convergence is again superlinear (but not quadratic) when f '(z) ≠ 0, since
lim_{n→∞} |e_{n+1}| / |e_{n}| =
lim_{n→∞} |f ''(ξ_{n})| / (2 |f '(η_{n})|) |e_{n−1}| =
|f ''(z)| / (2 |f '(z)|) lim_{n→∞} |e_{n−1}| = 0
since e_{n} → 0 (assuming convergence).
It may be shown (e.g. [KA]) that if f '(z) ≠ 0 the order of convergence
of the Secant Method is p = (1+√5)/2 = 1.618…,
the golden number.
NOTE: In fact, taking y_{n} = log|e_{n}| and applying logarithms
to the error formula, we establish (a hand-waving argument)
y_{n+1} ≈ C + y_{n} + y_{n−1}, a Fibonacci-like recurrence,
and from the ratio y_{n+1}/y_{n} → p we get log|e_{n+1}| ≈ p log|e_{n}|,
which is a simple justification for this golden value of p.

Theorem (Convergence of the Secant Method − sufficient conditions on an interval):
Let f ∈ C^{2}[a,b].
Under hypotheses (1), (2), (3) of the similar theorem for Newton's Method,
the Secant Method has superlinear order of convergence p = 1.618…, initializing
 (a) with x_{−1}, x_{0} ∈ [a,b] : f(x_{−1}) f ''(x) ≥ 0, f(x_{0}) f ''(x) ≥ 0;
 (b) with any x_{−1}, x_{0} ∈ [a,b] if |f(a)/f '(a)|, |f(b)/f '(b)| ≤ b − a


Example:
Consider again f(x) = e^{−x} − log(x) = 0;
the conditions (1), (2), (3) were already verified in [1,2].
As condition (b) was also verified, we may use any pair of initial iterates
to obtain with the Secant Method a sequence with superlinear convergence.
Starting with x_{−1} = 1, x_{0} = 1.5, we get
x_{5} = 1.309799585804… with |e_{5}| ≤ |f(x_{5})|/0.635 < 2×10^{−13}
(using the general a posteriori estimate); it is enough to calculate
one more iteration to have all digits exact in double precision.


Final Remarks and References 
It should be noticed that there are several other popular numerical methods for root finding, some of which combine several of the previous methods. For instance, Brent's Method uses both the Bisection and Secant methods (along with inverse quadratic interpolation).
For complex root finding, the Fixed-point, Newton and Secant methods have straightforward generalizations. There is a sort of generalization of the Bisection method, the Lehmer-Schur algorithm, which consists in computing the winding number along 4 smaller squares at each step.
We will see some of these generalizations in the next chapters.
[CA] C. J. S. Alves, Fundamentos de Análise Numérica (I), Secção de Folhas AEIST (2002).
[KA] K. Atkinson, An Introduction to Numerical Analysis, Wiley (1989).
