This is a rough translation of the original Portuguese version (chapters 1 and 2 only).


Numerical representation

The current positional number notation was introduced in the Middle Ages through the influence of Al-Khwarizmi's works (IX century), and allowed a considerable development of Calculus in Western Europe. The Hindu notion of zero generated a concept of representation that was absent in earlier millennia. The ease of calculation with this new representation meant that the Greco-Roman cumulative notation was gradually abandoned ... nevertheless at a medieval rhythm.
The choice of a basis still posed some problems for axiomatic medieval scholars.
For example, the representation 1/3 = 0.3333... could never be exact in decimal notation. While fractions (rational numbers) have a periodic decimal expansion, real numbers like
√2 = 1.414213562373...
present an unpredictable sequence of digits. Their decimal representation is always limited to a finite number of digits, and to a consequent error.
It was only at the end of the XIX century, with G. Cantor, that real numbers were defined as (equivalence classes of) Cauchy sequences of rational numbers, where the sequence:
x0=1, x1=1.4, x2=1.41, x3=1.414, x4=1.4142, ...
is just a representative of all rational sequences that converge to the real number √2.
This interpretation of Cantor represented a great conceptual advance, allowing further to distinguish countable infinity from continuous infinity.

It is clear that the choice of the decimal basis for a real number representative is just the most common choice from an uncountable number of other possibilities. Already in the XVII century, Leibniz advocated advantages in binary representation − which is adopted nowadays in internal computer routines.
 
In general we can use any natural base β > 1, to write a real number.
For instance, in Scientific Notation (usually β=10),

x = ± (0. a1 a2 ... an ...)β × βt = ± ∑k≥1 ak βt-k
with the digits ak ranging from 0 to β-1. To have uniqueness of the representation, a1 ≠ 0 and zero is represented separately. This avoids several possibilities of representation, as in
0.1 × β1 = 0.01 × β2 = 0.001 × β3 = 1
by choosing only the first.
Note that, by convention, 0.999... is identified with the representation 1.000... (both sequences converge to 1).

Floating Point Representation

Unless we had a machine with infinite memory (... a theoretical Turing machine), the representation of a number must be finite. Thus we must consider a finite number of digits in the mantissa and a limitation on the allowed exponents.
 
Definition: Floating Point System
The FP(β, n, t-, t+) system is the set of rational numbers
FP(β, n, t-, t+) = {0} ∪ { ± 0. a1 a2 ... an × βt : ak∈ {0,..., β-1}, a1≠ 0, t∈ {t-, ..., t+ } }
Other Remarks:
(i) Note that as a1 ≠ 0, zero must be represented separately (it might be a bit, a zero flag, just as there is a bit that denotes the sign, a sign flag).
(ii) Usually β=10, but binary representation β=2 is preferred in computers (the digits ai might be 0 or 1). Details on computer implementation may be found in the link IEEE 754-1985 .
(iii) In programming languages (such as C, Fortran, Matlab) these FP systems are set when we declare the variables. Usually words like single, float, real*4,... stand for a single precision FP system with 4 bytes (23 bits − binary digits in the mantissa ~ up to 8 decimal digits). More recently it is standard to use double precision FP systems with 8 bytes (53 bits − binary digits in the mantissa ~ up to 16 decimal digits) and the variables are declared as double, float, real*8,...
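As an illustration of these remarks, here is a minimal Python sketch (our addition, not part of the original notes) comparing the rounding units of the usual single and double precision binary systems:

    import numpy as np

    # Rounding unit u = 1/2 beta^(1-n) for the two usual binary FP systems:
    # single precision has a 24-bit mantissa (23 stored bits), double precision has 53 bits.
    u_single = 2.0 ** (-24)   # ~ 6.0e-8  (about 7 decimal digits)
    u_double = 2.0 ** (-53)   # ~ 1.1e-16 (about 16 decimal digits)

    # The same values as reported by numpy (eps is one unit in the last place, i.e. 2u).
    print(np.finfo(np.float32).eps / 2, u_single)
    print(np.finfo(np.float64).eps / 2, u_double)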
 
Thus, we must map the real numbers to the rational number set FP,
fl: R → FP(β, n, t-, t+) ⊂ Q
and two types of limitations occur, leading to errors:
  • Finite exponent limitations (errors that may abort execution):
    • Overflow: when the exponent t is greater than t+.
    • Underflow: when the exponent t is less than t-.
  • Finite mantissa limitations (errors that we will study):
    • Rounding errors: truncation and rounding.

Rounding

Given a real number, represented in scientific notation
x = ± 0. a1 a2 ... an an+1 .... × βt
while storing it in a system FP(β,n) we might be forced to suppress further digits in the mantissa. To proceed with that conversion we consider two types of rounding:
  • Truncation
    flc(x) = ± 0. a1 a2 ... an × βt
  • Rounding (half-up) (assuming β is even)
    fls(x) = ± 0. a1' a2' ... an' × βt'
The digits ak' coincide with ak and the exponent t' coincides with t if an+1 < β/2.
If an+1 ≥ β/2 then the digits ak' and the exponent t' result from the representation of the sum
0. a1 a2 ... an × βt + βt − n.
Remarks:
(i) We may also consider
fls(x) = flc(x +  1/2 βt − n)
(ii) It should be noted that for statistical purposes there is also a round-to-even method that ensures a better distribution (in this method, the rounded value should end in an even digit − for instance, both 11.5 and 12.5 are rounded to the nearest even number, 12).
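A small Python sketch of the two rounding types in an FP(β, n) system (our illustration; the helper fl and its arguments are not from the original text, and edge cases are not treated carefully):

    import math

    def fl(x, beta=10, n=4, mode="round"):
        """Map x != 0 to FP(beta, n): 'trunc' gives flc(x), 'round' gives half-up fls(x)."""
        if x == 0.0:
            return 0.0
        s = -1.0 if x < 0 else 1.0
        x = abs(x)
        t = math.floor(math.log(x, beta)) + 1          # exponent so the mantissa lies in [1/beta, 1)
        m = x / beta ** t                               # mantissa 0.a1 a2 ...
        if mode == "trunc":
            digits = math.floor(m * beta ** n)          # keep only the first n digits
        else:
            digits = math.floor(m * beta ** n + 0.5)    # flc(x + 1/2 beta^(t-n))
        return s * digits * beta ** (t - n)

    print(fl(2/3, 10, 4, "trunc"))   # ~ 0.6666
    print(fl(2/3, 10, 4, "round"))   # ~ 0.6667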
 
Definition:
Consider x an exact value and x~ an approximated value of x, we define:
  • Error : ex( x~ ) = x − x~
  • Absolute error : |ex( x~ )| = |x − x~|
  • Relative error (x ≠ 0): |δx( x~ )|, with δx( x~ ) = ex( x~ ) / x
- If there is no confusion with the approximation of x, we simply write ex to denote ex( x~ ), and likewise δx for the relative error.
- We also use δx~ when it is more important to emphasize which is the approximation.

 
As in a FP system the rounding errors are inherent, we bound the relative rounding error |δarr(x)|, where
δarr(x) = δx(fl(x)) = ( x − fl(x) ) / x
 
Theorem:
For any x: fl(x)∈ FP(β, n), we have the relative error estimates:
  • Truncation : |δarr(x)| < uc = β1-n
  • Rounding : |δarr(x)| < us = 1/2 β1-n
The bounds u are sometimes called rounding units.
 
proof:
We prove for truncation. Consider any real number x
x = ± 0.a1 ... an an+1 ... × βt
As a1≠ 0 then a1≥ 1 and
|x|≥ 0.1 × βt = βt − 1
On the other hand
|earr| = |x − flc(x)| = |0.a1 ... an an+1... − 0.a1 ... an| × βt
= 0.0 ... 0 an+1... × βt ≤ 0.0 ... 1 0 ... × βt = βt − n
implies
|δarr| = |earr|/|x| ≤ βt − n / βt − 1 = β1 − n
(The proof for rounding is similar, and these estimates may be slightly improved...)

 
Remark
Note that in a FP system we never round to zero.
In some calculator machines, or in some ill implemented FP systems, it is common to see an overflow error, but not an underflow error, meaning that small absolute values are improperly rounded to zero.
This leads to odd situations, like null values for exponentials... for instance,
exp(-exp(8)) = 0.243 × 10-1294
may be zero in your calculator, without an "underflow warning".
- How to present the result, using a conventional machine?
In these cases you should separate calculations − what you need to compute from what is not representable. Thus, we may reduce to powers of 10, using logarithms, and presenting the result afterwards:
e-exp(8) = 10-exp(8)/log(10) = 10-1294.6136
note that the operation -exp(8)/log(10) presents no problems for the calculator, and separating the decimal part of the exponent 10-0.6136=0.2434 we get the mantissa.
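The same separation of calculations can be scripted; the following Python lines (our illustration) reproduce the decomposition above:

    import math

    # e^(-exp(8)) underflows in double precision, so write it as 10^(-exp(8)/ln 10).
    p = -math.exp(8) / math.log(10)      # ~ -1294.6136, harmless to compute
    t = math.floor(p)                    # integer part of the exponent: -1295
    mantissa = 10.0 ** (p - t)           # ~ 2.434
    print(f"{mantissa:.4f} x 10^{t}")    # ~ 2.4341 x 10^-1295  (= 0.243 x 10^-1294)
    print(math.exp(-math.exp(8)))        # 0.0 : silently rounded to zero (underflow)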
 
These types of errors may get worse as further calculations proceed... just note that if the result were 0 we could not recover, via the logarithm, the value that should be -exp(8).

Error propagation in functions

Consider now the study of error propagation that results from applying a function or an operation to real numbers. Generally, we want to know if a function f does or does not affect much the proximity of the original values:
x~ ≈ x ⇒ f(x~ ) ≈ f(x),
in terms of the relative error. Applying Lagrange's Theorem,
ef(x) = f(x) − f(x~ ) = f '(ξ) (x-x~) = f '(ξ)ex
with ξ∈ [ x; x~ ], supposing that f∈ C1 (in the interval).
It results (for x≠ 0, f(x)≠ 0)
δf(x) = ef(x)/f(x) = ( f '(ξ)/f(x) ) ex = ( x f '(ξ)/f(x) ) δx
and we may get the estimate for the relative error:
|δf(x)| ≤ ( |x| maxξ∈ [x;x~] |f '(ξ)| / |f(x)| ) |δx| .
This estimate is not very useful, and it is simpler to consider the approximation ξ ≈ x, which is a 1st order (or linear) approximation, in terms of the Taylor expansion.
In fact, neglecting the term o(ex) in
f(x~ ) = f(x) + f '(x)ex + o(ex)
we get ef(x)≈ f '(x)ex, and dividing by f(x), we again get the approximation
δf(x) ≈ ( x f '(x)/f(x) ) δx
which allows us to evaluate the error propagation from a function.
 
Definition: The value Pf(x) = x f '(x)/f(x) is called the condition number of f in x, and we have
δf(x) ≈ Pf(x) δx
If |Pf(x)| is large this means that the relative error is considerably increased through the application of the function f. On the other hand if it is small, this means that you may get a relatively good approximation of f(x) even with considerable errors in x.
 
Examples:
  • If f(x)=xp, we get Pf(x)=p, and therefore δx^p ≈ p δx.
    We conclude that for larger powers the initial relative error will be quite increased.
  • If f(x)=ex, we get Pf(x)=x, thus δexp(x) ≈ x δx.
    We conclude that for large |x| the initial relative error will be greatly amplified.
  • If f(x)=x-a, (a ≠ 0, constant) we get Pf(x)=x/(x-a), and therefore δx-a ≈ x/(x-a) δx.
    We conclude that for x ≈ a, the initial relative error will be largely increased.
    This example shows that the sum/subtraction may lead to significant perturbations in the calculations, due to "subtractive cancellation".
  • If f(x)=ax, (a ≠ 0, constant) we get Pf(x)=1, and therefore δax = δx.
    In this case the original error is not affected.
As a counterpart to some of the previous examples, the propagation may lead to advantageous situations ... while evaluating xp if p ≈ 0 (root extraction), the relative error decreases, and the same happens for ex or x-a when x ≈ 0.
In these cases, a reasonable approximation of x may lead to an excellent result in terms of f(x).
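As a quick check of these examples, the condition number Pf(x) = x f '(x)/f(x) can be estimated numerically; a short Python sketch (our addition, using a central difference for the derivative):

    import math

    def cond(f, x, h=1e-6):
        """Numerical condition number P_f(x) = x f'(x) / f(x), with a central difference."""
        dfdx = (f(x + h) - f(x - h)) / (2 * h)
        return x * dfdx / f(x)

    print(cond(lambda x: x ** 3, 2.0))       # ~ 3    : f(x)=x^p gives P_f = p
    print(cond(math.exp, 10.0))              # ~ 10   : f(x)=e^x gives P_f = x
    print(cond(lambda x: x - 1.0, 1.001))    # ~ 1001 : f(x)=x-a is ill conditioned near x=a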
 
Definition: We say that a function f is ill conditioned near a point z if, in the limit, |Pf(z)| = ∞.

Operations (functions with two or more variables)
In the case of two (or several) variables, we must consider simultaneous errors in both variables. Considering the Taylor expansion with two variables (∂m represents the partial derivative with respect to the m-th variable):
f(x~, y~) = f(x, y) + ∂1f(x, y)ex + ∂2f(x, y)ey + o(ex, ey)
we get ef(x, y)≈ ∂1f(x, y)ex + ∂2f(x, y)ey which is a linear approximation neglecting the term o(ex, ey), and we get
δf(x,y) ≈ ( x ∂1f(x,y)/f(x,y) ) δx + ( y ∂2f(x,y)/f(x,y) ) δy
In a similar way we get the condition numbers Pfx(x,y) and Pfy(x,y) as the coefficients of the terms δx and δy (respectively), and
δf(x,y)≈ Pfx(x,y)δx + Pfy(x,y)δy
  • When any of these condition numbers tends to infinity in a value (x,y) then we say that there is ill conditioning of f near (x,y), as previously presented for a single variable.
  • Likewise these notions are extended for functions with more than two variables.
The most interesting cases are simple operations like sums or products, that may be seen as functions with two variables.
  • For the sum (subtraction) we get
    δx+y = ( x/(x+y) ) δx + ( y/(x+y) ) δy   ;   δx − y = ( x/(x − y) ) δx − ( y/(x − y) ) δy
    note that it is an equality and not just an approximation!
    This formula shows immediately that there will be problems when subtracting similar quantities, i.e. x ≈ y, which is the case of subtractive cancellation, associated with significant rounding error propagation.
  • In the case of products (or divisions), the application of the formula gives
    δxy ≈ δx + δy   ;   δx/y ≈ δx − δy
    Note that this is no longer an equality, it is just an approximation − the exact formula contains a nonlinear term δxδy that is neglected:
    δxy = δx + δy − δxδy
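Subtractive cancellation is easy to observe in practice; a minimal Python experiment (our addition) with two nearly equal values:

    # x and y agree in the first 10 digits; their stored values carry relative errors
    # of order 1e-16, but the factor x/(x-y) ~ 1e10 amplifies them in the subtraction.
    x = 1.0000000001
    y = 1.0000000000
    exact = 1.0e-10
    computed = x - y
    print(computed)                          # ~ 1.0000000827e-10
    print(abs(computed - exact) / exact)     # relative error ~ 8e-8, far above 1e-16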

Error propagation in algorithms

The previous study of conditioning does not depend on the way the calculations are made, it depends only on the function. However, when we calculate an expression it is usually not possible to perform it in a single step − we are led to intermediate calculations and to the notion of algorithm.
 
Algorithm: While calculating z=f(x), we consider an algorithm as a mathematically equivalent form (but not numerically equivalent) defined by the finite composition of elementary functions Fk:
z1=F1(x); z2=F2(x, z1); ... zn=Fn(x,...,zn-1);
such that z=zn .
(generally, the variables or functions might be multidimensional).
For instance, to calculate z=sin(x2)+x2, we may consider the algorithm
z1=x2; z2 = sin(z1); z3 = z2+z1.
where at each step a single calculation is performed, using the previously stored results. It is now important to define what we consider an elementary function.
 
Definition: In a FP(β,n) system we say that F is an elementary computational function associated to a function f, if
F(y) = fl(f(y)), ∀ y∈ D ⊆ FP(β,n)
where D is a subset where the function is defined.
For instance, the computational function F=LOG is associated to the function f(x)=log(x), which is defined for x > 0, and we must consider
D = { x∈ FP(β,n) : x > 0 }.
The computational function LOG is considered elementary if the result (for the allowed numbers in the FP system) is exact up to rounding errors [that is the meaning of F(y) = fl(f(y))].
  • In standard programming languages (C, Fortran, Matlab ...) the simplest functions/operations are implemented to be elementary and they usually return the exact value up to a rounding error in the 15th decimal place (in double precision). For instance, we will consider as elementary:
    • the functions: exp, log, sin, cos,
    • the operations: + , − , * , / , ^ (sum, subtraction, product, division, power);
  • In principle the admissible set D should be restricted to avoid Overflow or Underflow errors.
    For instance, if x=10q then in a system FP(10,n,-t,t) when we program the elementary operation x^p we should have an error when xp = 0.1 × 101+qp has a larger exponent than the one admissible by the system, for instance, for q > 0:
    • when p > (t-1)/q : an Overflow error message should occur,
    • when p < -(t+1)/q : an Underflow error message should occur.

Condition numbers in an algorithm

  • Condition numbers without rounding errors
    The linear approximation allows us to relate the condition numbers of the partial results to the condition number of the final result.
    We may establish this relation by decomposing the function into successive compositions. It is then enough to prove it for a single composition to conclude the result for the whole sequence.
     
    Let z=f(g(x)) that we decompose in z1 = g(x); z2 = f(z1).
    Applying the formula at each step, we get
    δz2 ≈ ( z1 f '(z1)/f(z1) ) δz1 = ( g(x) f '(g(x))/f(g(x)) ) δz1 ≈ ( g(x) f '(g(x))/f(g(x)) ) ( x g'(x)/g(x) ) δx
    = ( x g'(x) f '(g(x))/f(g(x)) ) δx = Pf ◦ g(x) δx
    and therefore the final result gives exactly the condition number. This is summarized in the next proposition:
 
Proposition: With the linear approximation, without rounding errors, the product of the condition numbers at each step equals the condition number of the whole composition.
  • Condition numbers with rounding errors
    Even with well implemented elementary functions, at each step we must consider a rounding error. Thus, to calculate the error propagation in the algorithm we should add a rounding error, at each step. For instance, in the first step:
    δz1≈ Pf(x)δx + δarr1
    and similarly in the next steps, adding the term δarrk in each step k of the algorithm.
 
Example: Consider the hyperbolic sine function,
sinh(x)=(ex − e-x)/2
defined using the elementary function exp , by the algorithm
  • z1 = exp (x);
  • z2 = 1/z1;
  • z3 = z1 − z2;
  • z4 = z3/2;
To study the error propagation in this algorithm, we have at each step:
  • δz1 ≈ x δx + δarr1;
  • δz2 ≈ − δz1 + δarr2;
  • δz3 ≈ z1/(z1 − z2) δz1 − z2/(z1 − z2) δz2 + δarr3;
  • δz4 ≈ δz3 + δarr4;
the expressions may be substituted, obtaining
  • δz2 ≈ − (x δx + δarr1) + δarr2;
  • δz3 ≈ ex/(ex − e-x) δz1 − e-x/(ex − e-x) δz2 + δarr3
    ≈ ex/(ex − e-x) (x δx + δarr1) − e-x/(ex − e-x) (− x δx − δarr1 + δarr2) + δarr3
    = x(ex+e-x)/(ex − e-x) δx + (ex+e-x)/(ex − e-x) δarr1 − e-x/(ex − e-x) δarr2 + δarr3;
  • δz4 ≈ x(ex+e-x)/(ex − e-x) δx + (ex+e-x)/(ex − e-x) δarr1 − e-x/(ex − e-x) δarr2 + δarr3 + δarr4;
The last expression may be written in a general way
δz4 ≈ Pf(x) δx + Q1(x)δarr1 + Q2(x)δarr2 + Q3(x)δarr3 + Q4(x)δarr4
and we emphasize that the coefficient of δx is exactly the condition number Pf(x), while all the other coefficients Qm, associated with the rounding errors δarrm, depend on the chosen algorithm (the last one is always 1, here Q4=1).
 
Remark: In fact, we may simplify the increasingly long calculations needed to obtain δz4 by considering from the start that δx=0.
This way, we just calculate the terms Qk(x), since the term Pf(x) is given by the explicit formula.
It is an exercise to repeat the previous steps in this simpler fashion:
δz1 ≈ x δx + δarr1 = δarr1;
δz2 ≈ −δz1 + δarr2 = −δarr1 + δarr2;
δz3 ≈ ( (ex + e-x) δarr1 − e-x δarr2 ) / (ex − e-x) + δarr3;
δz4 ≈ δz3 + δarr4;
and the coefficients are exactly the same!
  • Note that sinh is not ill conditioned at any point x ∈ R, since even for x ≈ 0 we get
    limx→ 0 Pf(x) = 1,
    [Remark: it will be ill conditioned for large values, |x| ≈ ∞, because lim|x|→ ∞ |Pf(x)| = ∞ .]
  • Although being well conditioned for x ≈ 0, we see that
    limx→ 0 Q1(x) = + ∞;   limx→ 0 Q2(x) = − ∞
    and we conclude that in this algorithm the rounding errors change significantly the final result when x ≈ 0.
Thus, besides conditioning problems in functions, in an algorithm we consider also another notion − numerical stability.
 
Definition: Consider an algorithm that uses m elementary functions to calculate the function value z=f(x), in a FP(β,n) system. The expression for the relative error is linearly approximated by
δzm ≈ Pf(x) δx + Q1(x)δarr1 + ... + Qm(x)δarrm
If for a certain w we have
limx→ w |Pf(x)| = ∞ or limx→ w |Qk(x)| = ∞ (for some k)
we say that the algorithm is numerically unstable for the values x ≈ w.
  • We present the definition for one variable, but the generalization for algorithms that depend on a larger number of variables is clear. For instance, for two variable functions we have the condition numbers Pf x(x,y) and Pf y(x,y).
 
 
Remark: An immediate consequence of the definition is that ill conditioning implies numerical instability.
Thus, while it is impossible to solve ill conditioning problems without changing the number n in the FP(β,n) system (which can be done in more sophisticated languages, such as Mathematica), we should try to avoid numerical instability problems in the calculation of well conditioned functions. This can be done by looking for stable algorithms, or at least by avoiding the introduction of numerical instabilities in the algorithm.
 
Examples:
  • Consider an algorithm to calculate f(x)=(x-1)(x+1)+1:
    z1 = x-1; z2 = x+1; z3 = z1z2; z4 = z3+1;
    which, after simplification, reduces to the square function f(x)=x2. This is clearly a well conditioned function, with
    δf(x) ≈ 2δx.
    However, the unsimplified expression in the algorithm leads to numerical instabilities as the study of the propagation error shows
    δf(x) ≈ 2δx + (1 − x-2) (δarr1 + δarr2 + δarr3) + δarr4
    When x → 0 we obtain Q1,2,3(x) → − ∞, showing the numerical instability problem when x ≈ 0.
    In fact, we may check that in any FP(10,n) system with rounding, if x=10-n-1 we get z1 = -1; z2 = 1; and therefore z4 = 0, giving 100% relative error, i.e. |δz4|=1 (even considering that x is exact and δx=0). For that FP system we know that the rounding unit verifies
    |δarr| < u = 0.5 × 101-n,
    so we have |δz4| = 1 ≈ (2 × 10n-1) × 0.5×101-n .
  • Returning to the hyperbolic sine example: the function presents no ill conditioning problems (for fixed x ∈ R), but when x ≈ 0 the algorithm presents numerical instabilities, which we see in the infinite limits of Q1 and Q2.
    Based on the Taylor expansion, we may present an approximation that avoids that problem near zero. Thus,
    sinh(x)=(ex − e-x)/2 = (1+x+x2/2+O(x3) − (1 − x+x2/2+O(x3)))/2 = x+O(x3)
    and we may consider sinh(x) ≈ x, with an O(x3) error term.
    That error is negligible for |x| < 10t in a FP(10,n) system when t < -n/3.
    Therefore we may propose a stable algorithm to calculate z=sinh(x):
    • if t < -n/3 then z=x;
    • if t ≥ -n/3 then z=z4 (using the sinh algorithm for z4);
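A minimal Python sketch of this stable algorithm (our illustration; the threshold follows the rule t < -n/3 above, with n = 16 taken for double precision):

    import math

    def stable_sinh(x, n=16):
        """sinh(x) = (e^x - e^-x)/2, replaced by the Taylor approximation x when |x| is tiny."""
        if x != 0.0 and math.log10(abs(x)) < -n / 3:   # |x| < 10^(-n/3): use sinh(x) ~ x
            return x
        z1 = math.exp(x)            # z1 = e^x
        z2 = 1.0 / z1               # z2 = e^-x
        return (z1 - z2) / 2.0      # z3, z4 of the algorithm above

    print(stable_sinh(1e-9), math.sinh(1e-9))   # both 1e-9 (the naive formula loses digits here)
    print(stable_sinh(2.0), math.sinh(2.0))     # ~ 3.626860407847019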

Nonlinear scalar equations

    The resolution of algebraic (polynomial) equations has been an interesting subject even for the oldest civilizations. Ancient records show that some second degree equations were solved in Babylon (IV century BC), and they also appear in the works of Euclid in the III century BC (the general setting was considered by Brahmagupta, VII century AD, and by Al-Khwarizmi, IX century AD).
However, no closed formula for third degree equations was found in the following millennia. Some advances made by medieval scholars (e.g. Fibonacci) were only completed in the XVI century, with the Italian Renaissance.
Afterwards, new difficulties and impossibilities for higher degree equations led to numerical methods for the resolution of any equation.
These numerical methods also showed better performance than the closed formulas, and they were not restricted to algebraic equations.
 
Algebraic Equations
  • 1st Degree: First degree equations have a trivial solution:
    a x + b = 0 ⇔ x = -a-1 b     (when a ≠ 0)
    However we note that even in this trivial case, an elementary division algorithm is needed for b/a, and this is not easy to program in a FP system (it is probably more complicated than any other method here presented).
    While writing the solution in the form x = -a-1 b, we consider a general case, that includes matrices, if the condition a ≠ 0, is understood as the invertibility condition, det(a) ≠ 0.
  • Quadratic equation (2nd degree): Second degree equations have a closed simple formula:
    a x2 + b x + c = 0 ⇔ x = (2a)-1 ( - b ± (b2 − 4ac)1/2 )     (when a ≠ 0)
    These solutions are complex conjugates for negative discriminants, b2 − 4ac < 0 (assuming real coefficients).
    This expression includes a solution in any field structure:
    To solve x2 − x − 1 = 0 in Z7:
    x = (2)-1 (1 ± (1 + 4)1/2) = 4 ( 1 ± 3 i) = 4 ± 2 i
  • Cubic and Quartic (3rd and 4th degree): After the work of Khayyam and other slow medieval advances, del Ferro and Tartaglia found the formula to solve the cubic equation in the beginning of the XVI century. These works were further explored to solve the quartic equation by G. Cardan and L. Ferrari and published by Cardan.
    Links to these formulas may be found here: cubic formula , quartic formula. These techniques developed at Bologna University failed during the next centuries to solve the quintic (5th degree).
  • Quintic (5th degree): One of the most important problems in the following three centuries was to prove or disprove the existence of a closed formula for the quintic. After some substantial progress by Lagrange, the non existence was proven independently at the beginning of the XIX century by Galois and Abel.
    This impossibility answer made it clear that there would be no way to express explicitly the solution of some algebraic equations with degree ≥ 5.
  • Degree p: Despite the non existence of an explicit formula for the quintic (or higher degree equations) it was analytically established (by Argand, Gauss, early XIX century) that any algebraic equation with degree p would have exactly p complex roots − it is the well known fundamental theorem of algebra.
Non algebraic equations: For other, non algebraic equations, defined also with non polynomial functions (sines, exponentials, ...), from the beginning there was no hope of finding closed explicit formulas, except in trivial situations. We note that numbers like π or e may be solutions of simple non algebraic equations, but they cannot be expressed as the solution of any algebraic equation (with rational coefficients). These numbers are called non algebraic, or transcendental.
 
Numerical Methods for Root Finding
    Since explicit formulas were not available, even in medieval times some numerical procedures were attempted. Fibonacci (XIII century) presents fixed point iterations to solve some specific algebraic equations, but he does not explore the method.
It was only in the XVII century, with the development of calculus in Europe, that numerical methods were clearly established, especially after the works of Newton (see also Raphson). The Newton Method that we will study is probably the best known general method for root finding.
 
Rootfinding − general setting:
Given a real function f ∈ C[a,b], we want to approximate the solution z ∈ [a,b] verifying
f ( z ) = 0
The solutions z are also called roots or zeros of the function f.
  • The iterative methods consist in finding a sequence (xn) that converges to the solution z. This sequence is constructed in a recursive (or iterative) way, starting with some initial values and calculating xn+1 from the previous xn (or more xn, xn-1, ...).
  • We note that once we establish the convergence of the sequence (xn) we have not only an approximation in each term xn, but also an exact representative of the real number through the convergent sequence of rational numbers (if xn are real, we can always consider a similar rational number that does not interfere in the convergence - the usual way is by rounding the decimals).
  • The methods presented in this chapter are mainly meant to find the roots of an equation in a single real variable, but they may be used in more general contexts − systems, functional or differential equations (bisection is mostly restricted to ordered structures; fixed point, Newton and secant methods have been generalized to equations in complex numbers, matrices, operators...).
        We start by reviewing some basic Calculus results that ensure uniqueness and existence of solution in an interval (these are non constructive results).

Root localization

    A simple graph inspection may give us an approximate idea of root localization (as the points where the X axis is crossed). However to ensure rigorously that one and only one root exists in an interval, we recall some elementary theorems:
 
Theorem (Bolzano's (intermediate value) theorem : Existence ):
Let f ∈ C[a,b] : f(a) f(b) ≤ 0. Then ∃ z ∈ [a,b] : f(z) = 0.
We will use this theorem with the computational notation:
BolzanoTheorem[f ∈ C[a,b] , f(a) f(b) ≤ 0] :: z ∈ [a,b] : f(z) = 0
 
The intermediate value theorem only ensures existence, to ensure uniqueness we may use:
 
Theorem (Rolle's theorem : Uniqueness) :
Let f ∈ C1 [a,b], f(a)=f(b) implies ∃ ξ ∈ [a,b] : f ' (ξ) = 0.
Thus, if f ' ≠ 0 then f is injective, and there exists at most one zero of f in the interval.

Lagrange's mean value theorem is a simple consequence of Rolle's theorem and leads to an a posteriori error estimate for any approximation.
 
Theorem (Lagrange's mean value theorem):
Let f ∈ C1 [a,b]. Then ∃ ξ ∈ [a,b]: f(a)-f(b) = f '(ξ) (a-b)
Corollary (Lagrange error estimate − a posteriori):
Let f ∈ C1 [a,b] with f '(x) ≠ 0 ∀ x∈ [a,b].
If z~ is an approximation of the root z ∈ [a,b], then
|z − z~| ≤ | f(z~)| / minx∈[a,b]|f '(x)|
 
proof of the theorem:
Take g(x) = (b − a)f(x) − (f(b) − f(a))x.
We have g(a) = b f(a) − a f(b) = g(b), so by Rolle's theorem there exists ξ ∈ [a,b] with g'(ξ) = 0, that is, (b − a) f '(ξ) = f(b) − f(a).
proof of the corollary:
Since
z , z~∈ [a,b] there exists ξ∈ [z; z~ ] ⊆ [a,b]:
f(z~) − f(z) = f '(ξ)(z~ − z)
and as
f(z)=0, we have |z~ − z| = |f(z~)| / |f '(ξ)|. Thus
|z − z~| ≤ |f(z~)| / minx∈[a,b] |f '(x)|.

 
    These theorems do not give any method to approximate the root. Nevertheless we may use the Intermediate value theorem to develop a very simple iterative method.
 

Bisection Method

    Assume that in the interval [a, b] the equation f(x) = 0 has one and only one root (which may be verified with the help of the first two theorems above).
    We generate intervals [ an, bn ], each half the size of the previous one [ an-1, bn-1 ], in which we ensure the existence of the root using the intermediate value theorem.
 
The method is set by the following algorithm:
  • Initial interval : [ a0 , b0 ] = [ a, b ]
  • Iterate − Step n=1, 2, ... :
      • xn = ½ ( an-1 + bn-1 )
      • If f (xn) f(an-1) < 0 then an = an-1 ; bn = xn ;
      • If f (xn) f(an-1) > 0 then an = xn ; bn = bn-1 ;
      • If f(xn) = 0, or some other criteria, the method stops.
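A direct Python transcription of this algorithm (a sketch under the assumption that f(a)f(b) ≤ 0 was already checked):

    def bisection(f, a, b, eps=1e-12):
        """Bisection method: halve [a, b] keeping a sign change until |b - a| <= eps."""
        fa = f(a)
        while abs(b - a) > eps:
            x = 0.5 * (a + b)          # xn = midpoint of [a_{n-1}, b_{n-1}]
            fx = f(x)
            if fx == 0.0:              # exact root found
                return x
            if fx * fa < 0.0:          # root in [a_{n-1}, xn]
                b = x
            else:                      # root in [xn, b_{n-1}]
                a, fa = x, fx
        return 0.5 * (a + b)

    # Example: the root of x - cos(x) = 0 in [0, 1] (the equation used later in this chapter).
    import math
    print(bisection(lambda x: x - math.cos(x), 0.0, 1.0))   # ~ 0.7390851332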
At the end of step n we have the approximation xn and the subsequent error of approximation
en = z − xn.
We can always obtain an a posteriori error estimate using the previous corollary,
|en| ≤ |f(xn)| / minx∈[an-1, bn-1] |f '(x)| ,
but we may also deduce a priori error estimates.
  • A priori estimate in the bisection method.
    As xn is the middle point, the new interval [ an, bn ] is half the size of the previous one [ an-1, bn-1 ], meaning
    |an − bn| = ½|an-1 − bn-1 |
    and this leads to
    |an − bn| = (½)n|a0 − b0 | = 2-n|a − b| .
    On the other hand, z∈ [ an, bn ] and xn=an or xn=bn, therefore
    |en| = |z − xn| ≤ |an − bn| = 2-n|a − b| ,
    which is an a priori estimate (we may bound the absolute error in the step
    n without performing a single iteration).
  • If we want that |en| ≤ ε it is enough to demand
    2-n|a − b| ≤ ε ⇔ |a − b| / ε ≤ 2n
    ⇔ n ≥ log 2 ( |a − b| / ε )
    using the a priori estimate.
    We conclude, for instance, that for an interval
    |a − b|=1, double precision is ensured after a little more than fifty iterations, because to have ε < 2-56≈ 1.4× 10-17 it is enough that n > 56.
    The bisection method may be seen as the last method to try, when all the others fail. We will study other methods, like Newton's method, which converge in fewer iterations (usually 5 or 6 iterations are enough to reach double precision accuracy).

Fixed point method

The fixed point method starts by considering an equivalence
f(z)=0 ⇔ z=g(z)
where the zero
z of the function f is the fixed point of another function g (called the iterative function). Then, the method reduces to the following iteration:
  • Given any initial x0 ;
  • Iterate xn+1=g(xn).
 
Examples:
  • In a calculator machine, starting with 0 (or any other value), if we repeatedly press the COS key, we obtain a sequence of values (in radians)
    0; 1.0 ; 0.54 ; 0.857 ; ... ; 0.737 ; ... ; 0.73907 ; ... ; 0.7390851 ; etc...
    After 50 times, meaning 50 iterations, we see that the values of the sequence are getting stable and the first digits remain the same.
    • In this process we are applying the fixed point method, and the iterative function is the cosine function.
      Starting with
      x0=0 we repeatedly make xn+1= cos (xn).
    • When it converges (as in this case), xn → z then
      cos (xn)→ cos (z)
      because the cosine is a continuous function. Thus, in the limit
      xn+1= cos (xn) → z = cos (z).
      We conclude that the limit value
      z=0.7390851 is the solution of the equation z = cos (z).
    • Let f(x) = x − cos (x) then
      f(x) = 0 ⇔ x = g(x) = cos (x).
      Applying the previous theorems we see that in the interval
      [0,1] we have
      f(0) = 0 − cos(0) = -1 < 0, f(1) = 1 − cos(1) > 0
      and we ensure the existence of solution. The uniqueness results from
      f '(x)= 1 + sin (x) ≥ 0, which only vanishes at x = -π/2 + 2kπ ∉ [0,1].
    • Furthermore, as
      minx∈[0,1] |f '(x)| = minx∈[0,1] |1+ sin (x)| = 1
      the a posteriori estimate allows to conclude that the absolute error for the presented value is very small:
      |z − 0.7390851| ≤  |0.7390851 − cos (0.7390851)|/1   = 0.55× 10-7
  • Other simple examples may occur by pressing other calculator keys. Using the SQRT (square root) key, we have g(x)=√ x and the sequence xn+1=√ xn converges to a solution of z=√ z, which is also a solution of z=z2, meaning z=1. In this case there is another solution z=0 but the method never converges to it, unless you start with it.
    As a counterpart, if you have a SQR (square) key you are considering
    g(x)=x2 and the sequence xn+1=xn2 converges to a solution of z=z2, which is z=0 when you start with |x0| < 1. If we start with |x0| > 1 the sequence goes to infinity. The iteration only gets the other solution z=1 when we start exactly with x0=± 1.
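The calculator experiment with the COS key is easy to reproduce; a short Python sketch (our addition) of the fixed point iteration xn+1 = g(xn):

    import math

    def fixed_point(g, x0, n_iter=50):
        """Iterate x_{k+1} = g(x_k) a fixed number of times ('pressing the key' n_iter times)."""
        x = x0
        for _ in range(n_iter):
            x = g(x)
        return x

    z = fixed_point(math.cos, 0.0)
    print(z)                        # ~ 0.7390851332151607
    print(abs(z - math.cos(z)))     # residual |x - g(x)|, used in the a posteriori estimate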

 
From these examples we saw different situations, leading to convergence and non convergence, which we will now analyze.
Generally, the "key" that you repeatedly press is the iterative function
g. The value with which you start is the initial iterate, and the sequence of iterations is obtained by "pressing the key". We may get a rough idea of convergence by checking that the decimal digits are stabilizing, but this may be misleading in some unclear situations.
 
Remark: Whether there is convergence or not, at each iterate you may evaluate its accuracy using the a posteriori estimate with
f(x) = x − g(x) and z~ = xn:
|z − xn| ≤ | xn − g(xn)| / minx∈[a,b]|1 − g'(x)| ≤ | xn+1 − xn| / ( 1 − maxx∈[a,b]|g'(x)| )
with
|g'(x)| < 1, a condition that will define the convergence.
 
    Based on general plots for different iterative functions
g that we present in Fig. 1, we follow the iterations for convergence and non convergence ("divergence") situations.

Figure 1: Fixed Point Method − different cases.
Convergence situations are presented in the two upper plots (monotone convergence on the left, alternate convergence on the right). The fixed point acts like an attractor for the iteration. On the other hand, in the two lower plots the fixed point acts like a repeller: even if we start near the fixed point, the iterations move away from it.
 
 
Definition:
We say that
g is a lipschitzian function in [a, b] if there exists L > 0 such that :
| g(x) − g(y) | < L | x − y | , ∀ x, y ∈ [a, b]
The function is called a Contraction if the Lipschitz constant is less than one, meaning
L < 1.

Note that lipschitzian functions are continuous on the interval
(
x→ y implies immediately g(x)→ g(y)).
When the function has derivatives we relate this notion to the bounds of the derivative:
 
Proposition:
Let
g ∈ C1[a, b] such that |g '(x)| ≤ L, ∀ x ∈ [a, b], then the function g is lipschitzian in that interval, and when L < 1 it is a contraction.
 
proof:
Using Lagrange's theorem, for any
x, y in [a,b], there exists ξ ∈ ]x; y[ ⊂ [a, b] such that
| g(x) − g(y) | = |g'(ξ)| |x-y| ≤ L |x − y|.

    With these notions we may now present a fixed point theorem (here only for an interval).
 
Theorem (fixed point in a bounded interval).
Let
g be an iterative function in [a, b] such that:
  • g is contractive in [a, b],
  • g is invariant in [a, b], meaning g([a, b]) ⊆ [a, b] ,
then
  • g has one and only one fixed point z ∈ [a,b] ;
  • for any x0 ∈ [a, b], the fixed point sequence defined by xn+1 = g(xn) converges to z;
  • we have the following error estimates:
    • (a posteriori)
      • (i)     | en | < L | en-1 |
      • (ii)     | en | < L/(1-L) | xn − xn-1|
    • (a priori)
      • (iii)     | en | < Ln | e0 |
      • (iv)     | en | < Ln (1-L)-1 | x1 − x0 |
    • Here L < 1 is the contractivity constant.
      When
      g∈ C1[a,b], L = maxx∈ [a,b] |g'(x)| < 1.
 
proof:
  • Existence (fixed point in an interval):
    it may be seen as simple consequence of the intermediate value theorem.
    Take
    f(x)=g(x)-x, the invariance hypothesis implies a≤ g(a), g(b)≤ b.
    Therefore,
    f(a)f(b)≤ 0 and there exists z∈ [a,b]: f(z)=0 ⇔ z=g(z).
  • Uniqueness (fixed point in an interval)
    Suppose
    z,w∈ [a,b] both fixed points, z=g(z) and w=g(w).
    Using the contraction hypothesis
    |z − w| = |g(z) − g(w)| ≤ L |z − w| ⇔ (1-L)|z − w|≤ 0
    and as
    L < 1 we have |z − w| ≤ 0, that is z = w.
  • Convergence and estimates:
    Using the contraction hypothesis,
    |en|=|z − xn| = |g(z)-g(xn-1)| ≤ L |z-xn-1| = L |en-1|
    we obtain the first estimate that, applied repeatedly, gives (iii)
    |en| ≤ Ln |e0|.
    As
    L < 1 this estimate implies |en|→ 0 (note that |e0|≤ |a-b|), and we conclude the convergence.
    Finally (ii) results from
    |en|≤ L|en-1| = L|z − xn + xn − xn-1| ≤ L|en| + L|xn − xn-1|
    thus
    (1 − L)|en| ≤ L|xn − xn-1| implies (ii).
    Also, (ii) implies
    |e1| ≤ L/(1 − L) |x1 − x0| considering n=1, and (iv) results from
    |en| ≤ Ln-1 |e1|.
 
Proposition:
Under the hypothesis of the fixed point theorem, we have:
  • (a) If 0 < g'(x) < 1, ∀ x ∈ [a, b] then we have monotone convergence
    (all iterations are on the same side:
    xn > z or xn < z)
  • (b) If -1 < g'(x) < 0, ∀ x ∈ [a, b] then we have alternate convergence
    (iterations change side at each step:
    en en+1 < 0)
 
proof:
As
en = g(z) − g(xn-1) = g '(ξn) en-1
and as the sign of
g' is constant, from the hypothesis, we have
sign(en ) = sign( g '(ξn) ) sign(en-1 )
and obtain
in case (a):
sign(en ) = + sign(en-1 ) = ... = sign(e0 ),
in case (b):
sign(en ) = − sign(en-1 ) = ... = (-1)n sign(e0 ).
Remark: When alternate convergence is ensured, we have z∈ [xn-1; xn]. This allows to define new intervals where we have the fixed point, and we may update the value of L.
For instance, assuming that
-1 < g' < 0 and it is monotone, after p iterations we define a new
Lp = max { -g '(xp-1); -g'(xp) }
 
obtaining the following estimates with these new values. A possibility for an updated a priori estimate will be
|ep+k| ≤ (Lp)k | xp-1 − xp |
 
 
Example:
A method to solve quadratic equations
x2+bx-a=0 consists in establishing the equivalence (for x≠ -b)
x2+bx-a=0 ⇔ x(x+b) = a ⇔ x = a/(x+b)
where the iterative function is
g(x) = a/(x+b).
This leads to a sequence known as continued fraction
xn+1 = a/(b+xn) = a/(b + a/(b + a/(b + ...)))
that also allows to approximate square roots with fractions, in an optimal way (taking into account the size of the values).
For instance, for
a= b= 1, starting with x0=1 we get
x1=1/2, x2=2/3, ..., x6=13/21, ..., x13=377/610, x14=610/987, ...
and the last value
x14 = 610/987 = 0.6180344 is a simple fraction, giving a good approximation of the exact value z = (−1+√5)/2 = 0.6180339887
We may check that in the interval
I=[1/2, 1] the fixed point theorem hypotheses are verified:
  • contractivity: g'(x)=-(1+x)-2 is increasing (g'' > 0) and it is negative, therefore
    L = maxx ∈ I |g'(x)|= max{ -g'(1/2), -g'(1) } = max{ 4/9, 1/4} = 4/9 < 1
  • invariance: g is decreasing (g' < 0 ) and we have
    g(1/2) = 2/3 ∈ [1/2,1], g(1) = 1/2 ∈ [1/2,1]
This justifies the convergence, which is alternate (g' < 0), implying that z∈ [377/610, 610/987] and furthermore, using the a posteriori estimates (i) and (ii) we have
|e14| ≤ L/(1-L) |x14-x13| = 4/5 × 0.166×10-5 < 0.133×10-5
As
g' is increasing and negative, considering a reduced interval
L14= max {|g'(x13)|, |g'(x14)| } = 0.381966
and we have a precise estimate for the following iterations
|e14+k|≤ 0.382k (0.133×10-5)
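This continued fraction iteration can be reproduced in exact arithmetic; a Python sketch (our addition) using the fractions module:

    from fractions import Fraction

    def continued_fraction(a, b, x0, n_iter):
        """Fixed point iteration x_{k+1} = a/(b + x_k) for x^2 + b x - a = 0, with exact fractions."""
        x = Fraction(x0)
        for _ in range(n_iter):
            x = Fraction(a) / (b + x)
        return x

    x14 = continued_fraction(1, 1, 1, 14)
    print(x14, float(x14))                        # 610/987 ~ 0.6180344...
    print(abs(float(x14) - (5 ** 0.5 - 1) / 2))   # ~ 4.6e-7, within the bound 0.133e-5 above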

 
 
Theorem (Non convergence or "divergence"):
Let
g∈ C1(Vz) where Vz is a neighborhood of the fixed point z.
If
| g '(z) | > 1, the fixed point sequence never converges to z, except if xp=z (at some step p).
 
proof:
We saw, using Lagrange's theorem:
en+1 = g '(ξn) en
with
ξn∈ [xn;z] ⊂ Vz considering xn sufficiently close to z. Since | g '(z) | > 1, then in the neighborhood we have also | g '(ξ) | > 1, (because g∈ C1(Vz)).
Since
en ≠ 0 (as xn≠ z)
|en+1| = |g '(ξn)| |en| > |en|
and the error increases, making convergence impossible.

    This is a local result, which allows us to conclude non convergence as long as we show that the derivative (in modulus) is greater than 1 in an interval containing the fixed point.
As a counterpart we may establish a local convergence result. This result is slightly different from the fixed point theorem, as it does not demand invariance, but we need an a priori localization of the fixed point, since the initial iterate must be sufficiently close to it.
 
Theorem (Local convergence): Let
g∈ C1(Vz) where Vz is a neighborhood of the fixed point z.
If
| g '(z) | < 1, the fixed point sequence converges to z, when x0 is sufficiently close to z.
 
proof:
This result may be reduced to the fixed point theorem considering
Wz = {x: |x − z| ≤ ε}
with
ε small enough to have a contraction,
|g'(x)| < 1, ∀ x∈ Wz⊂ Vz ,
as
|g'(z)| < 1 and g∈ C1(Vz). Therefore, if x∈ Wz we have
|g(x) − z| = |g '(ξ)| |x − z | ≤ ε, that is, g(x)∈ Wz,
and invariance is shown. Thus, we conclude the convergence by the fixed point theorem, as long as
x0∈ Wz, meaning, for initial iterates sufficiently close to z.

 
    Within convergence cases, there are sequences that converge more rapidly than others. Thus, it is appropriate to distinguish the order of convergence.
 
Definition Let
(xn) be a sequence that converges to z.
We say that
(xn) has order of convergence p with asymptotic coefficient Kp ≠ 0 if the following limit exists
limn→∞ |en+1| / |en|p = Kp ≠ 0
Note that K1≤ 1 because if K1 > 1 no convergence occurs: |en+1| > |en| (the error increases).
 
We distinguish 3 main classes of convergence:
  • Superlinear (or Exponential): when p>1 (or p=1 and K1 = 0).
  • Linear: p=1 and 0 < K1 < 1
  • Sublinear (or Logarithmic): p=1 and K1 = 1
There are other names as quadratic convergence when p=2, cubic convergence when p=3 (these convergence orders are different from the ones given in the context of approximation, as in quadrature rules).
Note that if the limit is
Kp=0 this means that the convergence is at least of order p, but then one should look for a higher p, to obtain the correct order of convergence.
 
Examples:
We present some examples that are not necessarily connected to the fixed point method.
(i) Let
xn = αn with 0 < α < 1.
This sequence converges to
z=0, and we have
K1 = limn→∞ |en+1| / |en| = limn→∞ |αn+1| / |αn| = α ≠ 0
this means a linear convergence with
K1 = α.
(ii) Let
xn = αβ^n with 0 < α < 1 < β.
[note that xn = α^(β^n) is different from (α^β)^n = α^(β n) ]

This sequence also converges to
z=0, but we have
K1 = limn→∞ |en+1| / |en| = limn→∞ |αβ^(n+1)| / |αβ^n| = limn→∞ |(αβ^n)β| / |αβ^n| = limn→∞ (αβ^n)β-1 = 0
and we have superlinear convergence (
K1 = 0). To identify the exact order, we check
Kβ = limn→∞ |en+1| / |en|β = limn→∞ (αβ^n)β-β = 1 ≠ 0
concluding that the order of convergence is
β > 1.
(iii) Finally a logarithmic convergence result.
Take
xn = n-b with b>0, that converges to z=0. As,
K1 = limn→∞ |en+1| / |en| = limn→∞ |(n+1)-b| / |n-b| = limn→∞ (1+1/n)-b = 1
we conclude the sublinear convergence.
 
Theorem (Fixed point method − order of convergence):
Let
g∈ Cp(Vz) where Vz is a neighborhood of the fixed point z.
If
g(p)(z)≠ 0, and g'(z) = ... = g(p-1)(z) = 0, then the (local) order of convergence of the fixed point sequence is p, and the asymptotic coefficient is given by
Kp = |g(p)(z)| / p!
(note that for
p=1, this implies K1 < 1 to have local convergence).
 
proof: The local convergence was ensured in the previous theorem for
K1 = |g'(z)| < 1.
Using the Taylor expansion with Lagrange remainder
g(xn) = g(z) + g(p)(ξn)/p! (−en)p, with ξn ∈ [xn; z]
because
g'(z) = ... = g(p-1)(z) = 0.
On the other hand,
g(xn)=xn+1 and z = g(z) means
− en+1 = g(p)(ξn)/p!  (−en)p,
Thus, substituting in
Kp = limn→∞ |en+1| / |en|p = limn→∞ |g(p)(ξn)|/p! = |g(p)(z)| / p! ≠ 0
we conclude that the order of convergence is
p.

Remark:
(i) It is an immediate consequence of the definition of order of convergence that if a method has order of convergence
p > 1, a method grouping two iterations will have order of convergence p2. Therefore, it is a qualitative difference to obtain superlinear convergence, and it is a secondary issue to obtain higher orders of convergence (despite some emphasis in the usual literature).
Furthermore, if the convergence is just linear, a new method grouping two iterations will merely square the asymptotic coefficient.
Thus, for superlinear methods the order of convergence plays a role similar to the one played by the asymptotic coefficient in linearly convergent methods.
(ii) Another possibility to define order of convergence
p, consists in having a constant Kp > 0 such that
|en+1| ≤ Kp |en|p   (n sufficiently large)
(when
p=1 we must have K1< 1), and it is not necessary that the limit exists.
Furthermore, knowing the order of convergence, we may establish a relation between the difference of iterations and the absolute error:
|en| = |xn+1 − xn| / |1 − en+1/en| ≤ |xn+1 − xn| / ( 1 − Kp |en|p-1 )
as long as
Kp |en|p-1 < 1. We conclude that
|en| ≤ ( Kp |en-1|p-1 / ( 1 − Kp |en-1|p-1 ) ) |xn − xn-1|
... this expression was already known for
p=1 (the condition K1 < 1 is necessary for the linear convergence).
 
Based on the difference of iterations we may define a stopping criterion, noticing that
|en| ≤ ( Ln / (1 − Ln) ) |xn − xn-1|, when Ln = Kp |en-1|p-1 < 1
(this will happen for some
n as en → 0).

Newton Method

    The previous theorem allows us to conclude superlinear convergence if the iterative function has a null derivative at the fixed point. Thus we are led to search for methods with this property.
 
When
f '(x) ≠ 0 we may establish an equivalence
f(x) = 0 ⇔ x = x − f(x)/f '(x)
and define the iterative function
g(x) = x − f(x)/f '(x).
For the root
z, as f(z)=0, we get
g'(x) = f(x) f ''(x)/(f '(x))2   ⇒   g'(z) = 0
Therefore, according to the previous theorem, in this case we have at least quadratic convergence
(it might be cubic, if
f '(z) ≠ 0, f ''(z) = 0).
 
Newton Method consists in applying the fixed point method to this
g, meaning
  • Given an initial iterate x0 ;
  • Repeat  
    xn+1 = xn − f(xn )/f '(xn )
    .
and as we saw, excluding the case f '(z)=0, we ensure at least quadratic convergence.
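A minimal Python sketch of this iteration (our illustration; the function and its derivative are passed explicitly):

    def newton(f, dfdx, x0, n_iter=6):
        """Newton Method: x_{n+1} = x_n - f(x_n)/f'(x_n), for a fixed number of iterations."""
        x = x0
        for _ in range(n_iter):
            x = x - f(x) / dfdx(x)
        return x

    # Cube root of 6, as in example (ii) below: f(x) = x^3 - 6.
    print(newton(lambda x: x ** 3 - 6.0, lambda x: 3.0 * x ** 2, 2.0, 4))
    # ~ 1.817120592832139, matching the 4th iterate listed in the examples below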
 
Remarks:
(1) Newton Method also has a geometrical interpretation that is related to its original deduction (this method is also called Newton-Raphson Method, or Tangent Method), as presented in Fig. 2.

Figure 2: Iterations in Newton Method.
Starting with
x0=1, the new point x1 results from the intersection of the tangent line with the axis y=0, and so on.
 
The geometrical deduction consists in using the tangent line
y = f(xn) + f '(xn)(x-xn)
as an approximation of
f in xn (it is the first order approximation in Taylor's expansion). Deducing the solution x such that y=0 we obtain the same expression
0 = f(xn) + f '(xn)(x-xn) ⇔ x = xn − f(xn)/f '(xn)
this solution
x is the new approximation xn+1.
 
(2) We may be led to deduce higher order approximations, increasing the order of the Taylor expansion. This leads to a higher order of convergence but also to more complex expressions, which are computationally less efficient than grouping iterations (see the previous remarks).
In fact,
n iterations with Newton Method are usually more efficient than a single iteration of some other method of order 2^n.
 
Examples:
We present some examples that show the performance of Newton Method.
(i) Calculate
z = a1/p (the pth root of a) using only arithmetic operations.
Note that here
z is the solution of f(x) = xp − a = 0.
According to the local convergence theorem, we take a sufficiently close
x0, and the method resumes to
xn+1 = xn − (xnp-a)/(p xnp-1) = xn(1-1/p) + (a/p) xn1-p
it is even simpler to extract square roots (
p=2)
xn+1 = xn/2 + a/(2 xn)
(this is useful in binary FP systems, as the divisions/products by 2 are trivial).
(ii) Consider for instance the calculation of
z = 61/3 (the cube root of 6):
x0 = 2, x1 = 11/6, x2 = 1.817263, x3 = 1.81712060, x4 = 1.817120592832139
in this 4th iterate all the presented digits are correct.
(iii) Recovering the continued fraction example, the application of Newton Method to
f(x)=x2+bx-a=0 leads to calculations that involve also simple fractions
xn+1 = xn − ((xn+b)xn-a)/(2xn+b) = (xn2+a)/(2xn+b)
obtaining for
x0=1
x1=2/3, x2=13/21, x3=610/987, x4=1346269/2178309, ...
Note that these are the values that we also obtained in the continued fraction, but in a much faster way (here the 2nd iteration is the 6th iteration in the continued fraction, the 3rd is the previous 14th, and the 4th is the previous 30th). The fraction
x4 presents an error below 10-13.

 
Again, we are also interested in establishing sufficient conditions that ensure a convergence that is not only local (for a "sufficiently close initial iterate"), but that is valid for a set of initial iterates in an interval.
 
 
Theorem (Convergence of Newton Method − sufficient conditions for an interval):
Let
f∈ C2[a,b] such that in the interval (x ∈ [a, b])
  • (1) f(a) f(b) ≤ 0 ;
  • (2) f ' (x) > 0 or f ' (x) < 0 ;
  • (3) f '' (x) ≥ 0 or f '' (x) ≤ 0 ;
Then Newton Method has at least quadratic convergence to the unique solution
z ∈ [a,b], initializing the method:
  • (a) with x0∈ [a,b] : f(x0) f ''(x) ≥ 0
  • (b) for any x0∈ [a,b] if |f(a)/f '(a)|, |f(b)/f '(b)| ≤ |b − a|
 
proof:
A non constructive proof is simple but not relevant here; instead we discuss the conditions:
  • Note that conditions 1) and 2) imply existence and uniqueness.
  • Condition 3) is important so that the convexity does not change, which could lead to a cyclic, non convergent situation:
    − For instance, if
    f(x) = x3 − 5 x, in [-1,1] there is a single solution (z=0) and conditions 1) and 2) are verified, but not condition 3), since f ''(x)=6x changes sign in [-1,1]. As a consequence we see that
    x0 = 1, x1 = 1 − (1-5)/(3-5) = -1, x2 = 1, ..., xk = (-1)k,
    and the sequence defined by Newton Method does not converge!
  • Finally, note that condition (a) is always verified at one of the interval endpoints, but to ensure convergence for x0 at any point of the interval, condition (b) guarantees that the iterates do not fall off the interval... for instance, if x0=a,
    x1 = a − f(a)/f '(a) stays in the interval since |x1 − a| = |f(a)/f '(a)| ≤ |b − a|.
    Moreover, the convergence is monotone for initial iterates verifying (a). Condition (b) only has to be verified at the endpoint that does not verify (a); then
    x1 is a starting point verifying (a), and the convergence is monotone for the following iterations.
 
Theorem (Newton Method − Error Estimates):
Let
f∈ C2[a,b] and assume the previous quadratic convergence conditions. We have the following error formula:
∃ ξ ∈ [xn; z] :     en+1 = − ( f ''(ξ) / (2 f '(xn)) ) (en)2
this leads to the asymptotic coefficient
K2 = |f ''(z)| / (2 |f '(z)|)
and to the estimates
  • (a posteriori)
    |en| ≤ ( maxx∈ [a,b] |f ''(x)| / (2 |f '(xn-1)|) ) |en-1|2
  • (a priori)
    M |en| ≤ (M|e0|)2^n
    where
    M = maxx∈ [a,b] |f ''(x)| / ( 2 minx∈ [a,b]|f '(x)| )
 
proof:
− The Error Formula results from the Taylor expansion
0 = f(z) = f(xn) + f '(xn)(z-xn) + ½f ''(ξn)(-en)2
dividing by
f '(xn)≠ 0 we obtain
0 = f(xn)/f '(xn) + z − xn + ½f ''(ξn)/f '(xn)(en)2
and as
f(xn)/f '(xn) − xn = -xn+1 we get,
0 = z − xn+1 + ½f ''(ξn)/f '(xn)(en)2
with
ξ ∈ [xn; z] and xn → z. We get K2 dividing |en+1| by |en|2. Finally
− the first estimate is immediate, since
ξ ∈ [xn; z]⊂ [a,b].
− the second estimate is also immediate, as
M |en| ≤ (M|en-1|)2
applying it repeatedly.

 
Remarks:
    (i) When
f '(z)=0 we are in the situation of a double root (at least), and Newton Method loses quadratic convergence, although a slight modification can recover it (e.g. [CA]).
    (ii) When
f ''(z)=0 (and f'(z) ≠ 0) we obtain a cubic convergence.
    (iii) As an empirical rule, quadratic convergence means doubling the number of significant digits at each iterate (... cubic convergence means tripling). This means that, near the root, the exact double precision value is achieved in fewer than 4 or 5 iterations; the harder part is getting near the root in the first place.
    (iv) We stress this point, as it follows from the error estimates that, to have an acceptable a priori estimate, the value
M|e0| should be less than 1, which means that the initial iterate should not be far from the solution (especially if |f '| is small or |f ''| is large).
 
Example:
Consider
f(x)= e-x − log(x) = 0, that verifies in the interval [1,2]:
  • 1) f(1) = e-1 > 0,   f(2) = e-2-log(2) < 0;
  • 2) f '(x) = -e-x-1/x < 0 (for x∈ [1,2])
  • 3) f ''(x) = e-x+1/x2 > 0 (for x∈ [1,2])
the sufficient conditions for Newton Method convergence are verified, at least for initial iterates such that f(x0)>0 [conclusion (a) from the Theorem], as it is the case for x0=1.
Since
|f(1)/f '(1)| = 0.2689 , |f(2)/f '(2)| = 0.87798 < |2-1| = 1, following conclusion (b), we have convergence given any x0∈ [1,2].
Starting with
x0=1 we get
x1=1.2689, x2=1.309108, x3=1.3097993887
as
|f(x3)|≤ 2.04× 10-7 and min[1,2]|f '|=|f '(2)| > 0.635, the general a posteriori estimate gives us
|e3|≤ 2.04× 10-7 / 0.635 < 3.21× 10-7.
and we see that
x3 already presents 7 correct digits.
Using this information we may improve the next estimates.
The interval is now small enough to consider
M≈ K2 ≈ 0.413, therefore
0.412|ek+3| ≤ (M|e3|)2^k ≤ (1.325× 10-7)2^k
meaning that
x5=1.30979958580415 presents an error bound
|e5|≤ 0.412-1(1.325× 10-7)4 < 10-27
so that further iterations in double precision are useless, as only the rounding error will remain.

Secant Method

    The Secant Method is a variant of Newton Method: it avoids the calculation of derivatives, replacing them by a finite difference approximation (which in the limit coincides with the derivative itself).
 
Like the Newton Method, the Secant Method may be deduced in a geometrical way, substituting the tangent line by a secant line.
Given two points
xn-1 and xn we define the "secant line" that crosses the image points:
y = f(xn) + ( (f(xn) − f(xn-1)) / (xn − xn-1) ) (x − xn)
the derivative is now substituted by the finite difference.
In a similar way, solving for
y=0 we get
x = xn − ( (xn − xn-1) / (f(xn) − f(xn-1)) ) f(xn)
and this will be the new approximation value
xn+1.
 
Summing up, the Secant Method consists in:
  • Defining two initial iterates x-1 and x0.
  • Repeat
    xn+1 = xn − ( (xn − xn-1) / (f(xn) − f(xn-1)) ) f(xn)
Remark: Note that in each iteration we do not need to evaluate the function f twice, as the value f(xn-1) is stored from the previous iteration. This way the method is computationally more efficient than Newton Method (in most cases), as it does not demand an updated calculation of the derivative at each step.
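A Python sketch of the Secant iteration (our addition), storing the previous function value as the remark suggests:

    import math

    def secant(f, x_prev, x0, n_iter=8):
        """Secant Method: Newton's derivative is replaced by a finite difference."""
        f_prev, x, fx = f(x_prev), x0, f(x0)
        for _ in range(n_iter):
            x_new = x - (x - x_prev) / (fx - f_prev) * fx
            x_prev, f_prev = x, fx          # reuse f(x_n): only one new evaluation per step
            x, fx = x_new, f(x_new)
        return x

    # The equation used in the example below: f(x) = exp(-x) - log(x) on [1, 2].
    print(secant(lambda x: math.exp(-x) - math.log(x), 1.0, 1.5, 5))
    # ~ 1.309799585804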
 
Error Formula (Secant Method)
It is possible to establish the following formula (check details in [KA])
∃ ξn, ηn ∈ [xn-1; xn; z] : en+1 = − ( f ''(ξn) / (2 f '(ηn)) ) en en-1
allowing the error estimate
|en+1| ≤ ( max |f ''(x)| / (2 min |f '(x)|) ) |en| |en-1|
The convergence is again superlinear (but not quadratic) when
f '(z) ≠ 0, since
limn→∞ |en+1| / |en| = limn→∞ ( |f ''(ξn)| / (2 |f '(ηn)|) ) |en-1| = ( |f ''(z)| / (2 |f '(z)|) ) limn→∞ |en-1| = 0
since
|en|→ 0 (assuming convergence).
It may be shown (e.g. [KA]) that if
f '(z) ≠ 0 the order of convergence of the Secant Method is p = (1+√ 5)/2 ≈ 1.618, the golden ratio.
 
NOTE: In fact, taking yn = log|en| and applying logarithms to the error formula, we establish (a hand-waving argument)
yn+1 ≈ C + yn + yn-1, a Fibonacci-type recurrence,
and from the ratio yn+1/yn → p we get log|en+1| ≈ p log|en|, which is a simple justification for this golden value of p.
 
 
Theorem (Convergence of the Secant Method − sufficient conditions in an interval)
Let
f ∈ C2[a,b]. Under hypotheses (1), (2), (3) of the analogous theorem for Newton Method, the Secant Method has a superlinear order of convergence p ≈ 1.618, initializing
  • (a) with x-1, x0 ∈ [a,b] : f(x-1)f ''(x) ≥ 0 , f(x0)f ''(x) ≥ 0
  • (b) with any x-1, x0 ∈ [a,b]   if   |f(a)/f '(a)|, |f(b)/f '(b)| ≤ |b − a|
 
Example:
Consider again
f(x)= e-x-log(x) = 0 the conditions (1), (2), (3) were already verified in [1,2]. As the condition (b) was also verified, we may use any pair of initial iterates to obtain with the Secant Method a sequence with superlinear convergence.
Starting with
x-1=1, x0=1.5, we get
x5=1.309799585804 with |e5| ≤ |f(x5)|/0.635 < 2× 10-13
(using the general a posteriori estimate); one further iteration is enough to have all digits exact in double precision.

Final Remarks and References

It should be noticed that there are several other popular numerical methods for root finding, some of which combine the previous methods. For instance Brent's Method uses both the Bisection and the Secant methods (along with inverse quadratic interpolation). For complex root finding, the Fixed point, Newton or Secant methods have straightforward generalizations. There is a sort of generalization of the Bisection method, the Lehmer-Schur algorithm, which consists in computing the winding number along 4 smaller squares at each step. We will see some of these generalizations in the next chapters.
[CA] C. J. S. Alves, Fundamentos de Análise Numérica (I), Secção de Folhas AEIST (2002).
[KA] K. Atkinson: An Introduction to Numerical Analysis, Wiley (1989)