The idea behind using numerical methods is to basically use easy maths and a few not-too-inaccurate approximations to predict hard maths. Major applications of this involve using computers to solve tough integrals or derivatives via various methods, to solve simultaneous systems of equations, optimise situations for various controllable conditions, and do other things that would otherwise suck to do by hand.
I plan to touch on a few of these methods, detailing how they’re useful and explaining a bit of theory behind each of them. They’ll be grouped into their various types, as multiple ones are related by way of being modifications of each other, or used to solve the same types of problems.
Throughout this, I’ve been lazy and stolen pretty much all of my images of equations/etc off Wikipedia, so here’s a token referencing of my sources comment. Check Wikipedia for more information about most of these topics, too, though I’ve found it to be a bit hard to get your head around if you’re new to this all, so I’ve tried to explain it a bit easier than Wikipedia has.
Please click on the title of a particular method to get an example of how it is used. This may be changed in future, but it's the easiest way for me to organise it at the present time. It'll expand as I get bored now that I have no uni to do.
Numerical integration methods are used to find the definite integral of a function between two points. Depending on how they’re implemented, they may or may not include adjustments for rounding errors inherent in the method used. I won’t address trivial examples such as the trapezoidal method, as they’re covered in a typical high-school education. Instead, I’ll focus on a few of the slightly more advanced ones.
The Monte-Carlo method is probably one of the easiest methods, and was coined by the guys working on the Manhattan Project. It works on the idea that the integration of a function is, by definition, the area under that function if graphed. As such, it relies on taking the section of a faction that is desired to be evaluated, graphing it, and then simply generating random data on the graph and finding out the ratio that fall under the graphed function. It’s especially useful in cases where one can easily draw the function by defining coordinates of the boundaries of said shape, and is very easy to the point of being almost trivial to implement, whilst still being quite powerful, despite not being particularly efficient.
To implement, one simply draws the function on a graph of known scale. Then, randomly generated values are made for both of the coordinates of said graph, and if the point generated falls within the area bound by the function. One then keeps a tally of both the number of generated values, and the number of those that fall within the area bound by the function. The ratio of these two then equals the ratio of the area bound by the function compared to the total area included by the graph. This can then be used to find the definite integral of that function.
A simple example is find the area of the quarter of a circle of radius 1. We’ll graph it on a graph with the x and y axis ranging from 0 to 1. This is useful, as the total area of the graph is 1, so we know that the ratio of points is going to automatically be the answer, as the ratio of the quarter-circle-area to total graph area is with respect to 1.
This, as seen, is defined by the function x2+y2 = 1.
Now, we simply generate random coordinate points, by generating two values ranging between 0 and 1, and using them as the x and y coordinates. Some may be:
And so on. When each is generated, we check if it falls inside the area bound by the graph. Obviously, in this case, we can take both of those values, and check if they are less than or equal to 1 when put in the above formula.
For the first one:
Because this is less than 1, it lies inside the area of the graph. By generating many of these, the ratio converges towards the definite integral of the function.
The Monte-Carlo method can be expanded into higher dimensions simply by generating more coordinates for each point. As such, it can easily find volumes, so long as you can accurately describe the boundaries of the volume. One example of where this is used, is to explore the dimensionality of various crazy weird shapes, such as fractals and the like, which use a definition of dimension that can acutally be decimal. In cases such as this, it's quite useful, as not only can it be easily tailored to doing various crazy modelling problems via brute force, but it can also be highly parallelised which is a definite advantage in the current computing environment.
You’re probably familiar with this method, in a way, despite not actually knowing it. That’s because the Newton-Cotes method is the name given to the generalised method that, when taken at a low order, produces the trapezoidal and Simpson’s methods.
It basically relies on taking the function, estimating it in small portions, and then calculating its area underneath. It’s most likely used when you know points along the original function that are evenly spaced. We’re familiar with this from the trapezoidal rule, which is order 1 and shown below:
This takes two points, finds their average, and then multiplies it by the width between those two points. As we know, we can do this a lot of times and get a pretty accurate answer.
Simpson’s rule takes this a step higher, though – it doesn’t just take 2 points, but rather 3, and ‘weights’ them to give a better approximation of the shape that the function takes on (trapezoidal rule, for example, assumes a straight line). The following equation shows Simpson’s rule:
Where h is the distance between two points, and f1 is the midpoint between f0 and f2. As you can see, this weights the function over 3 evenly spaced points rather than two and as such, it’s more accurate, as it estimates with a parabola rather than a straight line.
Order 3, otherwise known as ‘Simpson’s 3/8 rule’, is the next order up, and follows the form:
Again, where h is the distance between 2 of the points, and all the points are evenly spaced. It’s even more accurate, but requires more points to solve.
It is possible to go to higher orders, though they tend to get a bit clunky there and other methods are better anyway. However, it’s still quite an accurate way of finding an integral if you can solve for any desired points along the original function. Similarly, some bright sparks, some time ago, came up with ways to make Simpson’s rule more efficient, by making it adapt to parts of functions that need more points to accurately describe, but I won’t cover that in this specifically.
Integration of ODEs
This is slightly different to the numerical integration above, which is about finding the definite integral of a certain function. Instead, these methods are used to integrate ODEs and are typically used to do things like simulate the flight of a projectile, or the flux of heat through an object, where you have differential relations between properties (for example, with the flight of a projectile, acceleration is the derivative of velocity, which is the derivative of displacement).
The idea is to be able to integrate using a history of a few values plus some differential relations, to solve accurately what values will come next. A bunch of these methods, and indeed the two I am covering, are based off the Taylor expansion, which is just the expanded form of an infinite Taylor series. The order of a method is determined simply by how many terms are used, and the error is effectively just the rest of those terms that have been truncated, since the complete Taylor series is the complete solution.
Euler’s method is the simplest of all of these ODE integration methods. It is not unlike the trapezoidal method above, in that the idea is to just split a curve up into loads of little straight lines, which are much easier to work with because it’s easy to predict where straight lines are going to end up. Because it deals with straight line approximations, it makes it a first-order estimation.
It works by taking the tangent of a point, then stepping forward in a straight line from that point by one step. At that point, it then re-calculates everything, finds a new tangent, and repeats the process. The following picture shows a visual representation of this, where the red line is the Euler approximation and the blue line is the real solution.
As you can see, its accuracy depends entirely on how fine you make the steps between each point. Thus, to get a high accuracy, you have to crunch a tonne of numbers and it takes ages even on modern computers (I’ve had simulations using this take over an hour for a huge set of flights of a projectile at different launch angles/fuel volumes). That said, it’s pretty handy because it’s ridiculously easy to implement and gives a good excuse to bugger off to the pub for an hour or 2 whilst numbers crunch away.
Mathematically, it takes the form:
Where n is the ‘step’ of the solution (corresponding to A0, A1, etc. above), h is the step size, and f(tn,yn) is a function describing the derivative at point yn – remember, in two dimensions, this is the equation of the tangent at point yn.
The archetypical example of this method is projectile flight, since, as said before, acceleration, velocity and displacement are all differentially related. Additionally, it’s easy to get acceleration from the forces acting on a body.
As such, the solution is as follows:
1. Find acceleration by taking force balance on body, and dividing by mass 2. Calculate velocity off acceleration using Euler’s method 3. Calculate displacement off velocity using Euler’s method 4. Use displacement to update the object’s position 5. Repeat until you reach an end condition (hitting the ground, for example)
The linear motion equations you would’ve learnt in school are an example of Euler’s method at work, and indeed the above example uses these exact equations, only with very small time steps:
Indeed, the displacement of this effectively just takes the first two terms of the Taylor series expansion, as opposed to the one that’s taken for velocity.
The Runge-Kutta method is effectively just a higher-order version of Euler’s method. When we talk about ‘the Runge-Kutta method’, it tends to be the fourth-order one, however it’s also known as the Runge-Kutta method when you’re dealing with other orders (such as third). As such, it’s really a generalised form of Euler’s method. In this way, it’s not unlike what Simpson’s rule is to the Trapezoidal rule.
The Runge-Kutta method is pretty widely used – you’ll tend to find an RK4 command in Matlab and some mathematical simulation-based packages for Python such as Pylab (off memory). This is because it’s simple but powerful, and quite accurate without being overly complex. As such, it’s pretty popular.
Mathematically, the Runge-Kutta method is expressed thus:
That is to say, it takes a weighting of 4 values, which are calculated off derivative equations, rather than just using 1 like Euler’s method. The values are calculated as follows:
That is to say, you are finding the derivative equations, but using various values instead of just the current time and position. For example, k2 uses the time of half a time step ahead of the current, and uses the position of half a step of k1’s derivative function value ahead of the current position. It’s a bit hard to wrap your head around at first, but one just needs to remember that it’s finding a heap of derivative function values and weighting them together to get an answer.
Because of this higher-order approximation, the error is less. As such, you can either have more accurate results, or you can alternatively use less time steps to get the same accuracy and thus get a faster simulation. This is clearly a Good Thing.
One thing of note is that a modified Runge-Kutta method, by a bloke called Fehlberg and creatively named the Runge-Kutta-Fehlberg method, exists and is sometimes used in place. It’s just really a blend of the 5th and 4th order methods with the ability to adapt its step size, which helps give it a bit more accuracy for a small amount of extra calculation.
These methods differ from the above in that they’re used to find the roots of functions – that is, where the function crosses 0. This is especially handy when dealing with derivatives of equations, as the roots of a derivative are the turning points of a function, and thus allow for computer-calculated optimisations if maximums or minimums are required.
This one is the easiest root-finding method – it’s almost trivially simple but as such is extremely easy to implement so I’ll only cover it a little bit.
All that’s involved is that you take two points, x1 and x2, such that you know your root lies in between the two. You then find the midpoint of the two points called, say, x3. You then check whether x3 has the same sign as x1 or x2, by multiplying. If the answer is negative, then they’re obviously different signs. If it’s positive, they’re the same sign.
Once you find which it differs to, you then take the different original point as well as x3 as your two new points. You’ve just, in effect, halved your range and have ensured your root still lies between your two new points. This is because the root is where the equation crosses 0, and thus a point either side of the root will have different signs to one another. You keep repeating this process until your two points are close enough to one another that they’re within an acceptable tolerance – at this point, you can safely say your root is the midpoint (or thereabouts) between your two last points.
Its name comes from that you are, literally, bisecting your target range of numbers, and repeating to do so until you’ve narrowed your range to a point where it’s ‘close enough’. It is a trivial, easily-implemented but ultimately relatively slow method.
Newton’s method is one of the simpler root-finding algorithms that are widely used. It’s roughly equivalent to Euler’s method, in that it is a first-order method of a more generalised concept that can be taken to higher orders for increased accuracy. I will not touch on them, though.
The algorithm is basically just a result of the algebraic manipulation of the derivative of an equation being equal to rise/run, by taking two points, with one being the point trying to be predicted. It takes the form:
In using it, it effectively calculates a tangent to a function, and moves towards where said tangent intersects 0. It then uses the corresponding function value as the new point to work from, and repeats the process until it has reached where the function itself crosses 0. As such, one can let it get more and more accurate simply by running further iterations, until such time as they are below a certain tolerance.
Visually, it appears like so:
You can see, from this example, how it will converge towards the root of the function. By using xn+1 as the second point, the tangent will be very close to intersecting the root of the function and thus will move gradually closer.
There are a few things one must be careful of when using this, though, which will cause the algorithm to fail. The most obvious is that it will go to the closest root, which may cause errors if one particular one is desired. Secondly, if there are turning points in the function between the point where the Newton’s method is, and the root, then it can cause the algorithm to go unstable as the turning point can become horizontal. However, with careful application and an understanding of the problem at hand, these are usually non-issues.
Linear System Methods
The following methods are algorithms that solve linear systems of equations of the form Ax = b, where A is large and sparse (aka mostly full of 0’s). These happen to exist surprisingly more frequently than you’d think, and thus are taught as useful.
To give some syntax, superscripts refer to time steps. As such, xk+1 is the time step one after xk. Subscripts refer to matrix positions. So, Aij refers to the matrix A’s value at position i,j. Similarly, if only one value is listed, say xj, then it’s because x is a vector that only has 1 column, but j rows.
The thing to basically keep in mind is that throughout this section, superscripts are time steps and subscripts are positions in matrices (though I deviate away from this at the end, for reasons explained there). This differs to all the non-matrix stuff, where there are no superscripts, and subscripts are just time steps.
The Jacobi method is the simplest method there is for these things. It consists of you summing up all of the non-diagonal values of Aij, multiplying each one by xj as you go, subtracting the answer from bi, and then dividing it by Aii aka the diagonal value of a for that row.
Mathematically, it looks like this:
As you can see, sum up aijxj for all values of j, except for the diagonal (where i = j). The formula then completes to get an xi value for the time step of k+1. Once all new x values have been found using this, we can then begin the new time step where the xk+1 vector is now the xk vector and things begin anew.
One can express this, similarly, by dividing the matrix into U, L, and D sections where U stands for upper triangle, L for lower triangle, and D for diagonal. An example is shown below in my amazing identity matrix:
In this case, the 1’s are the D section, the top-right 0’s are the U section, and the bottom-left 0’s are the L section.
With this in mind, the Jacobi method’s formula is:
This expresses it in matrix notation – as you can see, all the i’s and j’s are gone – so it allows you to get a better overall feel of what it’s doing. You wouldn’t usually implement it like this, though, as computer programs require you to count up single values and thus the first notation would be more useful.
This method is the slowest of all of the methods in this section I’ll discuss, but it is pretty damn straightforward and thus important to learn as it acts as the cornerstone for others. You’ll probably never bother implementing it, though, as there are very trivial modifications to it that make it better that will be covered in the following sections.
The GS method is pretty much the Jacobi method, but with the change that someone realised that when calculating out the Jacobi method at step i, we were still using xk for all values of x where j=1 to i-1, even though by that point we’d calculated out xk+1. This isn’t logical, since we could instead use the xk+1 values, which are newer and thus hopefully closer to the true solution. As such, it takes the mathematical form:
As you can see, very similar to the Jacobi method, except for all values of j<i, we use xk+1 because it has already been calculated for j=1 to i-1.
All of this basically means that, if you’re going to ever implement one of the two above methods, you’ll just do the Gauss-Seidel method, as it’s pretty much just as easy to implement as the Jacobi method, but converges faster and thus takes less time to solve. For that reason, there’s no real reason to choose Jacobi over Gauss-Seidel.
Successive Over Relaxation
SOR is a change to the GS method, which works by basically ‘magnifying’ changes made, in the vaguely reasonable hope that it pushes to the solution faster. Realistically, this tweak can really be applied to any solver that’s taking its sweet-arse time, but it tends to be traditionally related to the GS method.
It works by introducing an over relaxation parameter, known as ω, which acts not unlike a gain in a control system would, by multiplying changes to accelerate any difference made.
Mathematically, it looks like this:
With a value of ω=1, this reduces to the GS method.
You may be asking ‘where did that 1-ω part come from?’ If you think about it, the GS method did not add the current xi value on, however you could have alternatively put it on, and then subtracted it by including it in the sum part because it’s effectively then xj - (aijxj)/aij, which cancels to 0 anyway, as opposed to summing all the non-diagonal values only. This just includes it completely as it is modified by the ω value.
With ω, making it a value greater than or equal to 2 will result in divergence of the answer. As such, it’s generally best to pick ω as just less than 2 – I’ve seen examples where it used 1.9975, for example. However, ω can be used in more creative ways to, say, tame a particularly unstable problem by setting it to be < 1. As such, it adds a bit more flexibility to the GS method and is a good change.
The downside is that trying to choose a good ω for a given problem is pretty hard, so some smart people in the world decided to fix this problem.
Conjugate Gradient Method
This method is fundamentally different to the above methods in that it tackles the whole problem completely differently. The upside of this is that it’s a tonne faster. The downside is that it’s more complex.
This method bases itself off what’s called the residual, which is simply the difference between Ax and b, as shown below:
This is obviously the error in the answer, as the residual will be 0 when Ax = b, and thus wants to be minimised. As such, if we were to create a 3d plot of this process, we’d want to move down the gradient (slope) of the plot towards the minimum point. The following figure visualises this well:
To do this, we invent a new variable which determines our search direction, the gradient vector, denoted p, and it has the form:
Obviously, this is a problem for P0, so at that, we just assume it equals the residual. For coding nicety, we normally split that big ugly transpose-containing term into its own variable, ϐ, as shown below:
Anyhow, that gives us the following relationship:
which you can note can just become an addition if the negative from the second form of the ϐ equation is dropped, as is normally the case.
With this direction vector, we are able to step our guess of xk by one towards the bottom of the gradient, which corresponds to the result (as it is the converging point). We do this by solving the following equation:
Similarly, we can update rk with a similar equation, listed below:
In both of these cases, α is equal to the following:
With these relations, one is able to iterate the entire process, starting with calculating α, then xk+1 and rk+1, and if tolerance in rk+1 has not been reached (aka: it is not accurate enough yet), calculate ϐ and then use it to calculate pk+1, which gives you your new direction to head in.
The major advantage of this approach is that it takes a huge amount less steps to converge. That is, for an n*n matrix, convergence is reached within n steps. This means if your matrix is 100x100, then it takes, at most, 100 steps to reach an answer.
Throughout this, I’ve detailed a bunch of various numerical methods algorithms, with a basic detail of how they’re used. Despite this, I’ve not covered how they’re usefully applicable to the real world. I do not plan to cover this in this, as I only wanted to explain several ways of doing numerical analysis.
In a future post, I may consider covering how these methods can be applied to solve various problems, such as heat flow, projectile motion, and the like. Nevertheless, this should give you a solid basis to start with, and with a small amount of research, one can find many tasks that they can be applied to.
It should be noted that this is not designed to be a reputable source of information – I’ve tried to get everything right, but I have written this entirely in my spare time with minimal proofreading. As such, whilst it may be used for reference or learning, like any good student you should cross-check with another source.
I hope I’ve proven enlightening to some of you, perhaps fuelling an idle curiosity into something more by prompting further study in this field. If this is well-received, I’m considering covering other topics, which are a bit more in-depth but build off this, however I would most likely first cover the application of these solvers in real-world problems, as mentioned above. Nevertheless, I would not be against writing up a similar post for some CFD or rarefied gas dynamics simulation algorithms, though they may take me some more time than this did.
Back to Main Page