Recommended Posts

Posted

Say a scientist gave you a set of values of A from an experiment in which the value of B was varied. You're told that the experiment has been repeated and that the values of A are consistent (within an acceptably tiny margin of error) on each run.

At first glance there is nothing as simple as the value of A just going up as B increases, but there is definitely a connection, so the scientist has asked you, the mathematician, to describe the relationship between these two variables.

 

How do you begin attacking this problem? Where do you start when all you have is a sheet of numbers?

Posted

There is a statistical method of determining a mathematical model of data, though in industry it is usually restricted to polynomial relationships. It is known as "Statistical Design of Experiments" (DOE), originally developed by Fisher. Through appropriate choice of factors and experiment settings, one can perform just the required number of experiments (no more) to determine the relationship sought (within statistical tolerances).

 

In industry, this method is heavily used to determine optimal operating conditions: for example, in chemical production plants and in virtually every semiconductor fab in the world. Though it is not typically used for open-ended exploration of a system's response, I don't see why it couldn't be. There are several books and websites that go over the mathematics behind this method.
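
As a rough sketch of the basic idea (not Fisher's full methodology), a two-level full factorial design runs every combination of low/high settings of each factor and fits a linear model with interaction terms. The function name and the response values below are made up for illustration:

    import itertools
    import numpy as np

    def full_factorial(n_factors):
        # All 2^n combinations of low (-1) and high (+1) factor settings.
        return np.array(list(itertools.product([-1.0, 1.0], repeat=n_factors)))

    design = full_factorial(2)                  # 2 factors -> 4 runs
    response = np.array([3.1, 5.2, 4.0, 9.8])   # hypothetical measurements

    # Model: y = b0 + b1*x1 + b2*x2 + b12*x1*x2
    X = np.column_stack([np.ones(4), design[:, 0], design[:, 1],
                         design[:, 0] * design[:, 1]])
    coeffs, *_ = np.linalg.lstsq(X, response, rcond=None)
    print(coeffs)  # intercept, main effects b1 and b2, interaction b12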

Posted

If possible, I would perform many more experiments than the number of parameters in your model calls for. Indeed, if your model has just two parameters A and B, e.g. Y = AX + B, then you could do two experiments, obtain a set of two equations in A and B, and solve for those. However, the result will be very sensitive to experimental error. A much better approach is to derive many more equations and find the best possible fit to the data. Only if obtaining data points is very expensive (e.g. each experiment would require a $$$$$ investment) or dangerous is the method of simply solving the set of equations useful; in general, the more data points, the better.

 

If you have a very simple model Y = AX + B, then use this simple technique:

 

http://www.ies.co.jp/math/java/misc/least_sq/least_sq.html

http://www.efunda.com/math/leastsquares/lstsqr1dcurve.cfm

 

This is simply a formula for deriving A and B in a robust way, using all of your data points.
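
For reference, the computation behind those links fits in a few lines. A minimal sketch in Python (the data here is hypothetical, roughly following Y = 2X + 1):

    import numpy as np

    def fit_line(x, y):
        # Closed-form least-squares fit of y = A*x + B.
        n = len(x)
        sx, sy = x.sum(), y.sum()
        sxx, sxy = (x * x).sum(), (x * y).sum()
        A = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        B = (sy - A * sx) / n
        return A, B

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    print(fit_line(x, y))  # approximately (2.0, 1.0)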

 

----------------------------------------------------------------------------------------------

 

In general, you can use least squares techniques to find an optimal set of parameters (A, B, C, ...) such that the norm of the error vector ||Y - Ŷ|| for the given data set is as small as possible. Here Y is the vector of all measured data values and Ŷ is the vector of the corresponding values predicted by the model with parameters A, B, C, ... at the data points X.

 

If the model is linear in the parameters A, B, C, etc. (it does not need to be linear in the quantities X and Y, and there may even be more related quantities Z, etc.), then the least squares method is easily applied. If the model is non-linear in A, B, C, etc., then one can still use the least squares method, but the math is much more involved: you'll have to resort to iterative methods, and you need the Jacobian of the model with respect to A, B, C, etc.
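
In practice one usually lets a library handle that iteration. A sketch using SciPy's curve_fit (which wraps such an iterative least-squares solver and can approximate the Jacobian numerically), with a made-up exponential model:

    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, a, b, c):
        # Non-linear in the parameters a, b, c.
        return a * np.exp(b * x) + c

    # Hypothetical noisy data generated from a=2, b=0.5, c=1:
    x = np.linspace(0.0, 4.0, 20)
    y = 2.0 * np.exp(0.5 * x) + 1.0 + np.random.normal(0.0, 0.1, x.size)

    params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0])
    print(params)  # should come out close to [2.0, 0.5, 1.0]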

 

For the linear case, the method can easily be expressed in terms of simple linear algebra:

 

http://en.wikipedia.org/wiki/Linear_least_squares
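
To make the linear-algebra formulation concrete: a quadratic model Y = A + BX + CX² is non-linear in X but linear in the parameters, so a single matrix solve suffices. A sketch with hypothetical data close to Y = 1 + 2X + 0.5X²:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 3.4, 7.2, 11.4, 17.1, 23.6])

    # Design matrix: one column per parameter (A, B, C).
    M = np.column_stack([np.ones_like(x), x, x * x])

    # Solve min ||y - M p|| over p = (A, B, C).
    p, *_ = np.linalg.lstsq(M, y, rcond=None)
    print(p)  # approximately [1.0, 2.0, 0.5]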

 

This method is VERY robust: if one of the data points has a large error (e.g. from a badly performed experiment) and the number of data points is fairly large, then its influence is not that bad. If instead you perform only precisely as many experiments as you have parameters, the result of your computation will be very sensitive to experimental errors. Least squares is a method of using all data points in a fair way.

 

One could even go a step further. If some data points are trusted more than others, one can assign a weight to each data point. If the model is linear in the parameters, such weights can be taken into account using a mechanism very similar to the one described in the link above. This can make the derivation of the model even more robust: your single bad experiment can still be used, but to a lesser extent.
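
A sketch of that weighting idea: multiply each equation by the square root of its weight (which amounts to minimizing the weighted sum of squared errors) and solve as before. The function name and data below are made up:

    import numpy as np

    def weighted_line_fit(x, y, w):
        # Weighted least-squares fit of y = A*x + B; scaling each row by
        # sqrt(w) minimizes sum(w * (y - y_hat)**2).
        sw = np.sqrt(w)
        M = np.column_stack([x, np.ones_like(x)]) * sw[:, None]
        p, *_ = np.linalg.lstsq(M, y * sw, rcond=None)
        return p  # (A, B)

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 3.1, 5.0, 30.0])   # last point: a suspect experiment
    w = np.array([1.0, 1.0, 1.0, 0.05])   # so it is down-weighted
    print(weighted_line_fit(x, y, w))     # close to the fit without the outlier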
