Hi All,
Does anyone know of a statistic that can assess the goodness of predicting out-of-sample values? For example, I have N data points which I use to fit a model with p parameters. I use that model to predict the values of M data points. I then compare these predictions with the actual values for the M data points. Aside from chi-square, do you know of any other statistic that can be used to measure the goodness of out-of-sample predictions?
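For concreteness, here is a minimal sketch (Python/NumPy, with made-up data, an assumed straight-line model, and an assumed known measurement error) of the setup I mean - fit on N points, predict M new points, and compute chi-square on the comparison:

import numpy as np

# Made-up example: N = 20 points used to fit a straight line (p = 2 parameters),
# M = 8 additional points used only to check the predictions.
rng = np.random.default_rng(42)
x_fit = np.linspace(0.0, 10.0, 20)
y_fit = 1.0 + 0.3 * x_fit + rng.normal(scale=0.2, size=x_fit.size)
x_out = np.linspace(10.5, 15.0, 8)
y_out = 1.0 + 0.3 * x_out + rng.normal(scale=0.2, size=x_out.size)

slope, intercept = np.polyfit(x_fit, y_fit, 1)        # fit uses only the N points
y_pred = intercept + slope * x_out                    # predictions for the M points

sigma = 0.2                                           # assumed known measurement error
chi_square = np.sum(((y_out - y_pred) / sigma) ** 2)  # the chi-square statistic mentioned above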
Namir
Namir,
IIRC, usually the model predicts certain values and the data points scatter around these predicted ('expected') values. If that scatter follows a normal distribution (or something which can be transformed into such a distribution), then confidence intervals can be calculated around the model curve. Typically these intervals are not rectangular, even in a coordinate system where the model corresponds to a straight line. Within these intervals the data points shall be found with the stated confidence. If that's what you're looking for, then I'll have to dig through my old files to find the exact way it's done - it was definitely not chi-square. I don't remember anything else along those lines.
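As a rough illustration of what such a band looks like for a straight-line model (a sketch with made-up data and a 95% prediction interval; not necessarily the exact procedure referred to above):

import numpy as np
from scipy import stats

# Made-up calibration data scattered around a straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 25)
y = 2.0 + 0.5 * x + rng.normal(scale=0.4, size=x.size)

# Ordinary least-squares fit y = b0 + b1*x.
n = x.size
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))        # residual standard error
sxx = np.sum((x - x.mean()) ** 2)

# 95% prediction interval at new x values; the half-width grows away from
# x.mean(), which is why the band is not rectangular even for a straight line.
x_new = np.linspace(0.0, 10.0, 100)
y_new = b0 + b1 * x_new
half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / sxx)
lower, upper = y_new - half, y_new + half        # data points should fall inside with ~95% confidence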
d:-)
If you are looking for an overall summary of how well your model fits the observed data, then an alternative is to work directly with the likelihood ratio statistic, sometimes expressed on an additive scale as -2 ln likelihood. For categorical data this can be calculated simply as the G^2 statistic.
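A minimal sketch of the G^2 calculation for categorical counts (the observed and model-expected counts below are made up):

import numpy as np
from scipy import stats

observed = np.array([18.0, 55.0, 27.0])   # observed category counts (made up)
expected = np.array([20.0, 50.0, 30.0])   # counts predicted by the model

# G^2 = 2 * sum O_i * ln(O_i / E_i); scipy.stats.power_divergence with
# lambda_="log-likelihood" computes the same quantity.
g2 = 2.0 * np.sum(observed * np.log(observed / expected))
p_value = stats.chi2.sf(g2, df=observed.size - 1)   # df here assumes no parameters were fitted to these counts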
However, if your interest is more in identifying outlying individual observations, then a calculation of residual values for each observation can be useful (e.g., Pearson residuals or deviance residuals), particularly if calibrated as studentized values. Pearson residuals are the components that make up the Pearson X^2 statistic, while deviance residuals combine to form -2 log likelihood, known as the deviance.
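To make that concrete, here is a sketch of both residual types, written for Poisson counts as an illustrative assumption (the same idea carries over to other models):

import numpy as np

observed = np.array([18.0, 55.0, 27.0])   # made-up counts
expected = np.array([20.0, 50.0, 30.0])   # fitted (expected) counts from the model

# Pearson residuals: their squares sum to Pearson's X^2.
pearson_resid = (observed - expected) / np.sqrt(expected)
x2 = np.sum(pearson_resid ** 2)

# Deviance residuals (Poisson form): their squares sum to the deviance.
dev_resid = np.sign(observed - expected) * np.sqrt(
    2.0 * (observed * np.log(observed / expected) - (observed - expected))
)
deviance = np.sum(dev_resid ** 2)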
Finally, one can reduce over-fitting of a model by using a training sample of observations to estimate the model and then a separate testing set to evaluate its fit (which seems to be along the lines of what you have described). The Prediction Error Sum of Squares (PRESS) is a summary measure of the fit of a regression model to the set of observations that were not themselves used in estimating the model. It is the sum of squares of the prediction residuals for those observations.
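A minimal sketch of that train/test idea with a straight-line model (the data and the split are made up):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 0.5 * x + rng.normal(scale=0.4, size=x.size)

train = np.arange(x.size) < 30           # first 30 points estimate the model...
test = ~train                            # ...the remaining 10 only evaluate it
slope, intercept = np.polyfit(x[train], y[train], 1)

pred_resid = y[test] - (intercept + slope * x[test])
press = np.sum(pred_resid ** 2)          # prediction error sum of squares on the test set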
Nick
I'm curious, Namir: do you want to predict values outside of your sample data set, as opposed to inside it?