Overfitting and Its Avoidance

Assume you work in a company and your boss asks you to build a model that predicts customers’ tendency to accept a special offer. You build a model and get an almost perfect result (around 99% accuracy). Your boss might think you are a genius, until he realizes that the model does not work well on new data. This phenomenon is called overfitting in data science. We want to avoid building a model that yields perfect results on the existing data but does not generalize to new data sets. As the Nobel laureate Ronald Coase said, “If you torture the data long enough, it will confess.”

The general accuracy of the model should therefore be tested on another data set that was not used to build the model but for which the target variable is known. This is called holdout data; when the holdout data is used in this manner, it is called test data in data science. Using a single holdout set might still be misleading, since getting good or bad performance on it can be a matter of chance. A more sophisticated holdout-and-testing procedure is cross-validation, which uses the existing data in a smarter way. Instead of splitting the data into one training set and one test set, cross-validation estimates the performance over all of the data by performing multiple splits and systematically swapping out the samples used for testing. This allows you to compute statistics on the estimated performance, such as the mean and the variance, so that we can understand how the performance of the model varies across data sets. Cross-validation begins by splitting a labeled data set into k partitions called folds. In each of k iterations, a different fold is chosen as the test data and the model is trained on the remaining k-1 folds. (A minimal code sketch of holdout testing and cross-validation appears at the end of this post.)

The chance of overfitting increases as the model becomes more complex. For instance, if we increase the number of leaves in tree induction, our model becomes more prone to overfitting. Therefore, tree induction commonly limits the number of leaves to avoid overfitting (see the second sketch at the end of this post). In linear models such as logistic regression, complexity can be controlled by choosing the right set of attributes.

The general strategy is to find models that not only fit the data but are also simpler. This methodology is called regularization: we essentially penalize the model as its weights grow. The most commonly used penalty is the sum of the squares of the weights, sometimes called the “L2-norm” of the weights. This approach gives a large penalty when the weights have large absolute values. Ridge regression is the name of the procedure when the L2-norm penalty is used in standard least-squares linear regression. If instead we use the sum of the absolute values of the weights, known as the L1-norm, we get a procedure known as the lasso (Hastie et al., 2009). More generally, this is called L1-regularization, which ends up zeroing out many coefficients (see the third sketch at the end of this post).

Happy holidays and avoid overfitting!

Reference: “Data Science for Business” by Foster Provost and Tom Fawcett.
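
To make the holdout and cross-validation ideas concrete, here is a minimal sketch using scikit-learn. The synthetic data set and the decision-tree classifier are placeholders of my own choosing, not an example from the book:

```python
# Minimal sketch of holdout evaluation and k-fold cross-validation.
# The synthetic data and the classifier are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)

# Simple holdout: fit on the training split, score on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy: %.3f" % model.score(X_test, y_test))

# 5-fold cross-validation: each fold serves once as the test data.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("cross-validation accuracy: mean=%.3f, std=%.3f" % (scores.mean(), scores.std()))
```

The mean tells you the expected performance on unseen data; the standard deviation tells you how much that performance swings from fold to fold.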
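
The second sketch illustrates, on the same kind of synthetic data, how limiting the number of leaves controls the complexity of a decision tree (here via scikit-learn’s max_leaf_nodes parameter). An unrestricted tree will typically score near-perfectly on the training data while its test accuracy lags behind:

```python
# Sketch: limiting tree complexity via the number of leaves.
# Compare training vs. test accuracy as max_leaf_nodes grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n_leaves in (4, 16, 64, None):  # None means the tree may grow without limit
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    print("max_leaf_nodes=%s  train=%.3f  test=%.3f"
          % (n_leaves, tree.score(X_train, y_train), tree.score(X_test, y_test)))
```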
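
Finally, a sketch contrasting the L2-norm (ridge) and L1-norm (lasso) penalties. The synthetic regression data is again only illustrative; the point is that the lasso drives many coefficients exactly to zero while ridge regression does not:

```python
# Sketch: L2-norm (ridge) versus L1-norm (lasso) penalties.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # penalty: sum of squared weights (L2)
lasso = Lasso(alpha=1.0).fit(X, y)  # penalty: sum of absolute weights (L1)

print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))
```

The same idea carries over to classification: scikit-learn’s LogisticRegression accepts penalty="l1" (with solver="liblinear") or penalty="l2" to regularize the logistic model in the same spirit.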