*holdout* data. When the holdout data is used in this manner, it is called

*test* data in data science. Using a single holdout split can still be misleading, since getting good or bad performance may simply be a matter of chance. A more sophisticated holdout-and-testing procedure is

*cross-validation*. It uses the existing data in a smarter way: instead of splitting the data into one training set and one test set, cross-validation estimates performance over all the data by performing multiple splits and systematically swapping samples out for testing. This lets you compute statistics on the estimated performance, such as the mean and the variance, so you can understand how the model's performance varies across data sets. Cross-validation begins by splitting a labeled data set into k partitions called

*folds*. In each iteration, a different fold is held out as the test data while the remaining folds are used for training. Beyond evaluation, overfitting can also be controlled directly: the chance of overfitting increases as the model becomes more complex. For instance, if we increase the number of leaves in tree induction, the model becomes more prone to overfitting, so tree induction commonly limits the number of leaves. In linear models such as logistic regression, complexity can be controlled by choosing the right set of attributes. The general strategy is to find models that not only fit the data but are also simpler. This methodology is called

*regularization*. We essentially penalize the model if its weights become large. The most commonly used penalty is the sum of the squares of the weights, sometimes called the (squared) "L2 norm" of the weights. This penalty becomes large whenever any of the weights are large in absolute value. The

*ridge regression* is the name of the procedure when the L2-norm penalty is used in standard least-squares linear regression. If instead we use the sum of the absolute values of the weights, known as the L1 norm, we get a procedure known as

*lasso* (Hastie et al., 2009). More generally, this is called L1-regularization, which ends up zeroing out many coefficients entirely. Happy holidays and avoid overfitting! Reference: "Data Science for Business" by Foster Provost and Tom Fawcett.
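The fold-splitting step of cross-validation described above can be sketched in pure Python. The function name `k_fold_splits` and the 10-sample example are illustrative assumptions, not from the book:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs: the data is cut into k
    folds, and each fold takes one turn as the test data while the
    remaining folds form the training data."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    folds[-1].extend(indices[k * fold_size:])  # leftover samples join the last fold
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Every sample lands in exactly one test fold, so the model is evaluated
# on all the data, and the per-fold scores give a mean and a variance.
splits = list(k_fold_splits(10, 5))
print(len(splits))                              # 5 train/test splits
print(sorted(i for _, t in splits for i in t))  # each of 0..9 is tested exactly once
```

Because each sample is tested exactly once, averaging the per-fold performance yields the cross-validated estimate the text describes.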
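The different shrinkage behaviour of the two penalties can be illustrated with a single-feature, no-intercept linear regression, where both ridge and lasso have simple closed-form solutions. The function names and toy data below are illustrative assumptions, not the book's notation:

```python
def ridge_slope(xs, ys, lam):
    """One-feature ridge: minimize sum((y - w*x)^2) + lam * w^2.
    The closed form inflates the least-squares denominator, so w
    shrinks smoothly toward zero as lam grows but never reaches it."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """One-feature lasso: minimize sum((y - w*x)^2) + lam * |w|.
    The closed form soft-thresholds the least-squares numerator, so w
    hits exactly zero once lam is large enough."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * max(abs(sxy) - lam / 2, 0.0) / sxx

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # true slope is 2
print(ridge_slope(xs, ys, 0.0))    # 2.0 (no penalty: ordinary least squares)
print(ridge_slope(xs, ys, 56.0))   # 0.4 (shrunk, but still nonzero)
print(lasso_slope(xs, ys, 56.0))   # 0.0 (zeroed out entirely)
```

At the same penalty strength, ridge merely shrinks the coefficient while lasso drives it exactly to zero, which is why L1-regularization ends up zeroing out many coefficients.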