Predictive modeling is useful for understanding or predicting a target quantity. In business, this quantity might be something we want to avoid. For example, you may want to predict whether a customer will leave the company when his or her contract expires. Or it might be something we would be happy to see happen, such as which customers are most likely to respond to a special offer. The key step in predictive modeling is identifying the attributes (features). A supervised learning model describes a relationship between these attributes and the target variable. In business, you may want to model a customer's probability of leaving the company as a function of account attributes such as age, income, and data usage. The obvious question is: “Which of these attributes would best segment these people into groups, in a way that distinguishes those who leave from those who do not?”

A *purity measure* quantifies how well an attribute segments the data according to a given target. For instance, if we segment the data into two groups, we want each subset to contain examples that belong to only one target class. Such subsets are called pure. One of the most common purity measures is *information gain*, which is based on *entropy*. You may have proved a lot of fancy theorems in your information theory class, but it is not as complicated in the business world. Entropy is defined as:

$$\text{entropy} = -\sum_{i} p_i \log_2(p_i) = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \cdots$$

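Each $p_i$ is the probability (relative frequency) of property $i$ within the set. High values of entropy indicate impurity. We can then define *information gain* (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates. The definition of IG is:

$$\text{IG}(\text{parent}, \text{children}) = \text{entropy}(\text{parent}) - \sum_{j} p(c_j)\,\text{entropy}(c_j)$$

where $c_j$ is the $j$-th child subset and $p(c_j)$ is the proportion of the parent's instances that fall into it.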
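
To make entropy and information gain concrete, here is a minimal Python sketch. It assumes base-2 logarithms and weights each child's entropy by its share of the parent's instances. The function names and the toy customer labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: 10 customers, 5 who leave and 5 who stay.
parent = ["leave"] * 5 + ["stay"] * 5
# Hypothetical split on some attribute into two mostly-pure groups.
children = [["leave"] * 4 + ["stay"], ["leave"] + ["stay"] * 4]

print(round(entropy(parent), 3))                     # maximally impure parent
print(round(information_gain(parent, children), 3))  # gain from the split
```

A perfectly mixed two-class parent has entropy 1 bit, so any gain reported here is the fraction of that uncertainty the split removes.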

So far we have only discussed impurity measures for categorical targets. A natural measure of impurity for a numerical target is *variance*. A notion similar to IG can be formalized by looking at the reduction in variance between the parent and its children.
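
The variance-based analogue of information gain can be sketched the same way. This example assumes population variance and a size-weighted average over the children, which is one common convention rather than the only one. The function name and the numbers are hypothetical.

```python
from statistics import pvariance


def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * pvariance(child) for child in children)
    return pvariance(parent) - weighted


# Hypothetical numeric target (e.g. monthly spend) for six customers.
parent = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
# Hypothetical split that separates low spenders from high spenders.
children = [[10.0, 12.0, 11.0], [30.0, 32.0, 31.0]]

print(variance_reduction(parent, children))
```

A split that groups similar values together leaves little variance in each child, so the reduction is close to the parent's full variance.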

Tree induction is one of the most commonly used techniques in data mining because it is easy to understand. It starts with the whole data set and recursively creates the “purest” possible subgroups using the attributes.
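
The recursive idea can be sketched in a few lines of Python. This is an illustrative sketch, not a production algorithm: it assumes categorical attributes, picks the attribute with the highest information gain at each step, and stops when a subgroup is pure or no attributes remain. All names (`build_tree`, `usage`, `income`, `churn`) and the four example rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def split(rows, attr):
    """Group rows by their value of a categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def gain(rows, attr, target):
    """Information gain from splitting rows on attr."""
    parent = [r[target] for r in rows]
    weighted = sum(
        len(g) / len(rows) * entropy([r[target] for r in g])
        for g in split(rows, attr).values()
    )
    return entropy(parent) - weighted


def build_tree(rows, attrs, target):
    """Return a nested dict tree; leaves are majority class labels."""
    labels = [r[target] for r in rows]
    # Stop when the subgroup is pure or no attributes are left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return {best: {value: build_tree(group, remaining, target)
                   for value, group in split(rows, best).items()}}


# Hypothetical customer records.
rows = [
    {"usage": "high", "income": "low", "churn": "leave"},
    {"usage": "high", "income": "high", "churn": "leave"},
    {"usage": "low", "income": "low", "churn": "stay"},
    {"usage": "low", "income": "high", "churn": "stay"},
]
tree = build_tree(rows, ["income", "usage"], "churn")
print(tree)
```

In this toy data `usage` separates the classes perfectly while `income` carries no information, so the sketch splits on `usage` once and stops.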

In many prediction problems, we may want a more informative prediction than just a classification. For example, we can use the probability of leaving instead of a direct class assignment. A frequency-based estimate can be used to assign class probabilities. For example, if a leaf contains $n$ positive instances and $m$ negative instances, the probability of any new instance being positive may be estimated as $n/(n+m)$. This estimate may be overly optimistic if a leaf contains too few instances. This is called the *overfitting* problem in data science. The Laplace correction can be used to moderate the influence of leaves with only a few instances:

$$p(c) = \frac{n + 1}{n + m + 2}$$
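
A short Python comparison shows the effect of the correction. It assumes the standard binary form of the Laplace correction, $(n+1)/(n+m+2)$, where $n$ and $m$ are the positive and negative counts in a leaf. The function names and the leaf counts are hypothetical.

```python
def frequency_estimate(n, m):
    """Raw frequency-based probability of the positive class in a leaf."""
    return n / (n + m)


def laplace_estimate(n, m):
    """Laplace-corrected probability of the positive class (binary case)."""
    return (n + 1) / (n + m + 2)


# Hypothetical leaves: both are 100% positive, but with very different support.
for n, m in [(2, 0), (20, 0)]:
    print(n, m, frequency_estimate(n, m), round(laplace_estimate(n, m), 3))
```

Both leaves have a raw estimate of 1.0, but the corrected estimate for the two-instance leaf is pulled much further toward 0.5 than the estimate for the twenty-instance leaf.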