A *purity measure* can quantify how well an attribute segments the data according to a given target. For instance, if we segment the data into two categories, we want each subset to contain examples belonging to only one target (class); such subsets are called pure. One of the most common purity measures is *information gain*, which is based on *entropy*. You may have proved a lot of fancy theorems in your information theory class, but it is not as complicated in the business world. Entropy is defined as

$$\mathrm{entropy} = -p_1 \log_2(p_1) - p_2 \log_2(p_2) - \cdots$$

where each $p_i$ is the probability (relative frequency) of property $i$ within the set. High values of entropy indicate impurity.

We can then define *information gain* (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates. The definition of IG is

$$IG(\text{parent}, \text{children}) = \mathrm{entropy}(\text{parent}) - \big[\, p(c_1)\,\mathrm{entropy}(c_1) + p(c_2)\,\mathrm{entropy}(c_2) + \cdots \big]$$

where each $c_i$ is a child subset and $p(c_i)$ is the proportion of instances that fall into it.

So far we have only discussed impurity measures for a categorical prediction. A natural measure of impurity for a numerical value is *variance*, and a notion similar to IG can be formalized by looking at the reduction in variance between the parent and its children.

Tree induction is one of the most commonly used techniques for data mining, as it is easy to comprehend. Tree induction starts with the whole data set and creates the "purest" subgroups using the attributes. In many prediction problems, we may want a more informative prediction than just a classification. For example, we can use the probability of leaving instead of a direct class assignment. A frequency-based estimate can be used to assign class probabilities: if a leaf contains $n$ positive and $m$ negative instances, the probability of any new instance being positive may be estimated as $n/(n+m)$. We may be overly optimistic if there are not enough instances in a leaf; this is called the *overfitting* problem in data science. The Laplace correction can be used to moderate the influence of leaves with only a few instances:

$$p = \frac{n+1}{n+m+2}$$
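To make the entropy and information gain definitions concrete, here is a minimal Python sketch; the function names (`entropy`, `information_gain`) and the example split are illustrative, not from the text.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a set of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# A perfectly balanced (maximally impure) parent set:
parent = ["stay"] * 5 + ["leave"] * 5
print(entropy(parent))  # 1.0 for a 50/50 split

# A split into two perfectly pure children recovers all of that entropy:
pure_split = [["stay"] * 5, ["leave"] * 5]
print(information_gain(parent, pure_split))  # 1.0
```

A less clean split (say, children that are 80/20 mixtures) would yield an information gain strictly between 0 and 1, which is what tree induction compares across candidate attributes.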
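The frequency-based leaf estimate and its Laplace-corrected version can likewise be sketched in a few lines, assuming the two-class (positive/negative) case described above; the helper names are hypothetical.

```python
def frequency_estimate(n_pos, n_neg):
    """Raw leaf probability of the positive class: n / (n + m)."""
    return n_pos / (n_pos + n_neg)

def laplace_estimate(n_pos, n_neg):
    """Laplace-corrected leaf probability: (n + 1) / (n + m + 2)."""
    return (n_pos + 1) / (n_pos + n_neg + 2)

# A leaf holding just 2 positives claims complete certainty...
print(frequency_estimate(2, 0))   # 1.0
# ...while the Laplace correction tempers the estimate toward 0.5:
print(laplace_estimate(2, 0))     # 0.75
# With more evidence in the leaf, the two estimates converge:
print(frequency_estimate(20, 0))  # 1.0
print(laplace_estimate(20, 0))    # 21/22, about 0.9545
```

This illustrates the moderation effect: the fewer instances a leaf contains, the more the correction pulls its probability toward 0.5.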