Say you tackle a binary classification problem with several different models and you want to find out which model yields the best results. The evaluation depends on the application, but there are some common frameworks for this task. Accuracy is probably the simplest evaluation criterion: it is defined as the ratio of the “number of correct decisions made” to the “total number of decisions made”, which is equivalent to 1 - errorRate. This metric is commonly used in data mining because it reduces a classifier’s performance to a single number. Unfortunately, it is too simplistic and has some well-known issues.

Let’s first introduce the notion of a confusion matrix. This is a 2×2 matrix where the columns are labeled with the actual classes and the rows with the predicted classes. It can be generalized to an n×n matrix for problems involving n labels. The matrix lets us analyze the different sorts of errors separately. The diagonal entries count correctly predicted instances, and the off-diagonal entries count either false positive errors (negative instances classified as positive) or false negative errors (positive instances classified as negative).
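To make the layout concrete, here is a minimal sketch that counts the four cells of a 2×2 confusion matrix in plain Python. The function name `confusion_counts`, the encoding of labels as 1 (positive) and 0 (negative), and the toy label lists are all assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Count (tp, fp, tn, fn) for binary labels, assuming 1 = positive and 0 = negative."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1   # positive instance predicted positive
        elif p == 1 and a == 0:
            fp += 1   # negative instance predicted positive
        elif p == 0 and a == 0:
            tn += 1   # negative instance predicted negative
        else:
            fn += 1   # positive instance predicted negative
    return tp, fp, tn, fn


# Made-up labels, purely for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)

# Rows are predicted classes, columns are actual classes.
matrix = [[tp, fp],
          [fn, tn]]

# Accuracy is the sum of the diagonal over the total number of decisions.
accuracy = (tp + tn) / len(actual)
print(matrix, accuracy)
```

The diagonal of `matrix` holds the correct decisions, so accuracy throws away exactly the information in the two off-diagonal cells.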

However, relying on a metric such as accuracy can be problematic for unbalanced classes. It is very common that the interesting instances are rare. For example, the number of customers who own your insurance policy is much smaller than the number who don’t (you would be rich otherwise and would not be reading this blog). Under such circumstances, you can get a very high accuracy by just classifying every instance as negative. Obviously, this is not what you want.
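A small sketch shows how misleading this can be. The data below is invented (10 buyers among 1000 customers), and the "model" simply predicts negative for everyone.

```python
# Hypothetical, made-up data: 10 buyers (label 1) among 1000 customers.
actual = [1] * 10 + [0] * 990

# A "model" that labels every customer as negative.
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
buyers_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy = {accuracy:.2%}")      # accuracy = 99.00%
print(f"buyers found = {buyers_found}")  # buyers found = 0
```

The accuracy looks excellent, yet the model never identifies a single customer of interest.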

Ideally, you want to estimate the cost or benefit of each decision a classifier can make. Say you gain $100 for each customer who buys your insurance policy and spend $1 per customer on advertising. Your profit would be $99 for every customer who buys the policy, and your loss would be $1 for every customer who doesn’t buy. Your benefit will be greater than zero even if only one customer among 99 buys your insurance policy: $99 earned against $98 lost. If we can estimate a customer’s likelihood of buying or not buying your insurance policy, mined from historical data, we can estimate the expected benefit. We can then target a customer as long as the probability of buying is greater than the threshold at which the expected profit becomes positive (with these numbers, just above 1%). Hence, instead of computing the accuracies of the competing models, we can compute expected values.
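The arithmetic behind this can be sketched as follows. The $99 gain and $1 loss come from the example above; the function names and the list of purchase probabilities are hypothetical, standing in for whatever a model estimated from historical data.

```python
GAIN_IF_BUYS = 99.0   # $100 from the sale minus $1 of advertising
LOSS_IF_NOT = -1.0    # $1 of advertising wasted


def expected_value(p_buy):
    """Expected profit of targeting one customer with purchase probability p_buy."""
    return p_buy * GAIN_IF_BUYS + (1.0 - p_buy) * LOSS_IF_NOT


def break_even_probability():
    """Solve p * gain + (1 - p) * loss = 0 for p."""
    return -LOSS_IF_NOT / (GAIN_IF_BUYS - LOSS_IF_NOT)


# Hypothetical model outputs: estimated probability of buying per customer.
estimated_probabilities = [0.002, 0.02, 0.5, 0.009]

# Target only the customers whose expected profit is positive.
targets = [p for p in estimated_probabilities if expected_value(p) > 0]

print(break_even_probability())  # 0.01
print(targets)                   # [0.02, 0.5]
```

Two competing models can then be compared by the total expected value of the customers each one would target, rather than by their accuracies.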

Several other evaluation metrics can be derived from the confusion matrix. Let’s abbreviate the number of true positives, false positives, true negatives and false negatives as TP, FP, TN and FN. We can describe various evaluation metrics using these counts. The true positive rate is defined as TP/(TP + FN) and the false negative rate as FN/(TP + FN).

The metrics precision and recall are often used. Recall is the same as the true positive rate, while precision is TP/(TP + FP), which is the accuracy over the cases predicted to be positive. The F-measure is the harmonic mean of precision and recall at a given point: F-measure = 2 × (precision × recall) / (precision + recall).
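These definitions translate directly into code. The sketch below is a plain-Python illustration; the function name `precision_recall_f` and the example counts are made up, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions dictate.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f_measure) from confusion-matrix counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # same as true positive rate
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f_measure = precision_recall_f(tp=8, fp=2, fn=8)
print(precision, recall, round(f_measure, 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F-measure by excelling at only one of precision or recall.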

An alternative to the expected value framework is to rank a set of cases by the scores a classifier assigns to them. A ROC graph is a 2-D plot of a classifier’s performance, with the false positive rate on the x-axis and the true positive rate on the y-axis. This way, a ROC graph depicts the relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives). The advantage is that ROC graphs are independent of the class proportions as well as of the costs and benefits. The AUC (area under the ROC curve) is an important summary statistic of the ROC graph, and it is useful when a single number is needed to summarize performance.
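As a rough sketch of how a ROC graph and its AUC can be built from ranked scores, the plain-Python code below lowers the decision threshold one distinct score at a time and integrates the resulting points with the trapezoidal rule. The function names, the labels and the scores are all invented for illustration, and the code assumes that both classes are present in the data.

```python
def roc_points(actual, scores):
    """Return ROC points (fpr, tpr), assuming labels are 1/0 and both classes occur."""
    positives = sum(actual)
    negatives = len(actual) - positives
    # Rank cases from the highest score to the lowest.
    ranked = sorted(zip(scores, actual), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels and classifier scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
print(points)
print(auc(points))
```

An AUC of 1.0 corresponds to a ranking that places every positive above every negative, while 0.5 is what a random ranking would be expected to achieve.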