One of the major tasks in machine learning and statistical testing is classification. In classification problems, we use a training set of labeled data to train our model to classify an unlabeled observation into one category or another. At the simplest level, this method uses observable data to make a related yes-or-no classification (like: will it rain today or not rain today). Classification problems can also have more than two classifications (like: will it be cloudy, sunny, rainy, snowy, etc.), but the principles for analyzing the results are largely the same. Many popular techniques exist for classification problems, such as logistic regression, trees (including boosted trees and random forests), and neural networks. To see how well a given method works, we can use a confusion matrix to understand the results of the model.
The Confusion Matrix
The picture below shows a typical layout of a two-class classification with the actual class on the top and our model’s predicted class on the side. In most cases, the matrix shows the number of cases in each of the four cells. Ideally all of our data would fall into the diagonal green cells, and none into the off-diagonal red cells, but models are never that accurate. Instead it is important to understand what each of the different types of errors mean to your problem using medical diagnosis as an example.
False Positives: Your model classified this individual as having the disease when in reality she was healthy. This is called a type I error and in statistical hypothesis testing is equivalent to rejecting a true null hypothesis.
False Negatives: Your model classified this individual as being healthy when in reality he had the disease. This is called type II error and is equivalent to failing to reject a false null hypothesis.
For me, this is the most confusing part of the confusion matrix. For every possible ratio of numbers, there is at least one fairly ambiguous name describing it. Here are some of the highlights where FP stands for false positive (and so on for TP, FN, TN).
There are many other statistics used to summarize the matrix, and you can pick the one that makes the most sense for your problem.
Let’s look at two examples to see how the tradeoffs between the two types of misclassification errors.
Hair Length: Imagine you ask somebody the length of their hair then try to determine whether they are female or not female. If their hair is neck-length or longer, you guess female and if its shorter, you guess not female. This is a pretty straightforward model that would work reasonably well. For a 1,000,000 people, your confusion matrix might look like this:
You miss women who wear their hair short (including very young children), and men who have their hair long. Importantly, the population is very balanced between females and not, and there is not an inherent asymmetry in the cost of classifying someone who is not female as female or a vice-versa. Therefore, we can set a middle-of-the-road cutoff like neck length to minimize the total number of incorrect predictions.
Medical Diagnosis: Let’s turn our attention back to medicine and imagine we are testing for a rare disease. You have a very good test (the false positive and false negative rates are both 0.1%), but for a very rare disease (also 0.1%). Your confusion matrix might look like this:
In these types of tests we have two key differences from the hair example: an unbalanced population and unbalanced costs of error. First, there are many more uninfected people than infected, and second, a false negative could lead to an untreated disease and very high costs, while a false positive could be fairly easily remedied, say, by running another test. In this case, we may want to accept a high false positive rate in exchange for minimizing the false negative rate to address these cost asymmetries.
We also have the curious result due to the unbalanced population that, if tested at random, a positive test result gives you a 50% probability of having the disease. But, that is a topic for another blog post about conditional probability.
Our statistics and probability tutors are doctoral candidates and PhDs. Our team also includes a small number of tutors, including MD and MBA candidates, who use statistics in the context of specialized fields. We help students master the fundamentals of statistics and probability: basic probability models, combinatorics (combinations and permutations), random variables, discrete and continuous probability distributions, statistical estimation and testing, confidence intervals, and linear regression. Whether you are encountering statistics for the first time, or you are looking for graduate level assistance in a specialized field, such as biostatistics or stochastic processes, we can help you.
Looking for more information on statistics? Check out some other helpful blog posts below!: