“I never understood that.”

One of the major tasks in machine learning and statistical testing is classification. In classification problems, we use a training set of labeled data to train a model that assigns an unlabeled observation to one category or another. At the simplest level, this method uses observable data to make a related yes-or-no classification (for example: will it rain today or not?). Classification problems can also have more than two classes (for example: will it be cloudy, sunny, rainy, or snowy?), but the principles for analyzing the results are largely the same. Many popular techniques exist for classification problems, such as logistic regression, trees (including boosted trees and random forests), and neural networks. To see how well a given method works, we can use a confusion matrix, which counts how often the model's predictions match or miss the true labels.
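As a rough illustration of the yes-or-no case, here is a minimal sketch of a confusion matrix in plain Python. The function name `confusion_matrix`, the labels "rain" and "dry", and the six toy observations are all made up for this example; they do not come from any real model or dataset.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive="rain"):
    """Count true/false positives and negatives for a two-class problem.

    Hypothetical helper for illustration: `positive` is the label we
    treat as the "yes" class.
    """
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # The model said "yes": right if the truth is also "yes".
            counts["TP" if a == positive else "FP"] += 1
        else:
            # The model said "no": wrong if the truth was "yes".
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

# Toy labels, invented for this sketch.
actual    = ["rain", "rain", "dry", "dry",  "rain", "dry"]
predicted = ["rain", "dry",  "dry", "rain", "rain", "dry"]

matrix = confusion_matrix(actual, predicted)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
print(matrix)
print(f"accuracy = {accuracy:.2f}")
```

The four counts are the cells of the matrix: correct "rain" calls, false alarms, missed rain, and correct "dry" calls. Accuracy is just one summary of them, and the same counts can feed other measures such as precision and recall.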

If you’ve ever taken a statistics course, you’ve experienced the strange, slightly opaque world of statistical jargon, where colloquial language has highly specific meanings that are easily abused. One of the most famous and most abused statistical terms is the “p-value.” In almost every field of science there’s an ongoing discussion over p-values and whether the common p-value threshold of 0.05 is even reasonable. So, what is a p-value, and why is 0.05 such a contentious number?

Statistics is fun, I promise! But before we can start having all the fun, it is important to describe the distribution of our data. We will need to handle problems differently depending on the distribution.

One of the most commonly identified challenges in statistics for psychology is differentiating between mediation and moderation. Fully understanding these concepts can seem overwhelming, but it doesn’t have to be that way! Any concept that seems tricky can be broken down into simple, comprehensible steps.