Introductory statistics: are my data normal?


Statistics is fun, I promise! But before the fun can start, we need to describe the distribution of our data, because the right way to handle a problem depends on that distribution.

A histogram is a graphical way to look at the distribution of our data: it groups the values into bins and draws a bar showing how many observations fall in each bin.

Let’s look at the one below. The x-axis represents our variable of interest, student GPA. The y-axis represents the frequency, that is, the number of students in each GPA range. We can read from the histogram that 30 people reported having a GPA between 2.5 and 3.0 in our data. This histogram suggests that GPA is approximately normally distributed: the bars are roughly symmetric around a single central peak and taper off toward both tails, giving the familiar bell shape. (Symmetry alone is not enough; a flat, uniform histogram is symmetric too.) Hooray!

[Figure: histogram of student GPA, roughly symmetric and bell-shaped]
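To make the idea of "frequency per bin" concrete, here is a minimal sketch of how a histogram's counts can be tallied. The GPA values are made up for illustration and are not the data behind the figure, and the half-open bins of width 0.5 (for example 2.5 up to, but not including, 3.0) are an assumption.

```python
import math
from collections import Counter

# Hypothetical GPA values, invented for illustration only.
gpas = [1.7, 2.1, 2.4, 2.6, 2.8, 2.9, 3.0, 3.1, 3.3, 3.6, 3.9]
BIN_WIDTH = 0.5  # assumed bin width

def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin [start, start + width) containing value."""
    return math.floor(value / width) * width

# Count how many observations land in each bin.
counts = Counter(bin_start(g) for g in gpas)

# Print a simple text histogram, one row per bin.
for start in sorted(counts):
    print(f"{start:.1f}-{start + BIN_WIDTH:.1f}: {'#' * counts[start]}")
```

Each row of `#` characters plays the role of one bar: the longer the row, the more students reported a GPA in that range.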

But what if, instead, our histogram looks like this one (below)? This is referred to as right-skewed or positively skewed because there is a longer tail to the right of the most frequently reported GPA (1.5-2.0). When the variable of interest has a normal distribution, the measures of central tendency (the mean, median, and mode) are approximately equal. In a right-skewed histogram, however, these measures no longer line up. In the example given here, the median moves a bit to the right of the mode and the mean shifts even further to the right; the exact opposite happens in a left-skewed histogram. The mean ends up further from the mode than the median does because the mean is pulled strongly by observations far out in the tail (outliers in our data), while the median is affected much less.

[Figure: histogram of student GPA, right-skewed, with the most frequent bin at 1.5-2.0]
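The ordering of mode, median, and mean in a right-skewed sample can be checked with a small example. The numbers below are invented for illustration (they are not GPAs from the figure); the sketch uses only Python's standard `statistics` module.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few sit far to the right.
values = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # middle of the sorted values
mean = statistics.mean(values)      # arithmetic average

print("mode:", mode, "median:", median, "mean:", mean)

# Drop the largest observation to see which measure reacts to the outlier.
trimmed = values[:-1]
print("median without 19:", statistics.median(trimmed))
print("mean without 19:", round(statistics.mean(trimmed), 2))
```

In this sample the mode is smallest, the median sits to its right, and the mean is furthest right. Removing the single largest value leaves the median unchanged but noticeably lowers the mean, which is the sensitivity to outliers described above.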

So far, we have looked at our data distribution visually, with histograms. We can also get an idea of normality from a statistic called the skewness, which most statistical software reports as part of its descriptive statistics. The skewness tells us the amount and the direction of the skew. If it is a negative number, we have a negative skew (or left skew), and if it is a positive number, we have a positive skew (or right skew, as illustrated in the histogram above). While this statistic can be a helpful guide when assessing the distribution of a variable, graphical visualization is always recommended. Why can we not just ignore the histogram and rely on the skewness statistic? Well, in order to declare a variable “non-normal” for the purposes of statistical research, we generally need the variable to be far from normal. In real data the skewness is rarely exactly zero, so on its own it can lead us to think a variable is non-normal when it is actually quite close to normal. We might then forgo statistical procedures that would have been perfectly fine to use on our data.
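For readers curious about what the software computes, here is a minimal sketch of one common definition of skewness: the third central moment divided by the second central moment raised to the power 1.5. This is an assumption about the formula; many packages apply a small-sample bias correction, so their numbers may differ slightly. The function name `skewness` and the sample lists are hypothetical.

```python
def skewness(data):
    """Moment-based skewness m3 / m2**1.5, with no bias correction."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
left_skewed = [-x for x in right_skewed]  # mirror image of the right-skewed sample

print("symmetric:", skewness(symmetric))
print("right-skewed:", round(skewness(right_skewed), 2))
print("left-skewed:", round(skewness(left_skewed), 2))
```

The sign gives the direction of the skew and the magnitude gives the amount: the symmetric sample scores zero, the long right tail gives a positive value, and its mirror image gives the same value with a negative sign.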

Now we know we can visualize our data with a histogram and look at the skewness statistic to check normality, but why is normality important? Unfortunately, when we have large deviations from normality, some of our handy parametric procedures, such as Pearson correlation tests, t-tests, and ANOVA, cannot be used in good conscience, because they assume (at least approximately) normal data.

What are we to do when this occurs? Many parametric tests that require (or prefer) the assumption of normality have non-parametric counterparts. For example, in place of a t-test comparing two independent groups, we can perform a Mann-Whitney U test when the data are non-normal. Another option is to transform the data so that it becomes more normally distributed and then apply standard parametric methods to the transformed data. For instance, when a strictly positive variable has a heavy concentration of values close to zero and a long right tail, we often take its logarithm to bring the distribution closer to normal. The following plots show a histogram before and after logarithmic transformation. Because the one on the right depicts a variable that is no longer obviously non-normal, it is reasonable to use methods that assume a normal distribution on the transformed variable. The tricky part about transformations comes with interpretation, but that is a topic for another day.

[Figure: histograms of a variable before (left) and after (right) logarithmic transformation]
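The effect of a logarithmic transformation can be sketched numerically as well. The values below are invented so that the result is easy to see (each one doubles the previous one); real data will rarely transform this cleanly. The `skewness` helper is a hypothetical, uncorrected moment-based version, and the natural logarithm requires every value to be strictly positive.

```python
import math

def skewness(data):
    """Moment-based skewness m3 / m2**1.5, with no bias correction."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical strictly positive values: bunched near the low end, long right tail.
raw = [1, 2, 4, 8, 16, 32, 64]

# Natural log of each value; math.log raises ValueError for zero or negative input.
logged = [math.log(x) for x in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

Before the transformation the skewness is clearly positive; afterwards the values are evenly spaced and the skewness is essentially zero. As noted above, any results computed on the transformed variable describe the logarithm of the original quantity, which is where interpretation becomes tricky.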

To sum up, it is always important to check what assumptions different statistical procedures require before using them. If one of the assumptions is normality, we must check our data before moving forward. After that, the fun can begin!

 
