Introductory statistics: are my data normal?


Statistics is fun, I promise! But before we can start having all the fun, it is important to describe the distribution of our data. We will need to handle problems differently depending on the distribution.

A histogram is just a graphical way to look at the distribution of our data.

Let’s look at the one below. The x-axis represents our variable of interest, student GPA. The y-axis represents the frequency, that is, the number of observations falling in each range of x values. We can read from the histogram that 30 people reported having a GPA between 2.5 and 3.0 in our data. This histogram suggests that the variable GPA is approximately normally distributed: it has a single peak in the middle and the two sides are roughly symmetrical, giving the familiar bell shape. Hooray!

[Figure: Stats 1, a roughly symmetric histogram of student GPA (x-axis: GPA; y-axis: frequency)]
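To make the idea of a histogram concrete, here is a minimal Python sketch that sorts values into bins of width 0.5 and counts how many fall into each one. The GPA values and the `histogram_counts` helper are made up for illustration; they are not the data behind the figure.

```python
from collections import Counter


def histogram_counts(values, bin_width=0.5, start=0.0):
    """Count how many values fall into each [low, low + bin_width) bin."""
    counts = Counter()
    for v in values:
        # Index of the bin that contains v
        index = int((v - start) // bin_width)
        counts[index] += 1
    return {
        (start + i * bin_width, start + (i + 1) * bin_width): counts[i]
        for i in sorted(counts)
    }


# Hypothetical GPA values, invented for this example
gpas = [1.8, 2.1, 2.4, 2.6, 2.7, 2.9, 3.1, 3.2, 3.6]

bins = histogram_counts(gpas)
for (low, high), n in bins.items():
    # One '#' per observation gives a rough text histogram
    print(f"{low:.1f}-{high:.1f}: {'#' * n} ({n})")
```

Each printed row plays the role of one bar: the longer the row, the more students reported a GPA in that range.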

But what if, instead, our histogram looks like this one (below)? This is referred to as right-skewed or positively skewed because there is a longer tail to the right of the most frequently reported GPA range (1.5-2.0). When the variable of interest has a normal distribution, the measures of central tendency (the mean, median, and mode) are approximately equal. In a right-skewed histogram, however, they no longer line up. In the example given here, the median sits a bit to the right of the mode and the mean sits even further to the right; the exact opposite happens in a left-skewed histogram. Because the mean is pulled further away from the mode than the median is, we can see that the mean is heavily influenced by observations far from the mode (outliers in our data), while the median is much less affected.

[Figure: Stats 2, a right-skewed histogram of student GPA]
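We can check the ordering of the mode, median, and mean on a small right-skewed sample. This is only a sketch using Python's built-in `statistics` module; the GPA values are hypothetical and chosen so that most of them pile up at the low end with a tail to the right.

```python
import statistics

# Hypothetical right-skewed GPA values: many low values, a few high ones
skewed_gpas = [1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0]

mode = statistics.mode(skewed_gpas)      # most frequent value
median = statistics.median(skewed_gpas)  # middle value of the sorted data
mean = statistics.mean(skewed_gpas)      # arithmetic average

print(f"mode = {mode}, median = {median}, mean = {mean}")
# In a right-skewed sample we expect mode < median < mean
print(mode < median < mean)
```

The few high GPAs in the tail drag the mean upward, while the median only moves as far as the middle of the sorted list.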

So far, we have visually looked at our data distribution with histograms. We can also get an idea of normality by looking at a statistic called the skewness, which most statistical software reports as part of its descriptive statistics. The skewness tells us the amount and the direction of the skew. If it is a negative number, we have a negative skew (or left skew), and if it is a positive number, we have a positive skew (or right skew, as illustrated in the histogram above).

While this statistic can be a helpful guide when determining the distribution of a variable, graphical visualization is always recommended. Why can we not just ignore the histogram and rely on the skewness statistic? Well, in order to declare a variable “non-normal” for the purposes of statistical research, we generally need the departure from normality to be substantial. The skewness statistic is rarely exactly zero in real data, which can lead us to think a variable is non-normal when it is actually quite close to normal. Based on the skewness statistic alone, we might choose to forgo certain statistical procedures that would be perfectly fine to use with our data.
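For readers curious about where the number comes from, here is a minimal sketch of one common definition of skewness: the third central moment divided by the second central moment raised to the power 3/2. The `skewness` function and the data are illustrative assumptions; statistical packages often apply a small-sample correction, so their output may differ slightly from this version.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples, invented for this example
symmetric = [2.0, 2.5, 3.0, 3.5, 4.0]
right_skewed = [1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0]
# Mirror the right-skewed sample to get a left-skewed one
left_skewed = [5.5 - x for x in right_skewed]

print("symmetric:   ", skewness(symmetric))
print("right-skewed:", skewness(right_skewed))
print("left-skewed: ", skewness(left_skewed))
```

The sign gives the direction of the skew and the size gives a rough sense of how strong it is, which is why the number is best read alongside the histogram.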

Now we know we can visualize our data with a histogram and look at the skewness statistic to check normality, but why is normality important? Unfortunately, when we have large deviations from normality, some of our handy parametric analytic procedures, such as correlation tests, t-tests, and ANOVA, cannot be used in good conscience.

What are we to do when this occurs? Many parametric tests that require (or prefer) the assumption of normality have non-parametric counterparts. For example, in place of a t-test comparing two independent groups, in the case of non-normal data, we can instead perform something called a Mann-Whitney test. Another option is to transform the data so that it becomes more normally distributed and apply standard parametric methods to the new data. For instance, when we have a positive-valued variable with a heavy concentration of values close to zero and a long right tail, we often take the logarithm of the variable to bring the distribution closer to normal. The following plots show a histogram before and after logarithmic transformation. Because the one on the right depicts a variable that is no longer obviously non-normal, it is reasonable to use methods that assume a normal distribution on the transformed variable. The tricky part about transformations comes with interpretation, but that is a topic for another day.

[Figure: histograms of a variable before (left) and after (right) logarithmic transformation]
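The effect of a logarithmic transformation can also be seen in numbers. The sketch below uses a made-up, strongly right-skewed sample and the same moment-based skewness definition as an illustrative helper (the `skewness` function is an assumption for this example, not a specific package's routine), then compares the skewness before and after taking the natural logarithm.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positive values: bunched near the low end, long right tail
raw = [1, 2, 4, 8, 16, 32, 64]

# Natural log of each value; only valid because every value is > 0
logged = [math.log(x) for x in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)

print(f"skewness before log: {skew_before:.3f}")
print(f"skewness after log:  {skew_after:.3f}")
```

This sample was chosen so that the logged values are evenly spaced, which is why the skew almost disappears; real data will rarely be this tidy. Note also that the logarithm is only defined for positive values, so a variable containing zeros or negative numbers needs a different approach.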

To sum up, it is always important to check what assumptions different statistical procedures require before using them. If one of the assumptions is normality, we must check our data before moving forward. After that, the fun can begin!

 
