Introductory statistics: are my data normal?


Statistics is fun, I promise! But before the fun can start, we need to describe the distribution of our data, because we will handle problems differently depending on that distribution.

A histogram is just a graphical way to look at the distribution of our data.

Let’s look at the one below. The x-axis represents our variable of interest, student GPA. The y-axis represents frequency: the number of observations that fall within each range of x values. We can read from the histogram that 30 people reported having a GPA between 2.5 and 3.0 in our data. This histogram suggests that the variable GPA is approximately normally distributed because it is bell-shaped: the values cluster around a single central peak and taper off symmetrically on both sides. Hooray!

[Figure "Stats 1": histogram of student GPA, roughly symmetric]
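To make the idea of "frequency per range" concrete, here is a minimal sketch of how histogram bars can be counted by hand. The simulated GPAs, the 0.5-wide bins, and the helper name `bin_counts` are assumptions made for illustration; they are not the data behind the figure.

```python
import random


def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width); the last bin
    also includes its right edge so the maximum value is not dropped.
    """
    upper = start + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if not (start <= v <= upper):
            continue  # ignore values outside the plotted range
        idx = min(int((v - start) // width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical data: 200 simulated GPAs, clipped to the 0.0-4.0 scale.
random.seed(0)
gpas = [min(4.0, max(0.0, random.gauss(2.75, 0.5))) for _ in range(200)]

counts = bin_counts(gpas, start=0.0, width=0.5, n_bins=8)

# Print a text histogram: one row per GPA range, one '#' per student.
for i, c in enumerate(counts):
    left = i * 0.5
    print(f"{left:.1f}-{left + 0.5:.1f} | {'#' * c} ({c})")
```

Each printed row plays the role of one bar: its length is the number of students whose GPA falls in that range.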

But what if, instead, our histogram looks like this one (below)? This is referred to as right-skewed or positively skewed because there is a longer tail to the right of the most frequently reported GPA (1.5-2.0). When the variable of interest has a normal distribution, the measures of central tendency (the mean, median, and mode) are approximately equal. However, in a right-skewed histogram, the central tendency measures no longer line up. In the example given here, the median moves a bit to the right (relative to the mode) and the mean shifts even further to the right. In a left-skewed histogram the opposite happens: the median and the mean shift to the left of the mode. Because the mean shifts further away from the mode than the median does, we see that the mean is heavily influenced by observations far from the mode (outliers in our data), while the median is not.

[Figure "Stats 2": histogram of student GPA, right-skewed]
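The ordering of mode, median, and mean in a right-skewed variable can be checked on a small example. The twelve GPAs below are made up for illustration and are not the values behind the figure.

```python
import statistics

# Hypothetical right-skewed GPAs: most values are low, a few are high.
gpas = [1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0]

mode = statistics.mode(gpas)
median = statistics.median(gpas)
mean = statistics.mean(gpas)
print("mode:", mode, "median:", median, "mean:", round(mean, 2))

# Add one more value in the right tail and compare the two measures again.
with_extra = gpas + [4.0]
print("median with extra value:", statistics.median(with_extra))
print("mean with extra value:", round(statistics.mean(with_extra), 2))
```

In this sample the mode sits lowest, the median is a little higher, and the mean is highest. Adding one more high value moves the mean again but leaves the median where it was, which is the sensitivity to the tail described above.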

So far, we have visually looked at our data distribution with histograms. We can also get an idea of normality by looking at a statistic called the skewness, which most statistical software reports as part of its descriptive statistics. The skewness tells us the amount and the direction of the skew. If it is a negative number, we have a negative skew (or left skew), and if it is a positive number, we have a positive skew (or right skew, as illustrated in the histogram above). A value near zero indicates a roughly symmetric distribution.

While this statistic can be a helpful guide when determining the distribution of a variable, graphical visualization is always recommended. Why can we not just ignore the histogram and rely on the skewness statistic? Well, in order to declare a variable “non-normal” for the purposes of statistical research, we generally need the variable to depart substantially from normality. The skewness statistic is rarely exactly zero in real data and can lead us to think a variable is non-normal when it is actually quite close to normal. Based on the skewness statistic alone, we might choose to forgo certain statistical procedures that would be perfectly fine to use with our data.
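As a rough sketch of what the skewness statistic measures, the function below uses the simple moment-based definition (the third central moment divided by the second central moment raised to the power 1.5). Statistical packages often apply small-sample corrections, so their reported values may differ slightly; the helper name `skewness` and the three toy samples are assumptions for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, where m_k is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (divides by n)
    m3 = sum((v - mean) ** 3 for v in values) / n  # signed measure of asymmetry
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 9]          # long tail to the right
left_skewed = [-v for v in right_skewed]      # mirror image: long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name}: {skewness(data):.3f}")
```

The sign gives the direction of the skew and the magnitude gives the amount, but a single number cannot show where the asymmetry comes from, which is one reason to look at the histogram as well.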

Now we know we can visualize our data with a histogram and look at the skewness statistic to check normality, but why is normality important? Unfortunately, when we have large deviations from normality, some of our handy parametric analytic procedures, such as correlation tests, t-tests, and ANOVA, cannot be used in good conscience.

What are we to do when this occurs? Many parametric tests that require (or prefer) the assumption of normality have non-parametric counterparts. For example, in place of a t-test comparing two independent groups, we can perform something called a Mann-Whitney test when the data are non-normal.

Another option is to transform the data so that it becomes more normally distributed and apply standard parametric methods to the transformed data. For instance, when we have a strictly positive variable with a heavy concentration of values close to zero and a long right tail, we often take the logarithm of the variable to bring the distribution closer to normal. The following plots show a histogram before and after logarithmic transformation. Because the one on the right depicts a variable that is no longer obviously non-normal, we can more comfortably use methods that require a normal distribution on the transformed variable. The tricky part about transformations comes with interpretation, but that is a topic for another day.

[Figure: histograms of a variable before (left) and after (right) logarithmic transformation]
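Both remedies can be sketched in a few lines. The first part simulates a right-skewed, strictly positive variable and compares its skewness before and after taking logarithms. The second part computes the U statistic that the Mann-Whitney test is built on, by counting how often a value from one group exceeds a value from the other. The simulated data, the two small groups, and the helper names `skewness` and `mann_whitney_u` are assumptions for illustration; a real analysis would normally rely on a statistics library (for example `scipy.stats.mannwhitneyu`) to turn U into a p-value.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def mann_whitney_u(a, b):
    """U statistic for sample a: number of pairs (x, y) with x > y, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Part 1: log transformation of a hypothetical right-skewed, positive variable.
random.seed(1)
raw = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]
logged = [math.log(v) for v in raw]  # only valid because every value is > 0

print("skewness before log:", round(skewness(raw), 2))
print("skewness after log: ", round(skewness(logged), 2))

# Part 2: U statistics for two small hypothetical groups.
group_a = [1.2, 1.9, 2.4]
group_b = [2.8, 3.1, 3.6]
u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print("U for group A:", u_a, "U for group B:", u_b)
```

The two U values always add up to the number of pairs (here 3 × 3 = 9). A U close to zero for one group means its values almost never exceed those of the other group, and because the statistic depends only on the ordering of the values, it does not require normality.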

To sum up, it is always important to check what assumptions different statistical procedures require before using them. If one of the assumptions is normality, we must check our data before moving forward. After that, the fun can begin!

 
