Tuesday, April 14, 2015

What Does it "Mean" to be Normal?

Focus Questions:

How do we determine what is "normal" for a given parameter?
What are some ways that we can compare the same data for two different samples?

Part 1: Interpreting Box Plots

The main focus for this part of the lesson is to look at how we can use a box plot to determine what is "normal" for a set of data.

As a reminder, here is the general structure of a box plot:



Remember that there are two main things to keep in mind:

  1. Each section of the box plot represents 25% of your sample (in other words, if your sample was 100 people, each section represents the data for 25 people).
  2. Each section represents the "range in data" for that 25% of the sample.  This is why the sections are different sizes: some sections of your sample have more "variability" (spread) than others.
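If you would like to see these two ideas with actual numbers, here is a small Python sketch. The heights are made up for illustration, and the quartiles are found with the "median of each half" method, which is an assumption on my part (other tools may use a slightly different rule and give slightly different quartiles).

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Quartiles use the median-of-halves method (an assumption, see above).
    Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]    # the smallest half of the values
    upper_half = data[-half:]   # the largest half of the values
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

# Hypothetical sample: heights (cm) of 12 students.
heights = [142, 145, 147, 150, 151, 153, 155, 156, 158, 160, 163, 171]

summary = five_number_summary(heights)
print(summary)

# Width of each of the four sections of the box plot.
widths = [summary[i + 1] - summary[i] for i in range(4)]
print(widths)
```

With these 12 made-up heights, each section holds 3 students (25% of the sample), but the last section is much wider than the others because the tallest students are more spread out.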
Speaking of Variability....

There are a couple of ways that you can determine the variability of your data.  Variability refers to how spread out your results are: the more similar the results, the lower the variability.  Generally speaking, the lower the variability, the more confident you can be that the results of your sample can be used to make predictions about your population.

So...strategies for determining variability:
  1. Calculate the Range and Interquartile Range (IQR) - if you are comparing two sets of data for the same parameter, the data set that has the LOWER range and IQR is probably more reliable.
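  2. Look at the Box Plot - data sets with less variability tend to have shorter box plots overall: shorter boxes (a smaller IQR) and a shorter distance from the end of one whisker to the end of the other (a smaller range).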
Look at the box plots below.  Which set of data do you think has the least variability?


Hopefully, you selected "Seattle".  Just by looking, the box plots seem about the same because they all have short whiskers.  But if I calculate the range and IQR, I can see that the weather in Seattle is less variable.
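Here is a rough sketch of what "calculate the range and IQR" could look like in Python. The two cities and their temperatures are invented for this example (they are not the data from the box plots in this post), and the quartiles again use the median-of-halves method, which is an assumption.

```python
from statistics import median

def spread(values):
    """Return (range, IQR), with quartiles found by the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # middle of the smallest half
    q3 = median(data[-half:])   # middle of the largest half
    return (data[-1] - data[0], q3 - q1)

# Hypothetical monthly high temperatures (degrees C) for two made-up cities.
city_a = [8, 9, 11, 14, 17, 20, 23, 23, 20, 15, 11, 8]
city_b = [-5, -2, 4, 12, 19, 25, 28, 27, 22, 14, 5, -3]

print("City A (range, IQR):", spread(city_a))
print("City B (range, IQR):", spread(city_b))
```

In this made-up example City A has the lower range (15 compared to 33) and the lower IQR (10 compared to 22.5), so City A is the less variable of the two.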

Part 2: Determining what is "Normal"

Statistically speaking, there are three different ways we can determine what is "normal".  These are tools that you have used in the past:
  1. Mean - add all of the numbers up and divide by the sample size (this is not shown on a box plot)
  2. Median - put all the numbers in increasing order and find the number in the middle (this is shown on a box plot)
  3. Mode - find the most common number (or categorical response)
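As a quick refresher, here is how the three measures could be calculated with Python's built-in statistics module. The shoe sizes and favorite colors are hypothetical data made up for this example.

```python
from statistics import mean, median, mode

# Hypothetical quantitative data: shoe sizes of 8 students.
shoe_sizes = [5, 6, 6, 7, 7, 7, 8, 9]

size_mean = mean(shoe_sizes)      # add everything up, divide by the sample size
size_median = median(shoe_sizes)  # middle value once the numbers are in order
size_mode = mode(shoe_sizes)      # most common value

print(size_mean, size_median, size_mode)

# Hypothetical categorical data: the mode still works, the mean and median do not.
favorite_colors = ["blue", "red", "blue", "green", "blue", "red"]
color_mode = mode(favorite_colors)
print(color_mode)
```

For the shoe sizes the mean is 6.875, the median is 7.0 and the mode is 7. For the colors, the only measure that makes sense is the mode, which is "blue".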
Since there are a variety of ways to decide what is "normal", we need to decide when we should be using each one.  Here is a little guide for your data analysis pleasure:

  • STEP 1: Decide what type of "parameter" you have:
    • if it is categorical, use the MODE
    • if it is quantitative, go to STEP 2
  • STEP 2: Make a box plot of your data and analyze the general shape of the box plot:
    • if it is generally symmetrical, use the MEAN
    • if it is generally "skewed", use the MEDIAN
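The two steps above can also be written as a tiny decision function. This is only a sketch: the function name and the labels "categorical", "quantitative", "symmetrical" and "skewed" are my own choices for illustration, and you still have to judge the shape of the box plot yourself.

```python
def measure_of_normal(parameter_type, shape=None):
    """Pick mean, median or mode by following STEP 1 and STEP 2 of the guide."""
    # STEP 1: what type of parameter is it?
    if parameter_type == "categorical":
        return "mode"
    if parameter_type != "quantitative":
        raise ValueError("parameter_type must be 'categorical' or 'quantitative'")

    # STEP 2: what is the general shape of the box plot?
    if shape == "symmetrical":
        return "mean"
    if shape == "skewed":
        return "median"
    raise ValueError("shape must be 'symmetrical' or 'skewed' for quantitative data")

print(measure_of_normal("categorical"))                  # favorite color
print(measure_of_normal("quantitative", "symmetrical"))  # a balanced box plot
print(measure_of_normal("quantitative", "skewed"))       # a lopsided box plot
```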
So...how do you know if your data is SKEWED?

Data is skewed (not symmetrical) when the values are much more spread out on one side of the middle than on the other.  A common reason for this is the presence of "outliers", which are shown as a * on the box plot.  Outliers stretch one side of the data, which changes the overall shape of the box plot.  So look for the following clues that your data is skewed:
  1. You have some outliers
  2. One whisker is much longer than the other
  3. The box is shifted far to the left or right.
Look at the example below to see what a "skewed" box plot might look like.
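If you want to check the clues with numbers instead of by eye, here is one possible sketch in Python. The homework times are made up. The rule used to flag outliers (anything more than 1.5 times the IQR beyond the box) is a common convention that I am assuming here; it is not something this lesson defines. Quartiles use the median-of-halves method, also an assumption.

```python
from statistics import mean, median

def skew_clues(values):
    """Return the outliers and the length of each whisker for a data set."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[-half:])
    iqr = q3 - q1

    # Assumed convention: outliers lie more than 1.5 * IQR beyond the box.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    inside = [x for x in data if low_fence <= x <= high_fence]

    # Whiskers reach to the most extreme values that are not outliers.
    return {
        "outliers": outliers,
        "left_whisker": q1 - inside[0],
        "right_whisker": inside[-1] - q3,
    }

# Hypothetical sample: minutes of homework per night for 12 students.
homework = [10, 15, 15, 20, 20, 25, 25, 30, 35, 40, 60, 120]

clues = skew_clues(homework)
print(clues)
print("mean:", mean(homework), "median:", median(homework))
```

In this made-up sample there is one outlier (120 minutes) and the right whisker (22.5) is three times as long as the left one (7.5), so the data is skewed. Notice that the mean (about 34.6) gets pulled up by the outlier while the median (25) does not, which is why the median is the better choice for skewed data.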


Part 3: Talking the Talk

Next class you will be analyzing the entire set of male and female data from our 7th grade census.  As you work through the data, try to answer the following questions as you go:

  • Are you analyzing the results of a census or a survey?
    • If it is a census, what is the population?
    • If it is a survey, what is the population?  What is the sample?  What sampling strategy did you use?  How do you know that it is representative (unbiased)?
  • What parameter are you currently analyzing?
  • Is it a quantitative or categorical parameter?  How do you know?
  • What did you decide was "normal" for this parameter?
    • Is your data skewed or symmetrical?  How do you know?
    • What measure of normal (mean, median, mode) should you use?
    • Why did you choose this one?
What you MUST do before next class:
  1. Read this blog post at least two times before next class.
  2. Complete the practice page handed out in class.
