One of the most important aspects of communicating information is clarity. But often, we aren’t clear or specific when we talk about averages.
You might find that many websites compare the average cost of living between different countries. Or you might be looking at the class average for your college course. What specifically are we talking about when we say average?
The mean is the simplest measure of central tendency. It is calculated by adding up every value in your dataset and dividing by the number of data points. However, the mean is sensitive to outliers: in a small dataset, a single extreme value can pull the average far away from the typical value.
Imagine five houses are for sale. Here are the asking prices: $250 000, $275 000, $300 000, $325 000, $1 500 000. One house is worth more than the rest of the houses combined, and it drags the mean up to $530 000, which is more than four of the five houses cost. In larger datasets, you can detect and exclude these outliers. But in this small dataset, we need another way to calculate the average.
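To see how far one outlier moves the mean, here is a minimal Python sketch using the five asking prices from the example. The variable names are only illustrative, and dropping the last house is just one simple way to show the effect, not a recommended outlier rule.

```python
from statistics import mean

# Asking prices from the example, in dollars
prices = [250_000, 275_000, 300_000, 325_000, 1_500_000]

# Mean of all five houses
mean_price = mean(prices)
print(f"Mean with the outlier:    ${mean_price:,.0f}")  # $530,000

# Mean of the four cheaper houses only (the list is already sorted)
mean_without_outlier = mean(prices[:-1])
print(f"Mean without the outlier: ${mean_without_outlier:,.0f}")  # $287,500
```

A single house nearly doubles the mean, even though four of the five prices sit between $250 000 and $325 000.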
Using the median gives us a much better idea of the average price. The median is the middle value once the dataset is sorted. Here, the median will be $300 000. If there were six numbers in this dataset, the median would be the midpoint between the third and fourth values in the sorted list.
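As a quick check, the median can be computed with Python’s standard library. The sixth price below ($350 000) is a made-up value, added only to illustrate the even-sized case.

```python
from statistics import median

# Asking prices from the example, in dollars
prices = [250_000, 275_000, 300_000, 325_000, 1_500_000]

# Odd number of values: the median is the middle one after sorting
median_price = median(prices)
print(median_price)  # 300000

# Even number of values: hypothetical sixth house, for illustration only
six_prices = prices + [350_000]
# Sorted: 250k, 275k, 300k, 325k, 350k, 1.5M -> midpoint of 300k and 325k
median_six = median(six_prices)
print(median_six)  # 312500.0
```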
But not all data is quantitative – what if you’re working with nominal or ordinal values? These values aren’t numbers, but they still convey information. Here, the best way to calculate the average is the mode. Simply put, it’s the most common value in your set.
You have 5 black sweaters, 3 brown sweaters, and 2 yellow sweaters. If you were to select one at random, a black sweater is the most likely pick. Black is the mode of this sweater set.
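The sweater example can be sketched in a few lines of Python. The list of colour names is simply one way to represent the wardrobe.

```python
from collections import Counter
from statistics import mode

# 5 black, 3 brown and 2 yellow sweaters
sweaters = ["black"] * 5 + ["brown"] * 3 + ["yellow"] * 2

# The mode is the most common value
most_common_colour = mode(sweaters)
print(most_common_colour)  # black

# Chance of picking a black sweater at random
counts = Counter(sweaters)
p_black = counts["black"] / len(sweaters)
print(p_black)  # 0.5
```

The mode works here even though the colours cannot be added up or sorted in any meaningful way.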
Finally, the range helps describe the spread of values in your dataset. For our housing prices dataset, our range would be the difference between the highest and the lowest value: $1 250 000.
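For completeness, the same calculation as a short Python sketch, using the asking prices from the housing example:

```python
# Asking prices from the example, in dollars
prices = [250_000, 275_000, 300_000, 325_000, 1_500_000]

# Range = highest value minus lowest value
price_range = max(prices) - min(prices)
print(f"${price_range:,}")  # $1,250,000
```

Because the range depends only on the two most extreme values, it is just as sensitive to outliers as the mean.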
How do you know that your dataset is reliable? Perhaps you might need to get more measurements or consider extraneous variables.
When you’re analyzing an experiment, standard deviation (SD) and variance can quickly tell you if something is off. Standard deviation measures, roughly, the typical distance of each value from the mean.
To calculate the sample standard deviation, s, first find the distance of each value from the mean. These distances are squared and added together. The sum is divided by N – 1, which is your sample size minus one. Finally, you take the square root of this value to get your SD. Population standard deviation is similar, except the denominator is simply N. Variance is another way to describe the spread. It is calculated by squaring the standard deviation.
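The steps above can be written out by hand and compared with Python’s built-in functions. This is a sketch that reuses the housing prices as sample data; the helper name `sample_sd` is made up for illustration.

```python
import math
from statistics import mean, pstdev, stdev, variance

# Asking prices from the housing example, in dollars
prices = [250_000, 275_000, 300_000, 325_000, 1_500_000]

def sample_sd(values):
    """Sample standard deviation, following the steps described in the text."""
    m = mean(values)
    squared_distances = [(x - m) ** 2 for x in values]   # square each distance
    return math.sqrt(sum(squared_distances) / (len(values) - 1))  # divide by N - 1

manual_sd = sample_sd(prices)
library_sd = stdev(prices)        # sample SD (denominator N - 1)
population_sd = pstdev(prices)    # population SD (denominator N)
sample_variance = variance(prices)  # sample variance = sample SD squared

print(f"Sample SD (by hand): {manual_sd:,.0f}")
print(f"Sample SD (library): {library_sd:,.0f}")
print(f"Population SD:       {population_sd:,.0f}")
```

The population SD comes out smaller than the sample SD because it divides by N rather than N – 1.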
Finally, you can describe the distance of any value from the mean using the Z-score, which is the value minus the mean, divided by the standard deviation. A Z-score of 1 means that your value is 1 standard deviation above the mean.
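As a rough illustration, here is a Z-score helper applied to the housing prices. The function name `z_score` and the choice of the sample SD (rather than the population SD) are assumptions made for this sketch.

```python
from statistics import mean, stdev

# Asking prices from the housing example, in dollars
prices = [250_000, 275_000, 300_000, 325_000, 1_500_000]

def z_score(value, values):
    """How many sample standard deviations `value` lies from the mean of `values`."""
    return (value - mean(values)) / stdev(values)

# The most expensive house sits well above the mean (about 1.79 SDs)
z_expensive = z_score(1_500_000, prices)
print(f"{z_expensive:.2f}")

# The median-priced house sits below the mean, so its Z-score is negative
z_middle = z_score(300_000, prices)
print(f"{z_middle:.2f}")
```

A negative Z-score simply means the value is below the mean.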
Many traits, variables, and features you might measure in an experiment might appear random at first. But as you keep collecting more and more data from the population, an interesting trend emerges. You map out the data into a probability distribution and it forms that all-too-familiar bell curve.
Yes, you can’t explain mathematics or statistics even if you’re a scientist. According to the Central Limit Theorem, averages calculated from large enough samples of independent, identically distributed random variables (with finite variance) will approximately follow a normal distribution. The y-axis represents how often a certain value of a variable or trait appears in the population.
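One way to get a feel for the Central Limit Theorem is a small simulation. The sketch below assumes a fair six-sided die, 30 rolls per sample and 2 000 samples. These numbers are arbitrary choices for illustration, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed, chosen arbitrarily

def sample_mean(n_rolls):
    """Average of n_rolls independent rolls of a fair six-sided die."""
    return fmean([rng.randint(1, 6) for _ in range(n_rolls)])

# Individual rolls are uniform (flat), not bell-shaped...
# ...but the averages of many rolls pile up around 3.5.
sample_means = [sample_mean(30) for _ in range(2_000)]

centre = fmean(sample_means)   # theory suggests about 3.5
spread = pstdev(sample_means)  # theory suggests about 1.71 / sqrt(30), roughly 0.31

print(f"Mean of the sample means: {centre:.2f}")
print(f"SD of the sample means:   {spread:.2f}")
```

Plotting `sample_means` as a histogram should show the bell shape, even though a single die roll has a flat distribution.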
That means when you measure many traits like height, weight, and even behavior, your data will often follow a similar pattern. Basically, data near the mean is most common, and it tapers off at either extreme. Take a look at the normal curve below: under the normal distribution, the mean (the center) is also equal to the median and the mode.
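Standard deviation also provides us with useful measures that apply to this probability distribution. In a normal distribution, about 68% of values fall within one standard deviation of the mean, about 95% fall within two, and about 99.7% fall within three.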
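The share of a normal distribution that lies within one, two and three standard deviations of the mean can be checked with `statistics.NormalDist` (Python 3.8 or later). The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of values lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} SD: {share_within(k):.1%}")
# Within 1 SD: 68.3%
# Within 2 SD: 95.4%
# Within 3 SD: 99.7%
```

The same proportions hold for any mean and standard deviation, because they are measured in units of SD.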
Not all traits will form a regular-looking normal distribution. If the standard deviation of a normal curve is smaller, the curve will be slimmer, as most of the values will be closer to the center and fewer at the tails. Meanwhile, a more widely distributed trait will have a higher standard deviation, with more values located far from the center. A related measure, kurtosis, describes how heavy the tails of a distribution are compared with a normal curve, and is useful for assessing how often extreme values of a trait might appear across a sample or population.
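Python’s standard library has no kurtosis function, so the sketch below defines one from the usual moment formula (fourth central moment divided by the squared variance, minus 3). The helper name, the sample sizes, the seeds and the heavy-tailed example (an exponential value with a random sign) are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean, pstdev

def excess_kurtosis(values):
    """Moment-based excess kurtosis: close to 0 for normally distributed data."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])  # variance (second central moment)
    m4 = fmean([(x - m) ** 4 for x in values])  # fourth central moment
    return m4 / m2 ** 2 - 3

# Two normal samples with different standard deviations
narrow = NormalDist(0, 1).samples(20_000, seed=1)
wide = NormalDist(0, 3).samples(20_000, seed=1)

# A heavier-tailed sample: exponential magnitudes with a random sign
rng = random.Random(1)
heavy = [rng.expovariate(1.0) * rng.choice([-1, 1]) for _ in range(20_000)]

k_narrow = excess_kurtosis(narrow)
k_wide = excess_kurtosis(wide)
k_heavy = excess_kurtosis(heavy)

print(f"narrow: SD {pstdev(narrow):.2f}, excess kurtosis {k_narrow:.2f}")
print(f"wide:   SD {pstdev(wide):.2f}, excess kurtosis {k_wide:.2f}")
print(f"heavy:  SD {pstdev(heavy):.2f}, excess kurtosis {k_heavy:.2f}")
```

The two normal samples differ in standard deviation but both have excess kurtosis near zero, while the heavy-tailed sample scores clearly higher. Width and tail heaviness are separate properties.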
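Skewness tells us about the location of the mode relative to the mean. The mode is going to be the peak of our distribution. We might see a distribution with a peak to the right of the mean, and a tail tapering off to the left, which is called a left-skewed (negatively skewed) distribution. Check out how this affects the mean, median and mode: the long tail typically pulls the mean below the median, which in turn sits below the mode.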
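A small made-up dataset shows the effect. The exam scores below are hypothetical, with a few low scores forming a tail on the left, and the `skewness` helper is a simple moment-based formula written for this sketch rather than a library function.

```python
from statistics import fmean, mean, median, mode

# Hypothetical exam scores with a long tail of low values (left-skewed)
scores = [35, 60, 75, 80, 85, 85, 90, 90, 90, 95]

def skewness(values):
    """Moment-based skewness: negative when the longer tail is on the left."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])  # variance
    m3 = fmean([(x - m) ** 3 for x in values])  # third central moment
    return m3 / m2 ** 1.5

score_mean = mean(scores)      # 78.5, pulled down by the low scores
score_median = median(scores)  # 85.0
score_mode = mode(scores)      # 90, the peak
score_skew = skewness(scores)  # negative for this dataset

print(score_mean, score_median, score_mode)
print(f"skewness: {score_skew:.2f}")
```

Here the mean is below the median, which is below the mode, matching the left-skewed pattern described above.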
The normal distribution provides us with important information about continuous traits within a sample or population. At a glance, you get a good idea of the distribution, standard deviation and central tendency of a particular trait.