This is a statistics blog by Clievins Selva.
Descriptive statistics is a branch of statistics that deals with the organization, summarization, and presentation of data. In this section, I'll discuss some key statistical indices that are commonly used to describe data:
The mean (or average) is the sum of all scores divided by the number of scores. Mathematically, for a dataset \(X\) with \(N\) values \(x_1, x_2, ..., x_N\), the mean \(\mu\) is given by:
\(\mu = \frac{1}{N}\sum_{i=1}^{N}x_i\).
The mean is a fundamental concept in statistics and is used in a wide range of applications, from calculating average grades in a class to determining average income in a country.
However, one of the limitations of the mean is its sensitivity to outliers. Extremely high or low values in the dataset can skew the mean, making it less representative of the "typical" value in the dataset. When reporting the mean, it's beneficial to also report the sample size and a measure of dispersion, typically the standard deviation (see below). This gives a better understanding of how spread out the data is around the mean. For example, "The mean age of the group is 30.5 years (n=10, SD=5.2 years)."
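To make the outlier problem concrete, here is a minimal sketch using Python's standard `statistics` module. The `ages` list is made up for illustration and is not the data behind the reporting example above; it contains one extreme value (90) so the effect is easy to see.

```python
from statistics import mean

# Hypothetical ages, invented for illustration (not data from this post).
ages = [21, 23, 24, 25, 26, 27, 28, 29, 29, 90]

# Mean = sum of all values divided by the number of values.
mean_all = mean(ages)

# The same data with the single extreme value (90) left out.
mean_without_outlier = mean(ages[:-1])

print(f"mean with outlier:    {mean_all:.1f}")
print(f"mean without outlier: {mean_without_outlier:.1f}")
```

In this made-up sample, a single value pulls the mean above every other observation in the list, which is exactly the sensitivity described above.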
The median is another measure of central tendency, representing the middle value in an ordered dataset. If the dataset has an odd number of observations, the median is the middle value. If the dataset has an even number of observations, the median is the average of the two middle values.
Unlike the mean, the median depends only on the order of the values, not on their magnitude, so a few extreme observations barely move it. This makes it a more robust measure of the "center" of the data for skewed distributions.
When reporting the median, it's often useful to also report the interquartile range (IQR), which gives an idea of the spread of the middle 50% of the data. For instance, "The median age of the group is 25.5 years (IQR 23-28 years)."
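A short sketch of the median and IQR, again with a made-up `ages` list. One assumption to flag: there are several conventions for computing quartiles, and the choice of `method="inclusive"` in `statistics.quantiles` is mine, so other software may report slightly different quartile values for the same data.

```python
from statistics import median, quantiles

# Hypothetical ages, invented for illustration (not data from this post).
ages = [21, 23, 24, 25, 26, 27, 28, 29, 29, 90]

# Even number of observations: the median is the average of the two middle values.
median_all = median(ages)

# Dropping the extreme value (90) barely changes the median.
median_without_outlier = median(ages[:-1])

# Quartiles: n=4 splits the data into four parts and returns the three cut points.
# The "inclusive" method is one convention among several (an assumption here).
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

print(f"median: {median_all} (without outlier: {median_without_outlier})")
print(f"IQR: {q1} to {q3} (width {iqr})")
```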
The mode is the value that appears most frequently in a dataset. A dataset may have one mode (unimodal), more than one mode (multimodal), or no mode at all (amodal). The mode can be useful for identifying the most common value in a dataset, but it may not be representative of the dataset as a whole, especially for continuous data.
When reporting the mode, it can be useful to also report the frequency of the mode, especially in a multimodal dataset. For instance, "The mode of the dataset is 29, which appears twice."
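Finding the mode and its frequency can be sketched as follows. The data are hypothetical; `statistics.multimode` returns every value tied for the highest count, which makes it convenient for multimodal data.

```python
from collections import Counter
from statistics import multimode

# Hypothetical ages, invented for illustration (not data from this post).
ages = [21, 23, 24, 25, 26, 27, 28, 29, 29, 90]

# multimode returns a list of all values that share the highest frequency.
modes = multimode(ages)

# Counter gives the frequency of each value, so the mode's count can be reported.
counts = Counter(ages)
for m in modes:
    print(f"mode: {m}, appears {counts[m]} times")

# When no value repeats, every value ties for "most frequent",
# so multimode returns all of them (the amodal case).
no_repeats = multimode([1, 2, 3])
print(no_repeats)
```

Note that when no value repeats, `multimode` returns the whole list rather than an empty one, so the "no mode" case has to be recognized by checking whether the top frequency is 1.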
The range is a measure of dispersion, representing the difference between the maximum and minimum values in a dataset. It is calculated as:
\(\text{Range} = \text{Max}(X) - \text{Min}(X)\)
The range provides a quick snapshot of the spread of the data, but it is sensitive to outliers and does not provide information about how the data is distributed around the center.
When reporting the range, it's often useful to also report the minimum and maximum values. For instance, "The range of ages in the group is 69 years, with a minimum age of 21 years and a maximum age of 90 years."
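The range needs nothing beyond the built-in `min` and `max`. The sketch below uses made-up ages and builds a sentence in the reporting style shown above.

```python
# Hypothetical ages, invented for illustration (not data from this post).
ages = [21, 23, 24, 25, 26, 27, 28, 29, 29, 90]

lowest = min(ages)
highest = max(ages)

# Range = maximum value minus minimum value.
age_range = highest - lowest

report = (
    f"The range of ages in the group is {age_range} years, "
    f"with a minimum age of {lowest} years and a maximum age of {highest} years."
)
print(report)
```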
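Variance is a measure of dispersion that indicates how far the values in a dataset are spread out from the mean. For a sample \(X\) with \(N\) values and mean \(\mu\), the sample variance \(s^2\) divides the sum of squared deviations by \(N-1\) rather than \(N\), which corrects the bias that arises when the mean is estimated from the same data. It is calculated as: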
\(s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2\)
The variance provides a measure of the "spread" of the data, but its units are the square of the original data units, which can be difficult to interpret. For example, if our dataset represents ages, the variance would be in "squared years," which doesn't have a clear real-world meaning.
When reporting the variance, it's important to specify that it's the variance and to give the units. For instance, "The variance of ages in the group is 123.45 squared years."
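Here is a minimal sketch that computes the variance step by step and compares it with the `statistics` module. The data are hypothetical. `statistics.variance` divides by \(N-1\) (the sample variance, as in the formula above), while `statistics.pvariance` divides by \(N\) (the population variance).

```python
from statistics import mean, pvariance, variance

# Hypothetical ages, invented for illustration (not data from this post).
ages = [21, 23, 24, 25, 26, 27, 28, 29, 29, 90]

n = len(ages)
m = mean(ages)

# Squared deviation of each value from the mean.
squared_deviations = [(x - m) ** 2 for x in ages]

# Sample variance: divide by N - 1.
sample_variance = sum(squared_deviations) / (n - 1)

# Population variance: divide by N.
population_variance = sum(squared_deviations) / n

print(f"sample variance (by hand):     {sample_variance:.2f} squared years")
print(f"sample variance (statistics):  {variance(ages):.2f} squared years")
print(f"population variance (by hand): {population_variance:.2f} squared years")
```

The two versions differ noticeably for small samples and converge as \(N\) grows, so it is worth stating which one is being reported.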
The standard deviation is the square root of the variance, providing a measure of dispersion in the same units as the original data. It is calculated as:
\(s = \sqrt{s^2} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \mu)^2}\)
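The standard deviation can be read as the typical distance of the data values from the mean. Strictly speaking, it is the square root of the average squared deviation, not the average absolute deviation, so large deviations weigh more heavily. A small standard deviation indicates that the data points tend to be close to the mean, while a large standard deviation indicates that the data points are spread out over a wider range.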
When reporting the standard deviation, it's important to give the units. For instance, "The standard deviation of ages in the group is 5 years."
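To close, a small sketch showing that the standard deviation is the square root of the sample variance, and how it can be combined with the mean and sample size in one reporting line. The ages are made up, so the numbers will not match the examples in the text.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical ages, invented for illustration (not data from this post).
ages = [21, 23, 24, 25, 26, 27, 28, 29, 29, 90]

# stdev uses the N - 1 denominator, matching the sample variance.
sd = stdev(ages)

# The same value obtained as the square root of the sample variance.
sd_from_variance = sqrt(variance(ages))

# A one-line summary with mean, sample size, and standard deviation.
summary = (
    f"The mean age of the group is {mean(ages):.1f} years "
    f"(n={len(ages)}, SD={sd:.1f} years)."
)
print(summary)
```

Because the made-up sample contains one extreme value, the standard deviation comes out large relative to the spread of the other nine ages, which is a reminder that it shares the mean's sensitivity to outliers.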