Measures of Location and Variability

Measures of Location

When reporting a sample mean, use one more digit of precision than the observations themselves carry.

Sample mean: \(\bar{x}\)
Population mean: \(\mu\)
Sample median: \(\tilde{x}\)
Population median: \(\tilde{\mu}\)

The mean is very sensitive to outliers, while the median is almost entirely insensitive to them. The trimmed mean is a compromise between the two. A 10% trimmed mean drops the smallest 10% and the largest 10% of the observations and then takes the mean of what remains. But what if you want a 10% trimmed mean and your sample size is 22? 10% of 22 is 2.2, which is not a whole number. So calculate the trimmed mean with 2 observations removed from each end, calculate it again with 3 removed from each end, and interpolate linearly between the two values (here, 0.2 of the way from the first to the second).

A trimming percentage of 10-20% is typical.
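The interpolation rule above can be sketched in Python. This is a minimal sketch, not a reference implementation: the function names `mean_after_dropping` and `trimmed_mean` and the sample data are made up for illustration, and the trimming proportion is assumed to be given as a fraction (0.10 for 10%).

```python
def mean_after_dropping(sorted_xs, k):
    """Mean of a sorted list after removing k values from each end."""
    kept = sorted_xs[k:len(sorted_xs) - k]
    if not kept:
        raise ValueError("trimming removes every observation")
    return sum(kept) / len(kept)


def trimmed_mean(data, proportion):
    """Trimmed mean with linear interpolation for non-integer trim counts."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must satisfy 0 <= proportion < 0.5")
    xs = sorted(data)
    exact = len(xs) * proportion      # e.g. 22 * 0.10 = 2.2
    lower = int(exact)                # whole number of values to drop: 2
    weight = exact - lower            # fractional part: about 0.2
    mean_lower = mean_after_dropping(xs, lower)
    if weight == 0:
        return mean_lower
    mean_upper = mean_after_dropping(xs, lower + 1)
    # Move `weight` of the way from the 2-dropped mean to the 3-dropped mean.
    return (1 - weight) * mean_lower + weight * mean_upper


# Made-up sample of size 22 (the squares 1, 4, ..., 484).
sample = [x * x for x in range(1, 23)]
print(trimmed_mean(sample, 0.10))
```

Because the trim count is a float, the weight is only approximately 0.2 here; for notes-level work that rounding error is negligible.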

Quartiles and percentiles are generalizations of the median.

The mode of a sample is the value that appears the most often.

Measures of Variability

The range of a data set is simply the difference between the largest and the smallest values.

The sample variance is given by:

\begin{equation*} s^{2}=\frac{\sum\left(x_{i}-\bar{x}\right)^{2}}{n-1}=\frac{S_{xx}}{n-1} \end{equation*}

The population variance is given by:

\begin{equation*} \sigma^{2}=\frac{\sum\left(x_{i}-\mu\right)^{2}}{N} \end{equation*}

Why use \(n-1\) for the sample variance and not \(n\)? The reason is that, in general, \(\bar{x}\ne\mu\). The sum of squared deviations \(\sum\left(x_{i}-c\right)^{2}\) is minimized when the reference point is \(c=\bar{x}\), so the squared deviations measured from \(\bar{x}\) never total more than those measured from \(\mu\). Hence we would tend to underestimate the variance if we divided by \(n\). Dividing by \(n-1\) compensates for this. With the divisor \(n\), \(s^{2}\) would be a biased estimator of \(\sigma^{2}\) (to be explained elsewhere); with the divisor \(n-1\) it is unbiased.
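A tiny exhaustive check makes the bias concrete. The sketch below assumes a made-up population {1, 2, 3} and samples of size 2 drawn with replacement, so that every possible sample can be listed and averaged exactly; the helper name `sum_sq_dev` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean (S_xx), computed exactly."""
    n = len(sample)
    xbar = Fraction(sum(sample), n)
    return sum((x - xbar) ** 2 for x in sample)


population = [1, 2, 3]                      # made-up population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2
samples = list(product(population, repeat=n))         # all 9 ordered samples

# Average each estimator over every equally likely sample.
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, avg_div_n_minus_1, avg_div_n)           # 2/3 2/3 1/3
```

On average the \(n-1\) divisor recovers \(\sigma^{2}\) exactly, while the \(n\) divisor comes out too small.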

We say \(s^{2}\) has \(n-1\) degrees of freedom (df): the deviations satisfy the constraint \(\sum\left(x_{i}-\bar{x}\right)=0\), so only \(n-1\) of them are free to vary.

Note that:

\begin{equation*} S_{xx}=\sum{x_{i}^{2}}-\frac{1}{n}\left(\sum{x_{i}}\right)^{2} \end{equation*}
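
This shortcut follows from expanding the square and using \(\sum x_{i}=n\bar{x}\):

\begin{equation*} S_{xx}=\sum\left(x_{i}-\bar{x}\right)^{2}=\sum x_{i}^{2}-2\bar{x}\sum x_{i}+n\bar{x}^{2}=\sum x_{i}^{2}-n\bar{x}^{2}=\sum x_{i}^{2}-\frac{1}{n}\left(\sum x_{i}\right)^{2} \end{equation*}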

The variance has two useful properties:

It is invariant to translations: adding a constant to every value leaves \(s^{2}\) unchanged.
If you scale all values by \(c\), then the new \(s^{2}\) is the old one scaled by \(c^{2}\).
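
Both properties are easy to confirm numerically. The sketch below uses a made-up integer data set and a hypothetical helper `sample_variance`; exact fractions are used so that the comparisons are not affected by floating-point rounding.

```python
from fractions import Fraction


def sample_variance(xs):
    """Sample variance s^2 = S_xx / (n - 1), exact for integer data."""
    n = len(xs)
    xbar = Fraction(sum(xs), n)
    s_xx = sum((x - xbar) ** 2 for x in xs)
    return s_xx / (n - 1)


data = [3, 7, 8, 12, 15]                 # made-up sample
shifted = [x + 100 for x in data]        # translate every value by 100
scaled = [4 * x for x in data]           # scale every value by c = 4

print(sample_variance(data))             # 43/2
print(sample_variance(shifted))          # 43/2, unchanged by the shift
print(sample_variance(scaled))           # 344, which is 4**2 * 43/2
```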

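The lower fourth is the median of the smallest half of the data, and the upper fourth is the median of the largest half. (When \(n\) is odd, a common convention is to include the median in both halves.) The fourth spread \(f_{s}\) is the upper fourth minus the lower fourth. This measure of spread is relatively unaffected by outliers.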

An outlier is any observation farther than \(1.5f_{s}\) from the closest fourth. It is an extreme outlier if it is more than \(3f_{s}\) from the closest fourth, and a mild outlier otherwise.
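
The fourths and the outlier rule can be sketched as follows. The function names `fourths` and `classify_outliers` and the data are made up for illustration, and the code assumes the convention that the median belongs to both halves when the sample size is odd.

```python
from statistics import median


def fourths(data):
    """Return (lower fourth, upper fourth).

    Assumption: for odd n, the median is included in both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2              # size of each half
    return median(xs[:half]), median(xs[n - half:])


def classify_outliers(data):
    """Split the outliers in data into (mild, extreme) lists."""
    lower, upper = fourths(data)
    fs = upper - lower               # fourth spread
    mild, extreme = [], []
    for x in data:
        # Distance to the closest fourth; 0 if x lies between the fourths.
        distance = max(lower - x, x - upper, 0)
        if distance > 3 * fs:
            extreme.append(x)
        elif distance > 1.5 * fs:
            mild.append(x)
    return mild, extreme


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 60]   # made-up data
print(fourths(sample))               # (3.5, 9.5)
print(classify_outliers(sample))     # ([25], [60])
```

Here \(f_{s}=6\), so 25 (15.5 above the upper fourth) is a mild outlier and 60 (50.5 above it) is an extreme one.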

midrange: \((x_{min}+x_{max})/2\)

midfourth: the average of the two fourths

Exponential smoothing: A way to smooth a time series that has a lot of fluctuations (like my CPU temperature data). Pick a smoothing constant \(0<\alpha<1\), and let \(\bar{x}_{t}\) be the smoothed value at time \(t\). Set \(\bar{x}_{1}=x_{1}\) and, for \(t\ge 2\), \(\bar{x}_{t}=\alpha x_{t}+(1-\alpha)\bar{x}_{t-1}\). A smaller \(\alpha\) gives heavier smoothing.
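
The recurrence translates directly into a short loop. This is a minimal sketch; the function name `exponential_smoothing` and the input values are made up for illustration.

```python
def exponential_smoothing(xs, alpha):
    """Return the smoothed series for observations xs and constant alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    smoothed = []
    for x in xs:
        if not smoothed:
            smoothed.append(x)       # first smoothed value is x_1 itself
        else:
            # New value: weight alpha on the observation, 1 - alpha on history.
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


print(exponential_smoothing([10, 20, 30], 0.5))   # [10, 15.0, 22.5]
```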