Measures of Location
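When reporting a sample mean, use one more digit of accuracy than the observations themselves have (for example, report two decimal places if the data were recorded to one).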
- Sample mean: \(\bar{x}\)
- Population mean: \(\mu\)
- Sample median: \(\tilde{x}\)
- Population median: \(\tilde{\mu}\)
The mean is very sensitive to outliers, while the median is largely insensitive to them. The trimmed mean is a compromise between the two: for a 10% trimmed mean, drop the smallest 10% and the largest 10% of the observations, then take the mean of what remains. But what if you want a 10% trimmed mean and your sample size is 22? 10% of 22 is 2.2, which is not a whole number. In that case, compute the trimmed mean with 2 observations removed from each end, compute it again with 3 removed from each end, and interpolate linearly between the two. Since 2.2 lies 20% of the way from 2 to 3, the result is 0.8 times the first value plus 0.2 times the second.
You will normally use a trimming percentage of 10-20%.
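The interpolation rule above can be sketched in a few lines of Python. This is only an illustration: the function name `trimmed_mean` and the sample data are made up for the example, and the trimming percentage is assumed to be given as a number of percent (10 means 10%).

```python
import math

def trimmed_mean(data, percent):
    """Trimmed mean that drops `percent` percent of the values from each end.

    When the number of values to drop is not a whole number, interpolate
    linearly between the two neighbouring whole-number trimmed means.
    """
    xs = sorted(data)
    n = len(xs)
    k = n * percent / 100            # values to drop per end, possibly fractional
    lo, hi = math.floor(k), math.ceil(k)
    if n - 2 * hi <= 0:
        raise ValueError("trimming would remove every observation")

    def mean_dropping(j):
        kept = xs[j:n - j]           # drop j smallest and j largest values
        return sum(kept) / len(kept)

    if lo == hi:                     # whole number of values to drop
        return mean_dropping(lo)
    frac = k - lo                    # how far k lies between lo and hi
    return (1 - frac) * mean_dropping(lo) + frac * mean_dropping(hi)

# Hypothetical sample of size 10; 15% trimming means dropping 1.5 values
# per end, so the result is halfway between dropping 1 and dropping 2.
sample = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(trimmed_mean(sample, 15))
```

For the sample-size-22 case in the text, `trimmed_mean(values, 10)` would weight the "drop 2" mean by 0.8 and the "drop 3" mean by 0.2.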
Quartiles and percentiles are generalizations of the median.
The mode of a sample is the value that appears the most often.
Measures of Variability
The range of a data set is simply the difference between the largest and the smallest values.
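The sample variance is given by:
\[s^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}\]
The population variance, for a population of size \(N\), is given by:
\[\sigma^{2}=\frac{\sum_{i=1}^{N}(x_{i}-\mu)^{2}}{N}\]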
Why divide by \(n-1\) for the sample variance and not \(n\)? In general \(\bar{x}\ne\mu\), and the sum of squared deviations \(\sum(x_{i}-c)^{2}\) is smallest when \(c=\bar{x}\). The squared deviations about \(\bar{x}\) therefore add up to no more than the squared deviations about \(\mu\), so dividing by \(n\) would tend to underestimate the variance. Dividing by \(n-1\) compensates for this: it makes \(s^{2}\) an unbiased estimator of the population variance, whereas dividing by \(n\) gives a biased one (to be explained elsewhere).
We say \(s^{2}\) has \(n-1\) degrees of freedom (df), because the deviations satisfy one constraint, \(\sum(x_{i}-\bar{x})=0\): once \(n-1\) of them are known, the last one is determined.
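A small numerical check of these points, as a sketch only: the data are made up, and `sum_sq_about` and `sample_variance` are hypothetical helper names. The standard-library functions `statistics.variance` (divides by \(n-1\)) and `statistics.pvariance` (divides by \(n\)) are used for comparison.

```python
import statistics

def sum_sq_about(xs, c):
    """Sum of squared deviations of xs about the reference point c."""
    return sum((x - c) ** 2 for x in xs)

def sample_variance(xs):
    """Sample variance: squared deviations about the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum_sq_about(xs, xbar) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample whose mean is 5
xbar = sum(data) / len(data)

# The deviations from the sample mean sum to zero (the single constraint).
print(sum(x - xbar for x in data))

# A reference point other than the mean gives a larger sum of squares.
print(sum_sq_about(data, xbar), sum_sq_about(data, xbar + 0.5))

# Dividing by n - 1 agrees with statistics.variance; dividing by n
# agrees with statistics.pvariance and gives a smaller number.
print(sample_variance(data), statistics.variance(data))
print(sum_sq_about(data, xbar) / len(data), statistics.pvariance(data))
```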
Note that:
\[\sum(x_{i}-\bar{x})^{2}=\sum x_{i}^{2}-\frac{\left(\sum x_{i}\right)^{2}}{n}\]
The variance is:
- Invariant to translations in the data.
- Scaled by \(c^{2}\) when every value is multiplied by \(c\).
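Both properties are easy to check numerically. The sketch below uses made-up data, an arbitrary shift of 10, and an arbitrary scale factor `c = 3`; `math.isclose` is used because floating-point results may differ in the last digits.

```python
import math
import statistics

data = [3.1, 4.7, 5.0, 6.2, 9.8]   # made-up sample
c = 3                              # arbitrary scale factor

shifted = [x + 10 for x in data]   # translate every value by 10
scaled = [c * x for x in data]     # multiply every value by c

s2 = statistics.variance(data)

# Translation leaves the sample variance unchanged.
print(math.isclose(statistics.variance(shifted), s2))

# Scaling by c multiplies the sample variance by c squared.
print(math.isclose(statistics.variance(scaled), c ** 2 * s2))
```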
The lower fourth is the median of the smallest half of the data, and the upper fourth is the median of the largest half. The fourth spread \(f_{s}\) is the upper fourth minus the lower fourth. This measure of spread is relatively unaffected by outliers.
An outlier is any observation farther than \(1.5f_{s}\) from the closest fourth. An outlier is extreme if it is more than \(3f_{s}\) from the closest fourth, and mild otherwise.
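The definitions of the fourths and of mild and extreme outliers can be sketched as follows. The function names and data are made up, and one convention is an assumption not stated above: when the sample size is odd, the median is counted in both halves.

```python
import statistics

def fourths(data):
    """Return (lower fourth, upper fourth) of the data."""
    xs = sorted(data)
    n = len(xs)
    # Assumed convention: for odd n the median belongs to both halves.
    half = (n + 1) // 2
    lower = statistics.median(xs[:half])
    upper = statistics.median(xs[n - half:])
    return lower, upper

def classify_outliers(data):
    """Split the outliers in data into mild and extreme ones."""
    lower, upper = fourths(data)
    fs = upper - lower                       # fourth spread
    result = {"mild": [], "extreme": []}
    for x in data:
        # Distance from the closest fourth; 0 if x lies between the fourths.
        dist = max(lower - x, x - upper, 0)
        if dist > 3 * fs:
            result["extreme"].append(x)
        elif dist > 1.5 * fs:
            result["mild"].append(x)
    return result

values = [1, 2, 3, 4, 5, 6, 7, 8, 20, 40]    # made-up sample
print(fourths(values))
print(classify_outliers(values))
```

Here the fourths are 3 and 8, so \(f_{s}=5\); the value 20 is 12 above the upper fourth (mild) and 40 is 32 above it (extreme).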
Midrange: \((x_{min}+x_{max})/2\)
Midfourth: The average of the lower and upper fourths
Exponential smoothing: A way to smooth a time series that has a lot of fluctuations (like my CPU temperature data). Pick \(0<\alpha<1\) and let \(\bar{x}_{t}\) be the smoothed value at time \(t\). Set \(\bar{x}_{1}=x_{1}\) and, for \(t\ge 2\), \(\bar{x}_{t}=\alpha x_{t}+(1-\alpha)\bar{x}_{t-1}\). A smaller \(\alpha\) gives heavier smoothing.
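The recurrence translates directly into a loop. This is a minimal sketch: the function name `exponential_smoothing` and the readings are made up for illustration.

```python
def exponential_smoothing(xs, alpha):
    """Return the exponentially smoothed version of the series xs."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    smoothed = []
    for x in xs:
        if not smoothed:
            # First smoothed value is the first observation.
            smoothed.append(x)
        else:
            # Blend the new observation with the previous smoothed value.
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical noisy readings.
readings = [10, 20, 30]
print(exponential_smoothing(readings, 0.5))
```

With \(\alpha=0.5\) the smoothed values are 10, then \(0.5\cdot 20+0.5\cdot 10=15\), then \(0.5\cdot 30+0.5\cdot 15=22.5\).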