Some General Concepts of Point Estimation

Posted by Beetle B. on Sat 08 July 2017

\(\newcommand{\Cov}{\mathrm{Cov}}\) \(\newcommand{\Corr}{\mathrm{Corr}}\)

Notation: \(\hat{\mu}=\bar{X}\) means the point estimator of \(\mu\) is the sample mean \(\bar{X}\).

Note that the estimator is itself a random variable.

Unbiased Estimators

A point estimator \(\hat{\theta}\) is said to be an unbiased estimator of \(\theta\) if \(E(\hat{\theta})=\theta\) for every possible value of \(\theta\). If \(\hat{\theta}\) is biased, the difference \(E(\hat{\theta})-\theta\) is called the bias of \(\hat{\theta}\).

For a binomial random variable with parameters \(n\) and \(p\), \(\hat{p}=X/n\) is an unbiased estimator of \(p\).
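
A quick simulation check (a minimal sketch with NumPy; the values of \(n\), \(p\), and the number of replications are arbitrary):

```python
import numpy as np

# Check that the average of p_hat = X/n over many experiments is close to p.
rng = np.random.default_rng(0)
n, p, reps = 50, 0.3, 100_000

x = rng.binomial(n, p, size=reps)   # one binomial X per simulated experiment
p_hat = x / n

print(p_hat.mean())                 # ~0.3, consistent with E(p_hat) = p
```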

Suppose you have a uniform distribution on \([0,\theta]\), where \(\theta\) is unknown, and you estimate \(\theta\) from a sample with \(\max\{X_{1},\dots,X_{n}\}\). On average you will underestimate it:

\begin{equation*} E\left(\hat{\theta}\right)=\frac{n}{n+1}\theta \end{equation*}

To get an unbiased estimate, use \(\frac{n+1}{n}\max\{X_{1},\dots,X_{n}\}\). Its variance is \(\frac{\theta^{2}}{n(n+2)}\).
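
A simulation sketch of both claims (the values of \(\theta\) and \(n\) are arbitrary):

```python
import numpy as np

# Uniform(0, theta): the sample maximum undershoots theta on average,
# while (n+1)/n * max is unbiased with variance theta^2 / (n*(n+2)).
rng = np.random.default_rng(1)
theta, n, reps = 10.0, 5, 200_000

raw_max = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
corrected = (n + 1) / n * raw_max

print(raw_max.mean())      # ~ n/(n+1) * theta = 8.33
print(corrected.mean())    # ~ theta = 10
print(corrected.var())     # ~ theta^2 / (n*(n+2)) = 2.86
```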

For variance, the following estimator is unbiased:

\begin{equation*} S^{2}=\frac{1}{n-1}\sum\left(X_{i}-\bar{X}\right)^{2} \end{equation*}

To prove this, note that \(E[Y^{2}]=V(Y)+[E(Y)]^{2}\) and

\begin{equation*} S^{2}=\frac{1}{n-1}\left[\sum X_{i}^{2}-\frac{\left(\sum X_{i}\right)^{2}}{n}\right] \end{equation*}

Use the above two to calculate \(E(S^{2})\).
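
Filling in that calculation (each \(X_{i}\) has mean \(\mu\) and variance \(\sigma^{2}\), and the \(X_{i}\) are independent):

\begin{equation*} E\left[\sum X_{i}^{2}\right]=n\left(\sigma^{2}+\mu^{2}\right),\qquad E\left[\left(\sum X_{i}\right)^{2}\right]=V\left(\sum X_{i}\right)+\left[E\left(\sum X_{i}\right)\right]^{2}=n\sigma^{2}+n^{2}\mu^{2} \end{equation*}

\begin{equation*} E(S^{2})=\frac{1}{n-1}\left[n\left(\sigma^{2}+\mu^{2}\right)-\frac{n\sigma^{2}+n^{2}\mu^{2}}{n}\right]=\frac{(n-1)\sigma^{2}}{n-1}=\sigma^{2} \end{equation*}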

If you have a sample from a distribution with mean \(\mu\), then \(\bar{X}\) is an unbiased estimator of \(\mu\). If the distribution is also continuous and symmetric, then so are \(\tilde{X}\) and any trimmed mean. So there is no unique unbiased estimator; what differentiates them is that their variances may differ!

We often seek an MVUE (minimum variance unbiased estimator). For a normal distribution, the MVUE for \(\mu\) is \(\hat{\mu}=\bar{X}\).

Occasionally, we do pick a biased estimator if its variance is much lower than that of the unbiased one.

If the pdf is symmetric about \(\mu\), then \(\tilde{X}\) is an unbiased estimator of \(\mu\). If \(n\) is large, it can be shown that \(V(\tilde{X})\approx\frac{1}{4n[f(\mu)]^{2}}\), where \(f\) is the pdf.
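
A simulation sketch for the standard normal, where \(f(\mu)=1/\sqrt{2\pi}\) and the approximation becomes \(V(\tilde{X})\approx\pi/(2n)\) (the value of \(n\) is arbitrary):

```python
import numpy as np

# Variance of the sample median for N(0,1) vs. the large-n approximation
# 1 / (4 n f(mu)^2) = pi / (2n).
rng = np.random.default_rng(2)
n, reps = 100, 50_000

medians = np.median(rng.standard_normal((reps, n)), axis=1)

print(medians.var())      # simulated variance of the median
print(np.pi / (2 * n))    # approximation: ~0.0157
```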

Example of Preferring a Biased Estimator

Suppose we know the distribution is Poisson with unknown parameter \(\lambda\), and we want to estimate \(P(X=0)^{2}=e^{-2\lambda}\) from a sample of size 1. If \(\lambda\) is the rate of events per minute and \(X\) counts events in one minute, this is the probability of seeing no events in the next two minutes.

Let \(\delta(X)\) be an unbiased estimator. We know that:

\begin{equation*} E(\delta(X))=\sum\delta(x)\frac{\lambda^{x}e^{-\lambda}}{x!}=e^{-2\lambda} \end{equation*}

The only function that satisfies this is \(\delta(x)=(-1)^{x}\): multiplying both sides by \(e^{\lambda}\) gives \(\sum\delta(x)\lambda^{x}/x!=e^{-\lambda}=\sum(-\lambda)^{x}/x!\), and matching the power series term by term forces \(\delta(x)=(-1)^{x}\).

This means we note down how many events occurred in one minute and plug that count into \(\delta(x)=(-1)^{x}\).

This estimator only ever returns \(1\) or \(-1\), which are absurd values for a probability (and negative whenever \(X\) is odd). Clearly, the unbiased estimator is not appropriate here.
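
A sketch contrasting the unbiased estimator \((-1)^{X}\) with the biased plug-in estimator \(e^{-2X}\) (the rate \(\lambda\) is an arbitrary choice):

```python
import numpy as np

# Target: exp(-2*lambda). The unbiased estimator (-1)^X only ever returns
# +1 or -1; the biased plug-in exp(-2*X) at least stays in (0, 1].
rng = np.random.default_rng(3)
lam, reps = 1.5, 200_000
target = np.exp(-2 * lam)                 # ~0.050

x = rng.poisson(lam, size=reps)
unbiased = (-1.0) ** x
plug_in = np.exp(-2.0 * x)

print(target)
print(unbiased.mean(), unbiased.std())    # mean ~ target, but huge spread
print(plug_in.mean(), plug_in.std())      # mean is off, values are sensible
```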

Transformations on Estimators

Note that although \(S^{2}\) is unbiased for \(\sigma^{2}\), \(S\) is biased for \(\sigma\) (although the bias is small).

If you use \(\bar{X}\) to estimate \(\mu\), then \(\bar{X}^{2}\) is a biased estimator of \(\mu^{2}\): \(E(\bar{X}^{2})=\mu^{2}+\frac{\sigma^{2}}{n}\).
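
A simulation sketch of that bias (normal data; \(\mu\), \(\sigma\), and \(n\) are arbitrary):

```python
import numpy as np

# E(Xbar^2) = mu^2 + sigma^2/n, so Xbar^2 overestimates mu^2 on average.
rng = np.random.default_rng(4)
mu, sigma, n, reps = 2.0, 3.0, 10, 200_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print((xbar ** 2).mean())   # ~ mu^2 + sigma^2/n = 4.9
print(mu ** 2)              # 4.0
```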

In general, applying a transformation to an unbiased estimator need not preserve unbiasedness. For a convex transformation \(g\), Jensen's inequality makes this precise: \(E[g(\hat{\theta})]\ge g(E[\hat{\theta}])\).

Some Complications

For a heavy-tailed distribution, the mean may be a poor estimator, and the median may work better.

In general, \(\bar{X}_{\mathrm{tr}(10)}\) is a good choice when you don't know the underlying distribution. A 10 or 20% trimmed mean is a robust estimator; the mean and median are not (i.e. there are more distributions for which they are poor estimators).
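
A sketch of this robustness on heavy-tailed (standard Cauchy) data, using SciPy's `trim_mean` for the trimmed mean:

```python
import numpy as np
from scipy.stats import trim_mean

# The true center of a standard Cauchy is 0. The sample mean can be wildly
# off (the Cauchy has no mean); the median and trimmed mean stay close.
rng = np.random.default_rng(5)
x = rng.standard_cauchy(size=1_000)

print(x.mean())            # unstable, occasionally far from 0
print(np.median(x))        # close to 0
print(trim_mean(x, 0.1))   # 10% trimmed from each tail, close to 0
```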

Reporting a Point Estimate: The Standard Error

The standard error of an estimator \(\hat{\theta}\) is its standard deviation \(\sigma_{\hat{\theta}}=\sqrt{V(\hat{\theta})}\). If you need to estimate parameters in order to compute the standard error, the resulting value is called the estimated standard error, often denoted \(\hat{\sigma}_{\hat{\theta}}\) or \(s_{\hat{\theta}}\).

When \(\hat{\theta}\) has approximately a normal distribution (common for large \(n\)), then we can be reasonably confident that the true value of \(\theta\) is within 2 standard errors of \(\hat{\theta}\).

If \(\hat{\theta}\) is not approximately normal (or its distribution is unknown), Chebyshev's inequality still guarantees that the estimate deviates from \(\theta\) by more than 4 standard errors at most \(1/16\approx 6\%\) of the time.

Often it is hard to figure out an expression for \(\sigma_{\hat{\theta}}\). So use bootstrapping.

Bias-Variance Tradeoff

While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample.

The bias is a systematic error, e.g. from using the wrong model or estimator. The variance represents the error you'll get from the randomness of any one sample.

One measure which is used to try to reflect both types of difference is the mean square error

\begin{equation*} \operatorname{MSE}(\hat{\theta})=\operatorname{E}\big[(\hat{\theta}-\theta)^2\big] \end{equation*}

This can be shown to be equal to the square of the bias, plus the variance:

\begin{equation*} \operatorname{MSE}(\hat{\theta})=(\operatorname{E}[\hat{\theta}]-\theta)^2 + \operatorname{E}[\,(\hat{\theta} - \operatorname{E}[\,\hat{\theta}\,])^2\,] \end{equation*}

The first term is the square of the bias. The second term is the variance.

A derivation is here.
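
Sketching it inline as well: add and subtract \(\operatorname{E}[\hat{\theta}]\) inside the square and expand; the cross term vanishes because \(\operatorname{E}\big[\hat{\theta}-\operatorname{E}[\hat{\theta}]\big]=0\).

\begin{equation*} \operatorname{E}\big[(\hat{\theta}-\theta)^2\big]=\operatorname{E}\big[\big(\hat{\theta}-\operatorname{E}[\hat{\theta}]+\operatorname{E}[\hat{\theta}]-\theta\big)^2\big]=\operatorname{E}\big[(\hat{\theta}-\operatorname{E}[\hat{\theta}])^2\big]+2\big(\operatorname{E}[\hat{\theta}]-\theta\big)\operatorname{E}\big[\hat{\theta}-\operatorname{E}[\hat{\theta}]\big]+\big(\operatorname{E}[\hat{\theta}]-\theta\big)^2 \end{equation*}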

Sometimes (often?), an estimator that minimizes the bias does not minimize the variance, and people instead look for an estimator that minimizes the MSE. I do not know whether lower bias necessarily implies greater variance; I suspect the tradeoff is more “empirical” than a theorem.

Bootstrapping

If you know the distribution:

Take a sample of size \(n\) and compute the estimate \(\hat{\theta}\) of the parameter \(\theta\). An example will make this more concrete:

Let \(\hat{\theta}=2.5\). Then consider \(f(x;2.5)\). Take \(B\) samples of size \(n\) from \(f\), and compute \(\hat{\theta}^{*}\) for each one. Then compute the standard error using \(S_{\hat{\theta}}=\sqrt{\frac{1}{B-1}\sum\left(\hat{\theta}_{i}^{*}-\bar{\theta}^{*}\right)^{2}}\), where \(\bar{\theta}^{*}\) is the mean of the \(\hat{\theta}_{i}^{*}\).

This is a bootstrap estimate of the standard error. \(B\) is usually 100-200.
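
A minimal parametric-bootstrap sketch. The exponential model, the rate, \(n\), and \(B\) are all illustrative assumptions, not anything fixed by the text:

```python
import numpy as np

# Parametric bootstrap: estimate theta_hat from the data, then draw B
# fresh samples of size n from f(x; theta_hat) and re-estimate each time.
rng = np.random.default_rng(6)

n = 40
data = rng.exponential(scale=1 / 2.5, size=n)   # "observed" data, rate 2.5
lam_hat = 1 / data.mean()                       # MLE of the exponential rate

B = 200
boot = np.empty(B)
for b in range(B):
    resample = rng.exponential(scale=1 / lam_hat, size=n)
    boot[b] = 1 / resample.mean()

# Bootstrap standard error: sqrt( (1/(B-1)) * sum (theta*_i - theta*_bar)^2 )
print(boot.std(ddof=1))
```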

But what if you have no idea about the underlying distribution? Use the original sample itself, but sample with replacement.
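
The distribution-free version of the same sketch: resample the observed data itself, with replacement (the median is just an example statistic):

```python
import numpy as np

# Nonparametric bootstrap standard error of the sample median.
rng = np.random.default_rng(7)
data = rng.exponential(scale=0.4, size=40)      # stand-in for observed data

B = 200
boot = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)
    boot[b] = np.median(resample)

print(boot.std(ddof=1))   # bootstrap estimate of the median's standard error
```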

Useful Application

Suppose you want to estimate the proportion of students who violate the honor code, but they may not answer a direct survey question truthfully. One scheme: prepare 100 cards, 50 of which ask whether they've violated the honor code, and 50 of which ask whether the last digit of their phone number is 0, 1 or 2. Each student draws a card at random and simply answers “yes” or “no”, so the interviewer never knows which question was answered. Let \(p\) be the proportion that violates the honor code and \(\lambda=P(\mathrm{yes})\). Then \(\lambda=0.5p+0.5(0.3)=0.5p+0.15\). Note that if \(Y\) is the number of yes responses, then \(Y\sim \mathrm{Bin}(n,\lambda)\) and \(Y/n\) is an unbiased estimator of \(\lambda\), so \(2Y/n-0.3\) is an unbiased estimator of \(p\). What about its standard error \(\sigma_{\hat{p}}\), though?
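
A simulation sketch of the scheme (the true \(p\) and the survey size \(n\) are invented for illustration); the last line prints the simulated standard error, which matches \(2\sqrt{\lambda(1-\lambda)/n}\):

```python
import numpy as np

# Randomized response: each student answers the honor-code question with
# probability 0.5 (yes with prob p) or the phone-digit question with
# probability 0.5 (yes with prob 0.3). Check that 2Y/n - 0.3 is unbiased for p.
rng = np.random.default_rng(8)
p_true, n, reps = 0.12, 500, 20_000

lam = 0.5 * p_true + 0.5 * 0.3            # P(yes) = 0.5p + 0.15
y = rng.binomial(n, lam, size=reps)       # yes-count for each simulated survey
p_hat = 2 * y / n - 0.3

print(p_hat.mean())                       # ~ p_true
print(p_hat.std())                        # ~ 2*sqrt(lam*(1-lam)/n)
```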