What is the relationship between population sampling and sampling distributions?

STATISTICS

Learn about Central Limit Theorem, Standard Error, and Bootstrapping in the context of the sampling distribution


It is important to distinguish between the data distribution (aka population distribution) and the sampling distribution. The distinction is critical when working with the central limit theorem or other concepts like the standard deviation and standard error.

In this post, we will go over the above concepts as well as bootstrapping as a way to estimate the sampling distribution. In particular, we will cover the following:

  • Data distribution (aka population distribution)
  • Sampling distribution
  • Central limit theorem (CLT)
  • Standard error and its relation with the standard deviation
  • Bootstrapping

Data Distribution

Much of statistics deals with making inferences from samples drawn from a larger population. Hence, we need to distinguish between analysis done on the original data and analysis done on its samples. First, let’s go over the definition of the data distribution:

Data distribution: The frequency distribution of individual data points in the original dataset.

Let’s first generate random skewed data that will result in a non-normal (non-Gaussian) data distribution. The reason behind generating non-normal data is to better illustrate the relation between data distribution and the sampling distribution.

So, let’s import the Python plotting packages and generate right-skewed data.
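The original code for this step is not reproduced here, so the following is only a minimal sketch of how right-skewed data could be generated. It assumes a log-normal population and uses only the Python standard library; the seed, the distribution parameters, and the `skewness` helper are hypothetical choices, so the numbers it prints will not match the ones reported later in this post.

```python
import random
import statistics

# Assumption: a log-normal population stands in for the post's data; the
# original generator, its parameters, and its plotting code are not shown.
rng = random.Random(42)
data = [rng.lognormvariate(0.0, 0.75) for _ in range(10_000)]


def skewness(values):
    """Average cubed z-score; a positive value indicates a right skew."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


data_mean = statistics.fmean(data)
data_median = statistics.median(data)
print(f"Mean: {data_mean:.5f}")
print(f"Median: {data_median:.5f}")
print(f"Skewness: {skewness(data):.3f}")
```

A mean above the median together with a positive skewness is a quick numeric check that the data is right-skewed. Passing `data` to the histogram function of a plotting library would give the visual version of the same check.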

The histogram of generated right-skewed data (Image by author)

Sampling Distribution

To build a sampling distribution, you repeatedly draw samples from the dataset and compute a statistic, such as the mean, for each one. It’s very important to differentiate between the data distribution and the sampling distribution, as most confusion comes from not being clear about whether an operation is done on the original dataset or on its (re)samples.

Sampling distribution: The frequency distribution of a sample statistic (aka metric) over many samples drawn from the dataset[1]. Put simply, the distribution of sample statistics is called the sampling distribution. It can be obtained as follows:

  1. Draw a sample from the dataset.
  2. Compute a statistic/metric of the drawn sample in Step 1 and save it.
  3. Repeat Steps 1 and 2 many times.
  4. Plot the distribution (histogram) of the computed statistic.

>>> Mean: 0.23269

The histogram of the sample means drawn from the population (Image by author)

The above sampling distribution is simply the histogram of the mean of each drawn sample (here, we drew samples of 50 elements over 2000 iterations). The mean of the sampling distribution is around 0.23, as can be seen by computing the mean of all sample means.
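The four steps above can be sketched roughly as follows. This is an illustration rather than the code behind the figure: the log-normal dataset, the seeds, and the `sampling_distribution` helper are assumptions, so the printed mean will differ from 0.23. The sample size (50) and the number of iterations (2000) are the ones mentioned above.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical stand-in for the post's right-skewed dataset.
data = [rng.lognormvariate(0.0, 0.75) for _ in range(10_000)]


def sampling_distribution(values, statistic, sample_size, n_iterations, rng):
    """Return the statistic of each of n_iterations random samples."""
    results = []
    for _ in range(n_iterations):
        # Step 1: draw a sample (without replacement within one sample).
        sample = rng.sample(values, sample_size)
        # Step 2: compute the statistic and save it.
        results.append(statistic(sample))
    # Step 3 is the loop itself; Step 4 (plotting) is left to the caller.
    return results


sample_means = sampling_distribution(
    data, statistics.fmean, sample_size=50, n_iterations=2000, rng=rng
)
print(f"Mean: {statistics.fmean(sample_means):.5f}")
```

The list `sample_means` is the sampling distribution of the mean; plotting its histogram corresponds to Step 4. Any other statistic, for example `statistics.median`, can be passed in place of `statistics.fmean`.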

⚠️ Do not confuse the sampling distribution with the sample distribution. The sampling distribution is the distribution of a sample statistic (e.g. the mean) over many samples, whereas the sample distribution is the distribution of the data points within a single sample taken from the population.

Central Limit Theorem (CLT)

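💡 Central Limit Theorem: As the sample size gets larger, the sampling distribution of the mean tends toward a normal distribution (bell-curve shape), even when the data distribution itself is not normal (provided the population has a finite variance).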

The CLT is a statement about the sampling distribution, not the data distribution, and this is an important distinction. The CLT underpins much of hypothesis testing and confidence interval analysis, so it is important to be aware of it, even though the wide use of the bootstrap in data science means the theorem is less central in practice[1]. More on bootstrapping is provided later in the post.
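One way to see the CLT at work is to measure how skewed the sampling distribution of the mean is for different sample sizes. The sketch below does that under stated assumptions: the log-normal dataset, the sample sizes (2, 20, 200), and the `skewness` and `mean_skewness` helpers are all hypothetical and chosen only for illustration.

```python
import random
import statistics

rng = random.Random(7)
# Hypothetical right-skewed dataset (log-normal), as in the earlier sketches.
data = [rng.lognormvariate(0.0, 0.75) for _ in range(10_000)]


def skewness(values):
    """Average cubed z-score; values near 0 suggest a symmetric shape."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


def mean_skewness(sample_size, n_iterations=2000):
    """Skewness of the sampling distribution of the mean for one sample size."""
    means = [
        statistics.fmean(rng.sample(data, sample_size))
        for _ in range(n_iterations)
    ]
    return skewness(means)


skew_by_size = {n: mean_skewness(n) for n in (2, 20, 200)}
for n, skew in skew_by_size.items():
    print(f"n={n:>3}: skewness of sample means = {skew:.3f}")
```

If the CLT applies, the skewness should shrink toward zero as the sample size grows, meaning the histogram of sample means looks increasingly symmetric and bell-shaped even though the data itself stays skewed.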

Standard Error (SE)

The standard error is a metric to describe the variability of a statistic in the sampling distribution. For the sample mean, we can compute the standard error as follows:

$$\mathrm{SE} = \frac{s}{\sqrt{n}}$$

where s denotes the standard deviation of the sample values and n denotes the sample size. It can be seen from the formula that as the sample size increases, the SE decreases.
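As a small worked example of this relationship, the sketch below divides the sample standard deviation by the square root of the sample size. The `standard_error` helper and the toy values are hypothetical, picked so that the arithmetic is easy to verify by hand.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: s / sqrt(n), with s the sample std dev."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical toy sample: mean 5, squared deviations sum to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error(sample)
print(f"n = {len(sample)}, SE = {se:.4f}")

# Repeating the values four times quadruples n while s barely changes,
# so the standard error drops to roughly half.
se_larger = standard_error(sample * 4)
print(f"n = {len(sample) * 4}, SE = {se_larger:.4f}")
```

Because n sits under a square root, halving the standard error takes about four times as much data.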

We can estimate the standard error using the following approach[1]:

  1. Draw a new sample from a dataset.
  2. Compute a statistic/metric (e.g., mean) of the drawn sample in Step 1 and save it.
  3. Repeat Steps 1 and 2 several times.
  4. An estimate of the standard error is obtained by computing the standard deviation of the statistics saved in Step 2.
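The procedure above can be sketched as follows and compared with the analytical value (the standard deviation of the data divided by the square root of the sample size). The log-normal dataset, the seed, the sample size of 50, and the 2000 repetitions are assumptions made for this illustration.

```python
import math
import random
import statistics

rng = random.Random(3)
# Hypothetical right-skewed dataset standing in for the post's data.
data = [rng.lognormvariate(0.0, 0.75) for _ in range(10_000)]

sample_size = 50
n_iterations = 2000

# Steps 1-3: draw a new sample, compute its mean, and repeat.
sample_means = [
    statistics.fmean(rng.sample(data, sample_size))
    for _ in range(n_iterations)
]

# Step 4: the spread of the saved means estimates the standard error.
se_estimate = statistics.stdev(sample_means)

# Analytical counterpart, using the spread of the full dataset.
se_formula = statistics.pstdev(data) / math.sqrt(sample_size)

print(f"SE from repeated sampling: {se_estimate:.5f}")
print(f"SE from the formula:       {se_formula:.5f}")
```

The two values should be close, which illustrates that the standard error is nothing more than the standard deviation of the sampling distribution.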

While the above approach can be used to estimate the standard error, we can use bootstrapping instead, which is preferable. I will go over that in the next section.

⚠️ Do not confuse the standard error with the standard deviation. The standard deviation captures the variability of the individual data points (how spread out the data is), whereas the standard error captures the variability of a sample statistic.

Bootstrapping

Bootstrapping is an easy way of estimating the sampling distribution by randomly drawing resamples from the observed data (with replacement) and computing each resample’s statistic. Bootstrapping does not rely on the CLT or on assumptions about the shape of the distribution, and it is the standard way of estimating SE[1].

Luckily, we can use the bootstrap() function from the MLxtend library (you can read my post on the MLxtend library, which covers other interesting functionality). This function also provides the flexibility to pass a custom sample statistic.

>>> Mean: 0.23293 
>>> Standard Error: +/- 0.00144
>>> CI95: [0.23023, 0.23601]
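To show the mechanics behind such a call, here is a minimal standard-library sketch of a percentile bootstrap. It is not MLxtend's implementation: the function name `bootstrap_statistic`, its parameters, and the log-normal sample are assumptions made for illustration, so its numbers will differ from the output above.

```python
import random
import statistics


def bootstrap_statistic(sample, statistic, num_rounds=1000, ci=0.95, seed=None):
    """Percentile bootstrap: return (estimate, standard error, (lower, upper))."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample the observed data with replacement, keeping the original size,
    # and compute the statistic of every resample.
    replicates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(num_rounds)
    )
    # Percentile confidence interval read off the sorted replicates.
    alpha = (1.0 - ci) / 2.0
    lower = replicates[int(alpha * num_rounds)]
    upper = replicates[min(int((1.0 - alpha) * num_rounds), num_rounds - 1)]
    # The standard error is the standard deviation of the replicates.
    return statistic(sample), statistics.stdev(replicates), (lower, upper)


# Hypothetical observed sample (right-skewed), standing in for the post's data.
data_rng = random.Random(11)
sample = [data_rng.lognormvariate(0.0, 0.75) for _ in range(1000)]

estimate, std_err, (ci_low, ci_high) = bootstrap_statistic(
    sample, statistics.fmean, num_rounds=1000, ci=0.95, seed=123
)
print(f"Mean: {estimate:.5f}")
print(f"Standard Error: +/- {std_err:.5f}")
print(f"CI95: [{ci_low:.5f}, {ci_high:.5f}]")
```

Swapping `statistics.fmean` for another function, such as `statistics.median`, gives the bootstrap standard error and interval of that statistic instead, which is where the approach is most useful because no closed-form formula is needed.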

Conclusion

The main takeaway is to be clear about whether a computation is done on the original dataset or on samples drawn from it. Plotting a histogram of the data gives the data distribution, whereas plotting a sample statistic computed over many samples of the data gives a sampling distribution. On a similar note, the standard deviation tells us how spread out the data is, whereas the standard error tells us how spread out a sample statistic is.

Another takeaway is that even if the original data distribution is non-normal, the sampling distribution of the mean is approximately normal when the sample size is large enough (central limit theorem).

You can find the Jupyter notebook for this blog post on GitHub.

Thanks for reading!

Originally published at https://www.ealizadeh.com.

I’m a senior data scientist and an engineer and I’d like to write about Statistics, Machine Learning, Time Series Analysis, and interesting Python libraries, and tips & tricks.


References

[1] P. Bruce & A. Bruce (2017), Practical Statistics for Data Scientists, First Edition, O’Reilly

What is the relationship between the population mean and the mean of the distribution of sample means for a specific sample size?

The mean of the distribution of sample means is called the expected value of M, and it is always equal to the population mean μ.

What is the difference between sampling and population distribution?

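The population distribution describes every value in the population and is usually not observed directly. Your sample is the only data you actually get to observe, so the sample distribution consists of your observed values from the population you are trying to study. The sampling distribution, in turn, is the distribution of a statistic computed over many such samples.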

Is the sampling distribution equal to the population distribution?

No. The mean of the sampling distribution of the sample mean is equal to the population mean, and its variance is equal to the population variance divided by the sample size. The central limit theorem adds that this distribution approaches a normal distribution as the sample size grows.
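In symbols, with M for the sample mean and μ for the population mean as above, and with σ for the population standard deviation and n for the sample size (notation introduced here for illustration), these relations can be sketched as follows, assuming independent draws from a population with finite variance:

$$E[M] = \mu, \qquad \operatorname{Var}(M) = \frac{\sigma^2}{n}, \qquad \operatorname{SD}(M) = \frac{\sigma}{\sqrt{n}}$$

The last quantity is the standard error of the mean, which is why the sampling distribution becomes narrower than the population distribution as n grows.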