Sampling: How to Draw Reliable Conclusions from a Small Group

In an ideal world, we could study entire populations to obtain perfectly accurate information. But reality is quite different: time, budget, or accessibility constraints often force us to work with samples. Sampling, a fundamental technique in statistics, allows us to draw valid conclusions about a population by studying just a fraction of it. Let's discover how this approach, when properly implemented, can provide surprisingly reliable results.

What is Sampling?

Sampling is the process of selecting a subset of individuals or observations from a larger population. The goal is to use the characteristics of this sample to estimate those of the entire population.

For example, rather than surveying all 330 million Americans about their voting intentions, a polling organization can survey a carefully selected sample of 1,000 people to predict election results with reasonable accuracy.

The following program visually and statistically demonstrates how a random sample compares to its population, illustrating key concepts in descriptive statistics and sampling theory. If the sample mean and standard deviation are close to their population counterparts, the sample is likely a good representation of the population.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set a consistent style for all plots
sns.set(style="whitegrid")

# Create a simulated population
np.random.seed(42)
population_size = 10000
population = np.random.normal(170, 10, population_size)  # Height distribution with mean=170cm, sd=10cm

# Calculate population parameters
population_mean = np.mean(population)
population_std = np.std(population)

# Visualize the population distribution
plt.figure(figsize=(10, 6))
plt.hist(population, bins=30, alpha=0.7, color='skyblue')
plt.axvline(population_mean, color='red', linestyle='dashed', linewidth=2, 
            label=f'Population Mean: {population_mean:.2f}cm')
plt.title('Height Distribution in the Population')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

# Take a random sample
sample_size = 100
sample = np.random.choice(population, size=sample_size, replace=False)

# Calculate sample statistics (ddof=1 gives the unbiased sample estimate)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)

# Visualize the sample
plt.figure(figsize=(10, 6))
plt.hist(sample, bins=15, alpha=0.7, color='lightgreen')
plt.axvline(sample_mean, color='green', linestyle='dashed', linewidth=2, 
            label=f'Sample Mean: {sample_mean:.2f}cm')
plt.axvline(population_mean, color='red', linestyle='dashed', linewidth=2, 
            label=f'Population Mean: {population_mean:.2f}cm')
plt.title(f'Height Distribution in a Random Sample (n={sample_size})')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.legend()
plt.show()

print(f"Population Mean: {population_mean:.2f}cm, Population Std Dev: {population_std:.2f}cm")
print(f"Sample Mean: {sample_mean:.2f}cm, Sample Std Dev: {sample_std:.2f}cm")
print(f"Difference: {abs(population_mean - sample_mean):.2f}cm")

Main Sampling Methods

Several sampling techniques exist, each with its advantages and limitations:

Simple random sampling: Each member of the population has an equal probability of being selected. It’s like drawing names randomly from a hat. This method, although conceptually simple, is often difficult to implement in practice as it requires a complete list of the population.

Stratified sampling: The population is first divided into homogeneous subgroups (strata) according to characteristics such as age, gender, or region, then random samples are taken from each stratum. This method ensures that all important subgroups are proportionally represented.

Cluster sampling: The population is divided into natural groups (clusters), such as neighborhoods or schools, then a few clusters are randomly selected and all their members are included in the sample. This approach is often more economical but can introduce more variability.

Systematic sampling: After selecting a random starting point, every nth element of the population is chosen. For example, in a list of 10,000 people, you might select every 100th person to obtain a sample of 100 individuals.

The following code demonstrates four different sampling methods — simple random, stratified, cluster, and systematic — using a synthetic dataset with two distinct groups (A and B). It shows how each method selects data and how representative the samples are.

# Demonstrate different sampling methods
np.random.seed(42)

# Create a more complex population with two features
n = 1000
feature1 = np.concatenate([
    np.random.normal(10, 2, n//2),  # Group A
    np.random.normal(20, 2, n//2)   # Group B
])
feature2 = np.concatenate([
    np.random.normal(50, 10, n//2),  # Group A
    np.random.normal(70, 10, n//2)   # Group B
])

# Create a DataFrame
df = pd.DataFrame({
    'Feature1': feature1,
    'Feature2': feature2,
    'Group': ['A'] * (n//2) + ['B'] * (n//2)
})

# Visualize the population
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Feature1', y='Feature2', hue='Group', data=df, alpha=0.6)
plt.title('Population Distribution')
plt.show()

# 1. Simple Random Sampling
sample_size = 100
simple_random_sample = df.sample(sample_size, random_state=42)

# 2. Stratified Sampling: draw an equal-sized random sample from each stratum
stratified_sample = df.groupby('Group', group_keys=False).sample(
    sample_size // 2, random_state=42)

# 3. Cluster Sampling (simulated by creating 10 clusters and selecting 3)
df['Cluster'] = np.random.randint(0, 10, size=len(df))
selected_clusters = np.random.choice(10, 3, replace=False)
cluster_sample = df[df['Cluster'].isin(selected_clusters)]

# 4. Systematic Sampling: random starting point, then every step-th row
step = len(df) // sample_size
start = np.random.randint(0, step)
systematic_sample = df.iloc[start::step].head(sample_size)

# Visualize the different samples (population shown in gray for context)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Simple Random Sample (gray population drawn first so sample points stay visible)
axes[0, 0].scatter(df['Feature1'], df['Feature2'], color='gray', alpha=0.1)
sns.scatterplot(x='Feature1', y='Feature2', hue='Group', data=simple_random_sample, 
                alpha=0.8, ax=axes[0, 0])
axes[0, 0].set_title(f'Simple Random Sample (n={len(simple_random_sample)})')

# Stratified Sample
axes[0, 1].scatter(df['Feature1'], df['Feature2'], color='gray', alpha=0.1)
sns.scatterplot(x='Feature1', y='Feature2', hue='Group', data=stratified_sample, 
                alpha=0.8, ax=axes[0, 1])
axes[0, 1].set_title(f'Stratified Sample (n={len(stratified_sample)})')

# Cluster Sample
axes[1, 0].scatter(df['Feature1'], df['Feature2'], color='gray', alpha=0.1)
sns.scatterplot(x='Feature1', y='Feature2', hue='Group', data=cluster_sample, 
                alpha=0.8, ax=axes[1, 0])
axes[1, 0].set_title(f'Cluster Sample (n={len(cluster_sample)})')

# Systematic Sample
axes[1, 1].scatter(df['Feature1'], df['Feature2'], color='gray', alpha=0.1)
sns.scatterplot(x='Feature1', y='Feature2', hue='Group', data=systematic_sample, 
                alpha=0.8, ax=axes[1, 1])
axes[1, 1].set_title(f'Systematic Sample (n={len(systematic_sample)})')

plt.tight_layout()
plt.show()

Sample Size and Margin of Error

A crucial question in sampling is: “How many observations are needed?” The answer depends on several factors:

Population variability: The more heterogeneous the population, the larger the sample will need to be to represent it faithfully.

Desired confidence level: Usually set at 95%, it indicates the probability that the confidence interval contains the true population parameter.

Acceptable margin of error: The higher the required precision (i.e., the smaller the acceptable margin of error), the larger the sample will need to be.
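
The short sketch below makes the first of these factors (population variability) concrete: holding the sample size fixed at n = 100, a more spread-out population produces noticeably less precise estimates of the mean. Both populations here are illustrative synthetic data, not measurements from a real study.

# Effect of population variability: same sample size, different spread
np.random.seed(0)
n = 100
for sd in [5, 20]:
    pop = np.random.normal(170, sd, 20000)
    means = [np.mean(np.random.choice(pop, n, replace=False))
             for _ in range(1000)]
    print(f"Population sd={sd:>2}cm: std dev of sample means = "
          f"{np.std(means):.2f}cm (theory: {sd / np.sqrt(n):.2f}cm)")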

Contrary to intuition, the necessary sample size depends little on the total population size. A sample of 1,000 people can be sufficient to represent a population of 10,000 or 10 million, with a similar margin of error.
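
A quick simulation illustrates this counterintuitive fact. The sketch below, again using illustrative synthetic populations, draws samples of 1,000 from a population of 10,000 and from one of 10 million; the spread of the sample means is nearly identical in both cases, because precision is governed by the sample size, not the population size (the small remaining gap is the finite-population correction).

# Same sample size, very different population sizes: similar precision
rng = np.random.default_rng(1)
n = 1000
for pop_size in [10_000, 10_000_000]:
    pop = rng.normal(170, 10, pop_size)
    # Sample indices without replacement (shuffle=False is faster here)
    means = [pop[rng.choice(pop_size, size=n, replace=False, shuffle=False)].mean()
             for _ in range(300)]
    print(f"Population of {pop_size:>10,}: std dev of sample means = "
          f"{np.std(means):.3f}cm")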

The classic formula for the margin of error (E) of an estimated proportion is:
E = z × √(p(1-p)/n)

Where z is the z-score corresponding to the confidence level (1.96 for 95%), p is the estimated proportion (0.5 gives the maximum margin of error), and n is the sample size.
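
Plugging the earlier polling example into this formula is a useful sanity check. The short snippet below computes the margin of error for n = 1,000 with the worst-case p = 0.5, then inverts the formula to find the sample size needed for a given target margin (the 3-point target is just an illustrative choice).

# Worked example: margin of error E = z * sqrt(p(1-p)/n)
z = stats.norm.ppf(0.975)  # z-score for a 95% confidence level (~1.96)
p = 0.5                    # worst-case proportion
n = 1000
E = z * np.sqrt(p * (1 - p) / n)
print(f"Margin of error for n={n}: ±{E:.1%}")  # about ±3.1%

# Inverting the formula: sample size needed for a target margin of error
target_E = 0.03
n_required = int(np.ceil(z**2 * p * (1 - p) / target_E**2))
print(f"Sample size needed for ±{target_E:.0%}: {n_required}")  # about 1,068

The longer simulation below then verifies this inverse-square-root relationship empirically across a range of sample sizes.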

# Demonstrate the relationship between sample size and margin of error
sample_sizes = [10, 30, 100, 300, 1000, 3000]
num_simulations = 1000
confidence_level = 0.95
z = stats.norm.ppf((1 + confidence_level) / 2)  # z-score for 95% confidence

# Run simulations for different sample sizes
results = []
for size in sample_sizes:
    sample_means = []
    for _ in range(num_simulations):
        sample = np.random.choice(population, size=size, replace=False)
        sample_means.append(np.mean(sample))

    # Calculate the observed margin of error (half-width of the ~95% band)
    observed_margin = np.std(sample_means) * z

    # Theoretical margin of error; it slightly overstates the observed one
    # at large n because we sample without replacement from a finite population
    theoretical_margin = population_std / np.sqrt(size) * z

    results.append({
        'Sample Size': size,
        'Observed Margin': observed_margin,
        'Theoretical Margin': theoretical_margin
    })

# Convert to DataFrame
results_df = pd.DataFrame(results)

# Plot the relationship
plt.figure(figsize=(12, 6))
plt.plot(results_df['Sample Size'], results_df['Observed Margin'], 'o-', label='Observed Margin of Error')
plt.plot(results_df['Sample Size'], results_df['Theoretical Margin'], 's--', label='Theoretical Margin of Error')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Sample Size (log scale)')
plt.ylabel('Margin of Error (log scale)')
plt.title('Relationship Between Sample Size and Margin of Error')
plt.grid(True, which="both", ls="-")
plt.legend()
plt.show()

# Create a table of results
print("Sample Size vs. Margin of Error:")
print(results_df.to_string(index=False, float_format=lambda x: f"{x:.4f}"))

Pitfalls to Avoid

Sampling may seem simple in theory, but several biases can compromise its validity:

Selection bias: If certain groups systematically have a higher chance of being included in the sample than others, the results will be biased. This is what happened in the famous prediction error of Literary Digest magazine in 1936, which predicted Roosevelt’s defeat based on a sample biased toward the wealthy classes.

Non-response bias: People who agree to participate in a study may systematically differ from those who refuse, thus creating an unrepresentative sample.

Survivorship bias: By studying only the “survivors” of a process, we ignore the cases that didn’t survive, thus skewing the conclusions.
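
These biases are easy to reproduce in simulation. The sketch below illustrates selection bias with a hypothetical scenario built on the height population from the first example: taller individuals are given a higher probability of being included, and the sample mean drifts systematically above the true population mean.

# Simulate selection bias: inclusion probability increases with height
np.random.seed(7)
weights = stats.norm.cdf((population - population_mean) / population_std)
weights /= weights.sum()  # normalize to a probability distribution

biased_sample = np.random.choice(population, size=1000, replace=False, p=weights)
fair_sample = np.random.choice(population, size=1000, replace=False)

print(f"Population mean:    {population_mean:.2f}cm")
print(f"Fair sample mean:   {np.mean(fair_sample):.2f}cm")
print(f"Biased sample mean: {np.mean(biased_sample):.2f}cm  (systematically too high)")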

Applications in Signal Processing

In signal processing, sampling takes on a particular dimension. The Nyquist-Shannon sampling theorem, a pillar in this field, states that a continuous band-limited signal can be perfectly reconstructed from discrete samples if the sampling frequency is at least twice the maximum frequency present in the signal.

This principle is fundamental in analog-to-digital conversion, audio and video compression, and telecommunications. Without it, digital music, phone calls, and HD television would be impossible.

The following code demonstrates the Nyquist–Shannon theorem in action. The test signal's highest frequency component is 2 Hz, so its Nyquist rate is 4 Hz; the signal is sampled at 4, 8, and 16 Hz and then reconstructed with ideal sinc interpolation.

# Demonstrate the Nyquist-Shannon sampling theorem

# Create a continuous signal
def continuous_signal(t):
    return np.sin(2 * np.pi * 1 * t) + 0.5 * np.sin(2 * np.pi * 2 * t)

# Generate a high-resolution version of the signal (approximating continuous)
t_continuous = np.linspace(0, 1, 1000)
y_continuous = continuous_signal(t_continuous)

# Sample at different rates
sampling_rates = [4, 8, 16]  # in Hz
fig, axes = plt.subplots(len(sampling_rates), 1, figsize=(12, 10))

for i, rate in enumerate(sampling_rates):
    # Sample the signal at the stated rate (uniform spacing of 1/rate seconds)
    t_sampled = np.arange(0, 1, 1 / rate)
    y_sampled = continuous_signal(t_sampled)

    # Reconstruct using sinc interpolation (ideal reconstruction)
    y_reconstructed = np.zeros_like(t_continuous)
    for j, tj in enumerate(t_sampled):
        y_reconstructed += y_sampled[j] * np.sinc(rate * (t_continuous - tj))

    # Plot
    axes[i].plot(t_continuous, y_continuous, 'b-', label='Original Signal')
    axes[i].plot(t_sampled, y_sampled, 'ro', label='Samples')
    axes[i].plot(t_continuous, y_reconstructed, 'g--', label='Reconstructed Signal')
    axes[i].set_title(f'Sampling Rate: {rate} Hz (Nyquist Rate: 4 Hz)')
    axes[i].set_xlabel('Time (s)')
    axes[i].set_ylabel('Amplitude')
    axes[i].legend()
    axes[i].grid(True)

plt.tight_layout()
plt.show()

Conclusion

Sampling is much more than a simple statistical technique: it’s a bridge between the particular and the general, allowing us to extrapolate knowledge from limited observations. Whether it’s opinion polls, market research, clinical trials, or signal processing, mastering sampling is essential for drawing reliable conclusions in a world where exhaustiveness is rarely possible.

By understanding the principles of sampling, its methods, and potential pitfalls, you’ll be better equipped to evaluate the quality of studies based on samples and to design your own research with rigor. As statistician W. Edwards Deming said: “Without data, you’re just another person with an opinion.” And with sampling, you can transform limited data into valuable knowledge.
