In this post we will look at how to analyse data. Unfortunately, there is no single “right way” to do it: the chosen method depends on the problem, the purpose of the analysis, and the characteristics of the data. I will start by defining two types of data sets, population and sample. Then I will show some procedures for computing numerical summary measures.

Population and Sample

Population

A population refers to the entire group that you want to study or draw conclusions about. It includes all individuals, items, or events that fit a specific set of characteristics.

Examples of Population:

All college students in the U.S.
Every citizen in a country.
All products manufactured in a factory.
Every customer of a company.

Sample

Since populations can be very large, it is often impractical or impossible to collect data from every individual. A sample is a smaller subset of the population that is selected to represent the whole group. To support accurate conclusions about the population, the sample should be randomly selected and representative.

Examples of Samples:

A survey of 1,000 college students to estimate the study habits of all students.
A medical trial with 500 patients to determine the effectiveness of a new drug.
Testing 100 random light bulbs from a factory to estimate the defect rate of all light bulbs produced.

Population vs. Sample

A population is the entire group that you want to study, and a sample is an observed subset of the population's values.
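
As a minimal sketch of the relationship between the two, the code below draws a simple random sample from a population with NumPy. The population is a made-up list of 10,000 ID numbers, and the sample size and seed are arbitrary values chosen only for illustration.

```python
import numpy as np

# Hypothetical population: ID numbers of 10,000 students (made up for illustration)
population = np.arange(1, 10_001)

# Seeded generator so the draw is reproducible; the seed value is arbitrary
rng = np.random.default_rng(seed=42)

# Simple random sample of 1,000 IDs, drawn without replacement
sample = rng.choice(population, size=1_000, replace=False)

print("Population size:", population.size)
print("Sample size:", sample.size)
```

Drawing without replacement means no individual can be selected twice, which matches how a survey or quality check would usually pick its subjects.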

Measures of Central Tendency

The measures of central tendency indicate the central or typical value of a dataset. We will apply these measures to the following table of grades, on a scale from 0 to 100, for a math class of 10 students. We will also calculate each measure of central tendency in Python, using the NumPy library.

Name	Grade
Liam	85
Sofia	92
Hiroshi	78
Aisha	88
Mateo	95
Elena	90
Ravi	82
Noah	87
Fatima	91
Alejandro	89

Mean

Mean: The sum of all values divided by the number of observations. In the equation below, the sum of all grades is divided by the total number of observations:

\[Mean = \mu = \frac{\sum X_i}{N} = \frac{85+92+78+88+95+90+82+87+91+89}{10} = 87.7\]

The mean can be calculated in Python with the mean function of the NumPy library:

import numpy as np

grades = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89]

mean_grade = np.mean(grades)
print("Mean = ", mean_grade)
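
To connect the NumPy call to the definition, here is a small sketch that computes the same mean directly as "sum divided by count", using only built-in Python functions. The variable name manual_mean is just an illustrative choice.

```python
grades = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89]

# Sum of all values divided by the number of observations
manual_mean = sum(grades) / len(grades)

print("Mean =", manual_mean)
```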

Median

Median: The middle value of an ordered dataset.
- If the number of observations is odd, the median is the middle value.
  - Ex:
    - Values: 5, 3, 11, 7, 9
    - Sorted values: 3, 5, 7, 9, 11
    - Middle element of the sorted list: 7
    - Median = 7
- If the number of observations is even, the median is the average of the two middle values.
  - Ex:
    - Values: 2, 4, 6, 8, 1, 3
    - Sorted values: 1, 2, 3, 4, 6, 8
    - Middle elements of the sorted list: 3 and 4
    - The median is the average of the two central values

\[Median = \frac{3+4}{2} = 3.5\]

For the grades, the median is calculated as follows:

Values: 85, 92, 78, 88, 95, 90, 82, 87, 91, 89
Sorted values: 78, 82, 85, 87, 88, 89, 90, 91, 92, 95
Center values: 88 and 89

\[Median = \frac{88+89}{2} = 88.5\]

To calculate the median in Python, you can use the following code:

import numpy as np
grades = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89]
median_grade = np.median(grades)
print("The median of the given data set is:", median_grade)
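
The odd and even rules described earlier can also be written out by hand. The function below is a minimal sketch of that procedure, not a replacement for np.median; the name median_by_sorting is hypothetical, and the function assumes a non-empty list of numbers.

```python
def median_by_sorting(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[middle]
    # Even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_sorting([5, 3, 11, 7, 9]))       # odd-length example
print(median_by_sorting([2, 4, 6, 8, 1, 3]))     # even-length example
print(median_by_sorting([85, 92, 78, 88, 95, 90, 82, 87, 91, 89]))  # grades
```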

Mode

Mode: The value that appears most frequently in a dataset. It represents the most common observation.
A dataset can have:
- One mode (unimodal) – If only one value appears most frequently.
- Two modes (bimodal) – If two values appear with the same highest frequency.
- Multiple modes (multimodal) – If more than two values have the highest frequency.
- No mode – If all values appear with the same frequency.

The grades have no mode because every value appears only once.

To illustrate the mode, consider this other set of grades:
- 85, 92, 78, 88, 95, 90, 82, 87, 91, 89, 92, 92, 95
  - We can sort the data to make the most frequent value easier to find:
  - 78, 82, 85, 87, 88, 89, 90, 91, 92, 92, 92, 95, 95
- The mode of this dataset is 92, as it appears most frequently (3 times).

To calculate the mode in Python, you can use the following code:

import numpy as np

data = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89, 92, 92, 95]
values, counts = np.unique(data, return_counts=True)
mode_value = values[np.argmax(counts)]
print("Mode:", mode_value)
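
Note that np.argmax returns only the first maximum, so the snippet above reports a single value even when several values tie. The sketch below is one possible way to cover the bimodal, multimodal, and "no mode" cases listed earlier. The function name all_modes is hypothetical, it assumes a non-empty list, and it follows this post's convention that a dataset has no mode when all of its distinct values appear equally often.

```python
from collections import Counter


def all_modes(values):
    """Return every value sharing the highest frequency, or [] if there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # Convention used in this post: no mode when all distinct values are equally frequent
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


grades = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89]
data = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89, 92, 92, 95]

print("Modes of grades:", all_modes(grades))       # every grade appears once
print("Modes of data:", all_modes(data))           # 92 appears three times
print("Modes of a tie:", all_modes([1, 1, 2, 2, 3]))  # 1 and 2 both appear twice
```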

Measures of Dispersion

Measures of dispersion describe how spread out or scattered the data points are in a dataset. These measures help to understand the variability and consistency of the data.

Variance and Standard deviation of a Population

Standard deviation and variance measure essentially the same thing: the dispersion of data relative to the mean.

Variance

Variance: Quantifies how far the values of a dataset are from the mean. A higher variance means the data points are more spread out, while a lower variance indicates that they are closer to the mean.

The variance of a population can be calculated by following these steps:

We start by calculating the differences between each value and the mean

\[ x_i - \mu \]

We square the results to eliminate cancellation effects and give more weight to extreme values

\[ (x_i - \mu)^2 \]

We take the average of the squared differences to obtain a single measure of dispersion. This results in the following variance formula, where N is the number of values in the population:

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]

Let’s calculate the variance of the grades 85, 92, 78, 88, 95, 90, 82, 87, 91, 89:

\[ \sigma^2 = \frac{(85 - 87.7)^2+(92 - 87.7)^2+(78 - 87.7)^2+(88 - 87.7)^2+(95 - 87.7)^2+(90 - 87.7)^2+(82 - 87.7)^2+(87 - 87.7)^2+(91 - 87.7)^2+(89 - 87.7)^2}{10} \]

\[ \sigma^2 = 22.41\]
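
The three steps can be reproduced one at a time with NumPy arrays. This is a minimal sketch for checking the arithmetic by hand; the intermediate names (deviations, squared) are illustrative choices.

```python
import numpy as np

grades = np.array([85, 92, 78, 88, 95, 90, 82, 87, 91, 89])

mu = grades.mean()          # population mean of the grades
deviations = grades - mu    # step 1: x_i - mu
squared = deviations ** 2   # step 2: (x_i - mu)^2
variance = squared.mean()   # step 3: average of the squared differences

print(f"Mean: {mu:.1f}")
print(f"Variance: {variance:.2f}")
```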

Standard deviation

Standard deviation is the square root of variance, bringing dispersion back to the same unit as the original data, making it more intuitive:

\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]

Following the logic, the standard deviation of the grades will be:

\[ \sigma = \sqrt{22.41} = 4.73 \]

In Python, we can use the following code to calculate Standard deviation and Variance:

import numpy as np

# Grades of the whole class, treated as a population
grades = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89]

# Compute variance and standard deviation using NumPy
variance = np.var(grades)  # Population variance
std_dev = np.std(grades)   # Population standard deviation

# Display results
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
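
By default, np.var and np.std divide by N, which matches the population formulas used here. If the ten grades were instead treated as a sample drawn from a larger population, a common convention is to divide by N - 1. The sketch below only shows how NumPy's ddof parameter switches between the two; treating the grades as a sample is an assumption made for illustration.

```python
import numpy as np

grades = [85, 92, 78, 88, 95, 90, 82, 87, 91, 89]

# ddof=0 (the default) divides by N: population variance
population_variance = np.var(grades, ddof=0)

# ddof=1 divides by N - 1: the usual sample variance
sample_variance = np.var(grades, ddof=1)

print(f"Population variance: {population_variance:.2f}")
print(f"Sample variance: {sample_variance:.2f}")
```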
