Scatter

The variance tells us how spread out the values are in our statistical ensemble. The variance is sometimes also called the variance.

What is variance

Take a look at the following table, which shows the final report card grades of ten selected eighth grade students. We will be interested in two subjects, mathematics and history.

Name Mathematics History
Thomas 2 4
Martin 1 1
Jiri 2 1
Miroslav 2 1
Jana 2 3
Lenka 1 4
Ondrej 2 5
Lukas 2 1
Petra 2 2
John 3 4

We can see that most of the pupils have a B in mathematics, sometimes someone has an A or a C. While in the history column we have quite a chaos - the grades are scattered from A to F and no one grade even significantly dominates. The grades in history are thus more scattered than the grades in math.

If we calculate the average math grade, it comes out to 1.9. We can see that most of the elements are fairly close to this value. In history, on the other hand, we get an average of 2.6 and most of the grades do not move close to this value.

How would we quantify this variance?

How to calculate the variance

Let's convert the previous graph into a slightly different form and show only the grades for History.

Resulting grades in history

On the x axis we again have pupils (this time without names, so let's assume that 1 = Thomas, 2 = Martin, etc.). On the y axis we have the resulting grades. The line y = 2,6 represents the average value.

We then calculate the variance as the average of the squared distances from the mean. This sounds arcane, but it's simple. We can think of the squared x as the area of a square with side length x. So we create squares in our graph that have side lengths equal to the distance of the value from the mean:

Graph with distances squared

Four such squares are shown in the graph; they would not all fit. We can see that Jana is quite close to the diameter, so her square is small. While Andrew is far from the mean, therefore his square is large. If we were to construct all the squares and calculate the average area of those squares, we would get the variance.

If we have a set of values X = [x1, …, xN], where $\overline{x}$ is the average value, then the variance, let's denote it $\mbox{Var}$, is calculated as follows:

$$ \mbox{Var}(X) = \frac1N \left((x_1-\overline{x})^2 + (x_2-\overline{x})^2 + … + (x_N-\overline{x})^2 \right) $$

Alternatively, using the sum as follows:

$$ \mbox{Var}(X) = \frac1N\sum_{i=1}^N (x_i-\overline{x})^2 $$

Why $(x_1-\overline{x})^2$? The expression $x_1-\overline{x}$ alone would return the distance of the point x1 from the mean. More accurately, we should write $|x_1-\overline{x}|$ (absolute value) if the value of x1 is less than the value of the mean. Since we want to know the area of the square, we multiply this value by the square.

The variance of our data set would thus be:

$$ \begin{align*} Var(\mbox{ History }) = \frac{1}{10}((4-2{,}6)^2+(1-2{,}6)^2+(1-2{,}6)^2+\\(1-2{,}6)^2+(3-2{,}6)^2+(4-2{,}6)^2+\\(5-2{,}6)^2+(1-2{,}6)^2+(2-2{,}6)^2+\\(4-2{,}6)^2)=\frac{1}{10}\cdot 22{,}4 = 2{,}24 \end{align*} $$

The variance (variance) is 2.26. The variance of the values with the math results would look like this:

$$ \begin{align*} Var(\mbox{ Mathematics }) = \frac{1}{10}((2-1{,}9)^2+(1-1{,}9)^2+(2-1{,}9)^2+\\(2-1{,}9)^2+(2-1{,}9)^2+(1-1{,}9)^2+\\(2-1{,}9)^2+(2-1{,}9)^2+(2-1{,}9)^2+\\(3-1{,}9)^2)=\frac{1}{10}\cdot2{,}9 = 0{,}29 \end{align*} $$

The variance (variance) is 0.29. We can see that the variance for this set is, as expected, much smaller.

How to calculate the variance in Excel

In both Czech and English Excel, the var function, or some variant of it, is used to do this. For example, var.p takes a selection of cells as a parameter.

References and resources