Basics of statistics

Statistics is a science that tries to examine real data and uses probability theory to describe that data.

The actual statistics section

This is an old and outdated article. Statistics now has its own section, which starts with the introductory article Basic concepts of statistics.

Basic concepts

There are several basic concepts that are used in statistics, which we will describe here.

The first concept is the statistical set, a finite set of data that we want to study. The data can be basically anything. If you want to study the average salary in the Czech Republic, the statistical set will be the set of all people in the Czech Republic. The number of elements in the statistical set is called the size of the set. So the size of the statistical set we just defined would be equal to the population of the Czech Republic.

There is also the concept of a statistical unit, which is a specific element of the statistical set. In our case, the statistical unit would be one specific person.

Finally, we have the statistical characteristic, which is what we want to measure. In our example, the statistical characteristic would be salary. A statistical characteristic can be either qualitative or quantitative. A quantitative characteristic is one that is expressible in numbers (for example salary, height, number of children, ...), while a qualitative characteristic is one that is expressible in words (color, yes/no, occupation, ...).

Frequencies

Frequencies can be either absolute or relative. They indicate how often a given value of a characteristic occurs in the statistical set, either as a plain count or relative to the total number of elements in the set.

Note that frequencies are computed for values of the statistical characteristic, not for the statistical units themselves. The absolute frequency of a unit could only ever be zero or one, because the statistical set is a set and a set cannot contain the same element more than once.

So the absolute frequency of the value z of a statistical characteristic is the number of times z occurs among the elements of the statistical set S. Example: take a class of ten students. Each student received a grade in mathematics on their report card, from one to five. The grades are recorded in the following table:

$$\begin{array}{l|c|c|c|c|c|c|c|c|c|c} \mbox{Student}&1&2&3&4&5&6&7&8&9&10\\ \hline \mbox{Grade}&2&5&3&2&1&1&2&4&1&3 \end{array}$$

Note: the statistical set for this example would be the ten classmates, something like

$$S=\left\{\mbox{ Ondra }, \mbox{ Veronica }, \mbox{ Martin }, \ldots\right\}$$

In the table, for simplicity, we have a numerical identifier for each student in the first row, so the statistical units, i.e., the elements of the statistical set, are in the first row. In the second row we have the values of the statistical characteristic, i.e. the "final math grade" of that student.

Thus, the absolute frequency of the value z = 3 of the characteristic (the grade on the report card) is two, because exactly two students received a three (students 3 and 10). The absolute frequency of the value z = 1 is three (students 5, 6 and 9).

The relative frequency indicates what proportion of the values of the characteristic in the statistical set are equal to z. We calculate the relative frequency of the value z as follows:

$$r=\frac{z_a}{|S|},$$

where z_a is the absolute frequency of the value z and |S| is the size of the statistical set, i.e., the number of its elements. Thus, the relative frequency of grade three would be:

$$r_3=\frac{2}{10}=\frac15.$$

The size of our statistical set is ten, because we have ten students in the class. Multiplying by 100 gives the percentage, here 20%. The relative frequency of grade one would be

$$r_1=\frac{3}{10}.$$
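The frequency calculations above can be sketched in Python. This is only an illustration, not part of the original article: the function names `absolute_frequencies` and `relative_frequencies` are made up for this example, and exact fractions are used so the results match the formulas digit for digit.

```python
from collections import Counter
from fractions import Fraction

# Final math grades of students 1..10, copied from the table above.
grades = [2, 5, 3, 2, 1, 1, 2, 4, 1, 3]


def absolute_frequencies(values):
    """Count how many times each value of the characteristic occurs."""
    return Counter(values)


def relative_frequencies(values):
    """Divide each absolute frequency by the size of the statistical set."""
    n = len(values)
    return {value: Fraction(count, n) for value, count in Counter(values).items()}


abs_freq = absolute_frequencies(grades)
rel_freq = relative_frequencies(grades)

print(abs_freq[3], rel_freq[3])  # 2 1/5
print(abs_freq[1], rel_freq[1])  # 3 3/10
```

The relative frequencies of all values always add up to one, which is a handy sanity check.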

The arithmetic mean

The arithmetic mean, often just called the mean or the average, is the average of all the values in a statistical set. By "value" we mean the value of the statistical characteristic for each unit. The mean is calculated by adding up all the values and dividing by their number:

$$p_a=\frac{x_1+x_2+x_3+\ldots+x_n}{n}=\frac1n\sum_{i=1}^nx_i$$

I've also added the expression using the summation sign, in case you find it more readable; the fraction on its own is enough. The values x_1, ..., x_n are all the values in our data set.

Example: we take the data from the previous table and calculate the average grade per student.

$$p_a=\frac{2+5+3+2+1+1+2+4+1+3}{10}=\frac{24}{10}=2.4$$

In our class, the average grade is 2.4. As you can see, the arithmetic mean can produce a value that is not among the possible values at all: nobody can receive a grade of 2.4.

The arithmetic mean is also a poor choice when a few values differ fundamentally from the rest of the data. For the data set 1, 3, 2, 5, 4, 2, 75, the arithmetic mean comes out as

$$p_a=\frac{1+3+2+5+4+2+75}{7}=\frac{92}{7}\approx 13.14.$$

We can see that the resulting value is far from every value in the data set. It is several times larger than each of the six small numbers and several times smaller than the largest value. This is a problem that can be solved by, for example, the median, see below. At least now you know why two thirds of people don't make the average salary: a small group of people with much higher salaries pulls the arithmetic mean up.
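Both mean calculations above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `arithmetic_mean` is chosen for this example only.

```python
def arithmetic_mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


# Grades from the class example and the data set with one extreme value.
grades = [2, 5, 3, 2, 1, 1, 2, 4, 1, 3]
with_outlier = [1, 3, 2, 5, 4, 2, 75]

print(arithmetic_mean(grades))                  # 2.4
print(round(arithmetic_mean(with_outlier), 2))  # 13.14
```

Removing the single value 75 from the second list drops the mean to about 2.83, which shows how strongly one extreme value can shift the result.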

The geometric mean

The geometric mean is calculated in a similar way to the arithmetic mean, except that multiplication is used instead of addition and the n-th root is used instead of division by n. So we calculate the geometric mean as follows:

$$p_g=\sqrt[n]{x_1\cdot x_2\cdot x_3\cdot\ldots\cdot x_n}=\sqrt[n]{\prod_{i=1}^n x_i}$$

The geometric mean can be used as an indicator of growth. As an example, suppose the price of a product increased by 10% in one year, by 15% the next year, and by 5% the year after that. Thus, the original price c rose to

$$1.1\cdot1.15\cdot1.05\,c=1.32825\,c.$$

The geometric mean of these growth factors would be:

$$p_g=\sqrt[3]{1.1\cdot1.15\cdot1.05}\approx 1.0992419$$

What does this mean? If the price had increased by the same factor of 1.0992419 (that is, by about 9.92%) each year, the final price would be the same:

$$1.0992419^3\approx 1.32825$$
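The growth example can be checked numerically. The sketch below assumes Python 3.8 or newer (for `math.prod`); the function name `geometric_mean` is chosen for this example, and floating-point results agree with the values in the text only up to rounding.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


# Yearly growth factors for increases of 10%, 15% and 5%.
factors = [1.10, 1.15, 1.05]

g = geometric_mean(factors)
total_growth = math.prod(factors)

print(g)             # average yearly growth factor
print(g ** 3)        # three years of average growth
print(total_growth)  # the actual overall growth factor
```

The last two printed numbers should coincide up to rounding error, which is exactly the property that makes the geometric mean the right "average" for growth rates.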

Mode and median

The mode of a characteristic is the value with the highest frequency, denoted by Mod(x). If we return to the example of grades, there are two modes, the values 1 and 2, because they occur most often: both have an absolute frequency of 3.

The median is then the middle value, denoted by Med(x). If we can arrange the values in a non-decreasing sequence

$$x_1\le x_2\le x_3\le\ldots\le x_n,$$

then the median is the value in the middle of that sequence. The formula depends on whether the sequence has an odd or an even number of elements. If odd, the median is the element at position (n+1)/2:

$$\mbox{Med}(x)=x_{\frac{n+1}{2}}$$

If the sequence has an even number of elements, then it has no element that is exactly in the middle (example: the sequence 1, 2, 3, 4 simply does not have a middle element). Therefore, we take the average of the two middle values (here the average of 2 and 3, which is 2.5). So the formula for an even number of elements is:

$$\mbox{Med}(x)=\frac{x_{n/2}+x_{(n+2)/2}}{2}$$

Let's go back to the example that showed where the arithmetic mean is misleading. We had the values 1, 3, 2, 5, 4, 2, 75. The mode is two, as it is the only value that repeats. To calculate the median, we sort the values into a sequence:

$$(a_i)=(1, 2, 2, 3, 4, 5, 75).$$

The sequence has seven elements, so the middle element is a_4, which is equal to three.
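The mode and median rules above can be written down directly in Python. This is a sketch for illustration: the names `modes` and `median` are chosen here, and `modes` returns a list because, as in the grades example, several values can share the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that has the highest absolute frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: 0-based index n // 2 is the 1-based position (n + 1) / 2.
        return ordered[middle]
    # Even n: average the elements at 1-based positions n / 2 and n / 2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2


grades = [2, 5, 3, 2, 1, 1, 2, 4, 1, 3]
s = [1, 3, 2, 5, 4, 2, 75]

print(modes(grades))         # [1, 2]
print(modes(s), median(s))   # [2] 3
print(median([1, 2, 3, 4]))  # 2.5
```

For the data set with the extreme value 75, the median of 3 describes the typical value far better than the arithmetic mean of about 13.14.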