Basic concepts of statistics

Statistics is the science that examines, processes, and evaluates data.

Population

The goal of statistics is to conduct an experiment to find out something interesting about a given population. By population we generally mean any set of elements that we want to study at the moment. If we want to find out the average age of the population of the Czech Republic, our population will be all the inhabitants of the Czech Republic.

But if we want to find the average gasoline consumption of cars per hundred kilometers, our population will be the set of all cars (in a given area).

Selection, sample

It is often not possible to work with all elements of the population. Let us imagine that we want to find out what people in the Czech Republic think about a compulsory school-leaving exam in mathematics. To really find out, we would have to go house to house, bridge to bridge, and ask every citizen for their opinion. That is not possible in practice. Some reasons:

It is too expensive. Asking all the approximately ten and a half million inhabitants is not a cheap affair. For example, the first direct election of the President cost 625 million crowns.
It takes too long. The election certainly took several months to prepare - if you need a statistical result within a week, that is too slow.
Not everyone will want to answer. Some people will refuse to answer your questions on principle. If the population consists of machines, they may break down. If you were tracking the mileage of cars, the odometer might be broken, or someone might have deliberately tampered with it.
The experiment may be too dangerous. Nobody is likely to have a heart attack from being asked about a compulsory school-leaving exam, but take another example - testing a new drug called "all-over". What would happen if we tested "all-over" on the entire population of the Czech Republic and it turned out during testing that 20% of the people tested immediately got violent diarrhoea? It is probably better to test the drug on a smaller group of people first.

To avoid these disadvantages, we select only a sample (also called a selection) from the population. If we have a population P, then a sample V is any subset of P, i.e. V ⊆ P. We then run our experiment only on this sample V and generalize the results to the whole population. Of course, these results will be imprecise - how imprecise depends mainly on how large the sample V is and on the method we used to select its elements.

Typical errors may thus be:

Too few elements in V. If you ask only the first seven people you meet about the compulsory school-leaving exam in mathematics, you cannot get meaningful results.
Unrepresentative selection of elements from the population. If you ask a thousand graduates of the Mathematics and Physics Department about the compulsory maths exam, you will get different answers than if you ask a thousand third-year high school students.
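
Both errors can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the population is a list of made-up ages (not real data), and the names population, sample_mean, small_spread, large_spread and biased_mean are invented for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population P: 20,000 made-up ages, not real census data.
population = [rng.randint(0, 90) for _ in range(20_000)]
true_mean = statistics.mean(population)

def sample_mean(sample_size):
    """Mean age in a simple random sample V of the given size (V is a subset of P)."""
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample)

# Error 1: too few elements in V.
# Repeat the experiment 100 times and see how much the estimates scatter.
small_estimates = [sample_mean(7) for _ in range(100)]
large_estimates = [sample_mean(1000) for _ in range(100)]
small_spread = statistics.pstdev(small_estimates)
large_spread = statistics.pstdev(large_estimates)

# Error 2: unrepresentative selection.
# Sampling only people aged 18 or younger gives a systematically wrong answer,
# no matter how many of them we ask.
young_only = [age for age in population if age <= 18]
biased_mean = statistics.mean(rng.sample(young_only, 1000))

print(f"true mean age:               {true_mean:.2f}")
print(f"spread of estimates, n=7:    {small_spread:.2f}")
print(f"spread of estimates, n=1000: {large_spread:.2f}")
print(f"mean of biased sample:       {biased_mean:.2f}")
```

In this sketch the estimates from samples of seven people scatter far more widely than those from samples of a thousand, while the biased sample misses the true mean even though it is large.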

Variables

During the experiment, we examine the elements of the sample. The properties we observe are called variables, and the values a variable can take are called variants. There are two basic types of variables:

Qualitative variable: this variable typically cannot be measured; it is some sort of verbal description. A typical example would be a question about nationality. Variants of such a variable would be, for example, the values "Czech nationality", "Slovak nationality", etc. It does not make sense to measure or compare Czech and Slovak nationality. We can compare the numbers of Czechs and Slovaks, but we cannot compare the nationalities themselves.

The question about the compulsory school-leaving exam in mathematics also falls into this category. The expected answers are "yes, I want a compulsory school-leaving exam in mathematics" or "no, I do not want a compulsory school-leaving exam in mathematics", and these are the variants of this variable. Again, we can compare the numbers of responses, but it is not meaningful to compare the actual "yes" and "no".
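
The idea that variants of a qualitative variable can be counted, but not measured, can be sketched in a few lines of Python. The list of answers below is made up purely for illustration.

```python
from collections import Counter

# Hypothetical survey answers; the data is invented for this example.
answers = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]

# Count how many times each variant occurs.
counts = Counter(answers)
most_common_variant, most_common_count = counts.most_common(1)[0]

# The counts are numbers, so they can be compared: 5 "yes" versus 3 "no".
print(counts["yes"], counts["no"])
print(most_common_variant)

# The variants themselves are only labels. Python will happily evaluate
# "yes" > "no", but that is just alphabetical order and carries no
# statistical meaning.
```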

Quantitative variable: this variable can be measured. Examples are lengths, weights, times, counts and so on. We further divide quantitative variables into discrete and continuous variables:

Discrete variable

A discrete variable has a finite or countably infinite number of variants (see below). Quite often the variants are integers. Take, for example, the number of pupils in a classroom - in a normal classroom there will be, say, between fifteen and forty children.

A discrete variable is characterised by the fact that we are always able to tell what the next and previous variants are. If there are 28 children in class 3B, the previous variant is 27 children and the next variant is 29 children. For a qualitative variable, we are usually not able to do this - what is the next variant after Czech nationality?

A discrete variable can have infinitely many variants, but they must be countable - that is, we still need to be able to determine the previous and next variant. For example, we could introduce the variable "distance between two objects to the nearest kilometer". If we measure that the distance between two objects, for example a car and a barn, is 12 kilometres, then the next and previous variants are 13 and 11 kilometres respectively. Yet the distance is not bounded in any way. If we have two objects that are 1,500,000 kilometres apart, surely we can find objects that are 1,500,001 kilometres apart.

The variable would remain discrete even if we changed the precision to tenths of a kilometer (i.e., hundreds of meters). Then we could measure a distance of 15.7 km and the next and previous values would be 15.8 and 15.6.

If a particular variant has no previous or next variant, this does not contradict the fact that the variable is discrete. For example, for a distance of zero kilometers, there is no previous variant - we do not define a distance of minus one kilometer. Yet distance to the nearest kilometer is a discrete variable.
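
The previous and next variants of such a distance variable can be computed mechanically. The sketch below assumes a hypothetical helper named neighbours; it uses Decimal so that steps such as 0.1 km stay exact.

```python
from decimal import Decimal

def neighbours(value, step):
    """Return (previous, next) variants of a distance measured in multiples of `step`.

    The previous variant is None when it would be negative,
    because a negative distance is not defined.
    """
    value, step = Decimal(value), Decimal(step)
    previous = value - step
    if previous < 0:
        previous = None
    return previous, value + step

print(neighbours("12", "1"))      # whole kilometres: 11 and 13
print(neighbours("15.7", "0.1"))  # tenths of a kilometre: 15.6 and 15.8
print(neighbours("0", "1"))       # zero has no previous variant, only the next one
```

Changing the step changes the precision, but for every step there is still a well-defined previous and next variant, which is what makes the variable discrete.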

Continuous variable

A continuous variable always has an infinite number of variants. The values are typically real numbers, so an example is distance (without any stated precision). For a continuous variable, we cannot determine the previous or next variant. If we measure that a distance is 3.58745 meters, we cannot find the number that comes immediately after it.

The set of real numbers contains irrational numbers, whose decimal expansions are infinite and non-repeating. Of course, we do not have instruments that can measure a distance with such precision, so in reality every such variable is actually discrete - precisely because every instrument has a limited precision. If you measure something with a ruler, your precision is one millimeter. You can measure that a book is 167 mm or 168 mm wide, but nothing in between, unless you estimate by eye.

If you have a more scientific instrument, you can be accurate to one micrometer. Even so, it's probably not enough to measure an object completely accurately.

Despite all this, we commonly talk about distance or mass as continuous variables. In practice, such simplification is necessary and usually doesn't matter.
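
Both observations, that a continuous variable has no next variant and that an instrument turns it into a discrete one, can be sketched as follows. The helper names midpoint and measure are invented for this example, and the "true" widths of the book are made-up values.

```python
from fractions import Fraction

def midpoint(a, b):
    """Return the value exactly halfway between a and b (exact arithmetic)."""
    return (Fraction(a) + Fraction(b)) / 2

def measure(true_value_mm, resolution_mm):
    """Round a length to the nearest multiple of the instrument's resolution."""
    resolution = Fraction(resolution_mm)
    steps = round(Fraction(true_value_mm) / resolution)
    return steps * resolution

# No "next" value: whatever candidate we propose as the value right after
# 3.58745, there is always another value between the two.
a = Fraction("3.58745")
candidate = Fraction("3.58746")
closer = midpoint(a, candidate)
print(a < closer < candidate)

# A ruler with 1 mm resolution reports only whole millimetres.
ruler_reading = measure("167.4187", "1")

# A hypothetical instrument with 1 micrometre (0.001 mm) resolution is finer,
# but its readings are still multiples of a fixed step.
fine_reading = measure("167.4187", "0.001")
print(float(ruler_reading), float(fine_reading))
```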

Random variable

A random variable is a discrete or continuous variable for which we cannot determine its resulting value before performing the experiment. Thus, a random variable can be the result of a roll of a six-sided die. Until we roll this die, we cannot know what number will be rolled on the die.

We may be able to predict that some values are more likely than others. That is fine; we just cannot be absolutely certain that we will get any particular value. For example, if we randomly picked one resident of the Czech Republic and asked them what city they live in, they would be more likely to live in Prague than in Kravaře, simply because more people live in Prague.

If we had a die with six dots on five sides and two dots on the remaining sixth side, we would be much more likely to roll six dots. But the result is still a random variable, because it is not certain that six dots will come up.

If we were to modify this die so that there were six dots on all six sides, the die roll would not be a random variable because we would always get six dots.
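
The three dice described above can be simulated to see the difference. This is a minimal sketch; the names fair_die, loaded_die, constant_die and roll_many are invented here, and the seed is fixed only to make the run reproducible.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed for reproducibility

fair_die = [1, 2, 3, 4, 5, 6]
loaded_die = [6, 6, 6, 6, 6, 2]  # six dots on five sides, two dots on one side
constant_die = [6] * 6           # six dots on every side

def roll_many(sides, n_rolls):
    """Roll a die n_rolls times and count how often each face comes up."""
    return Counter(rng.choice(sides) for _ in range(n_rolls))

fair_counts = roll_many(fair_die, 6000)
loaded_counts = roll_many(loaded_die, 6000)
constant_counts = roll_many(constant_die, 6000)

print("fair:    ", sorted(fair_counts.items()))
print("loaded:  ", sorted(loaded_counts.items()))
print("constant:", sorted(constant_counts.items()))
```

The loaded die shows six dots far more often than two, yet a single roll still cannot be predicted with certainty. The last die always shows six, so its result is known before the roll and is not a random variable.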
