Friedman test - coincidence index

Kapitoly: The Vigenère cipher, How to break the Vigenère cipher with knowledge of the key length, Estimating the key length of the Vigenère cipher, How to calculate the key length of the Vigenère cipher, Friedman test - coincidence index

Another way to break the Vigenère cipher is the Friedman test, which makes it relatively easy to verify that a ciphertext has been encrypted with a key of a chosen length. It is therefore a good idea to combine the Friedman test with the Kasiski test, which gives us a set of probable key lengths, and use the Friedman test to choose the key length that is most likely.

Coincidence index

The coincidence index tells us how likely it is that if we randomly pick two letters from the text, they will be the same. For example, in the text "aaaaaa", we have a 100% probability that the chosen pair will be the same. In the text "aaaabbc", the probability will be 1/3 and the most likely is that we choose two letters "a".

How would we calculate the coincidence index? Given some input text t and values n_a, n_b, …, n_z, which indicate the number of occurrences of the given letter in the text t. So for t = "aaaabbc" we have the values of n_a = 4, n_b = 2, n_c = 1 and the other values are zero.

Next, we count the number of identical pairs of letters. How many different pairs of the letter "a" are there? We can take this pair "aaaabbc" or maybe this "aaaabbc" and a few others. How many are there in total? Since the order doesn't matter, we choose combinations. We have a total of n_a possibilities to choose the letter "a" and we have a total of n_a − 1 possibilities to add another "a" to it. Since the order doesn't matter, we'll add two more to the value. Thus, the number of all different pairs of the letter "a" is equal:

$$ \frac{n_a\cdot(n_a-1)}{2} $$

We can use this formula for all letters, i.e., the number of all possible same pairs of letters is equal to

$$ \frac{n_a\cdot(n_a-1)}{2} + \frac{n_b\cdot(n_b-1)}{2} + \dots + \frac{n_z\cdot(n_z-1)}{2}. $$

For our text we would get:

$$ \frac{4\cdot3}{2}+\frac{2\cdot1}{2}+\frac{1\cdot0}{2}=7 $$

A is the following pairs of the same letters: "aaaabbc", "aaaabbc", "aaaabbc", "aaaabbc", "aaaabbc", "aaaabbc", "aaaabbc".

The previous formula can be broken down using sum (here x is already a variable, not a letter):

$$ \sum_{x=„a“}^{„z“} \frac{n_x\cdot(n_x-1)}{2} $$

To get the probability, divide this value by the number of all possible pairs of letters (same or different) that exist in the text. If we denote the length of the text by N, then the number of all pairs will be

$$ \frac{N \cdot (N - 1)}{2} $$

The reason is again the same: we have a total of N possibilities to select a letter, and we have a total of N − 1 possibilities to add another letter to it to form a pair. Since the order doesn't matter, we divide by two. The total probability, and the cooccurrence index, is thus equal, denote IK:

$$ IK = \frac{\sum_{x=„a“}^{„z“} \frac{n_x\cdot(n_x-1)}{2}}{\frac{N \cdot (N - 1)}{2}} $$

We can truncate the twos in the denominator, so we are left with:

$$ IK = \frac{\sum_{x=„a“}^{„z“} n_x\cdot(n_x-1)}{N \cdot (N - 1)} $$

This is the final formula. If we plug in the values from our text:

$$ IK =\frac{\sum_{x=„a“}^{„z“} n_x\cdot(n_x-1)}{N \cdot (N - 1)}=\frac{4\cdot3+2\cdot1+1\cdot0}{7\cdot6}=\frac{14}{42}=\frac13 $$

The coincidence index of the text "aaaabbc" is equal to one third. That is, we have a one-third probability that if we randomly choose two letters in the text, that those letters will be the same.

Coincidence index of random text

If we have a long random text with a uniform distribution of letters, all letters in the text occur approximately equally often. We can say that for an infinitely long text with a uniform distribution of letters, we will always have a probability of 1/26 of choosing a given letter at random. (We count the English alphabet, which has 26 letters.) What would be the coincidence index of such a text?

In such a text, for each letter, we would get the probability $\frac{1}{26}\cdot\frac{1}{26}$ that we pick the same pair of letters. Since we have a total of 26 different letters, the total probability of randomly picking two identical letters is

$$ \sum_{i=1}^{26} \frac{1}{26}\cdot\frac{1}{26} = \sum_{i=1}^{26} \frac{1}{676} = \frac{26}{676} = \frac{1}{26}. $$

Language coincidence index

Since text written in a common language like Czech is not random text, it has a different cooccurrence index than random text. How to calculate the coincidence index of a language? In short, we collect heaps of Czech texts somewhere and calculate the coincidence index of these texts.

The more texts we collect, the better results we get. Of course, there is no such thing as an "official Czech language coincidence index", because such an index de facto changes with every newly written/speaked word in the Czech language.

To illustrate this, I tried to perform this analysis over five thousand Czech written books I purchased on saveto. As a result, the cooccurrence index of the 5000 Czech books came out to be roughly 0.0588. So we can say that the co-incidence index of the Czech language is approximately 0.0588. Another source gives a very similar value, 0.06027. The same source also gives indices for other languages:

Czech	0.06027
English	0.06689
Danish	0.07073
Finnish	0.07380
French	0.07460
Dutch	0.07981
German	0.07667
Italian	0.07329
Russian	0.05607
Spanish	0.07661

How to break the Vigenère cipher using the Friedman test

We use the co-incidence index to test whether a given length is the length of the original cipher key of the Vigenère cipher. Exact procedure:

We estimate that the key could be between 2 and 15 (why 15? It doesn't matter.) Let's denote: K_o = {2,…,15}.
Using the Cassis method, let's find another set of probable keys, denote it by K_k.
We unify the two sets K = K_o ∪ K_k. This gives us all the probable lengths of the Vigenère cipher. In other words, we will try all lengths between 2 and 15 plus those lengths returned by the Kasiski test.
For each key length, i.e., for each k∈ K: we will divide the ciphertext into k blocks, the first block will always contain the 1st letter of the ciphertext, (1 + k)-that letter, (1 + 2k)-that letter, etc. The second block will always contain the 2nd letter of the ciphertext, (2 + k)-that letter, (2 + 2k)-that letter, etc. We will mark the blocks B_i.
For each block B_i we compute the cooccurrence index IK_i.
We compute the average cooccurrence index of all blocks:

$$\mbox{IK} = \frac{\sum_{i=1}^k \mbox{IK}_i}{k}$$
We find the length of the key that has the corresponding lowest coincidence index. This is probably the length of the original key.
We break the Vigenère cipher using the classical algorithm for breaking the Vigenère cipher with knowledge of the key length.

Online tool to break the Vigenère cipher

The following tool simulates the above procedure.

« Previous: How to calculate the key length of the Vigenère cipher

Next: Blog »