Frequency analysis


Frequency analysis is a general technique that can be used to break several different kinds of ciphers. We go through the ciphertext and look for statistical similarities with the language in which the message is written.

The frequency of letters in the ciphertext

One of the basic tools of frequency analysis is the frequency of characters in a given language. If we analyse enough Czech text, we can compute various statistics, including how often each letter occurs.

Since I bought about five thousand Czech-language books the other day, I did such an analysis on them. There were a total of 1,404,443,151 letters in those five thousand books. The following table shows the distribution of letters (after removing diacritics):

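| Letter | Number of occurrences | Letter | Number of occurrences |
|---|---|---|---|
| a | 134 675 829 | n | 83 104 322 |
| b | 24 944 593 | o | 112 776 769 |
| c | 42 120 335 | p | 43 747 863 |
| d | 53 015 496 | q | 83 322 |
| e | 153 141 622 | r | 61 750 942 |
| f | 2 458 624 | s | 78 451 777 |
| g | 3 087 128 | t | 75 633 324 |
| h | 35 075 708 | u | 50 265 458 |
| i | 93 903 002 | v | 55 510 103 |
| j | 32 383 080 | w | 762 129 |
| k | 49 549 907 | x | 504 334 |
| l | 80 345 129 | y | 40 132 126 |
| m | 50 636 489 | z | 46 383 740 |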
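A count like the one in the table can be sketched in a few lines of Python. This is only a minimal illustration, not the program used for the five thousand books: the helper names `strip_diacritics` and `count_letters` are made up for this example, and it assumes that "removing diacritics" means dropping Unicode combining marks after NFD normalization.

```python
import string
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Remove accents, e.g. turn 'č' into 'c' (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_letters(text: str) -> Counter:
    """Count the letters a-z in text, ignoring case and diacritics."""
    cleaned = strip_diacritics(text).lower()
    return Counter(ch for ch in cleaned if ch in string.ascii_lowercase)


counts = count_letters("Příliš žluťoučký kůň")
print(counts["u"])  # 3
```

For a whole library of books, the same function could be applied file by file and the resulting counters added together.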

The following table then shows the relative frequency of letters:

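| Letter | Relative frequency | Letter | Relative frequency |
|---|---|---|---|
| a | 0.09589269 | n | 0.05917244 |
| b | 0.0177612 | o | 0.08029999 |
| c | 0.02999077 | p | 0.03114961 |
| d | 0.03774841 | q | 0.00005933 |
| e | 0.10904081 | r | 0.04396827 |
| f | 0.0017506 | s | 0.0558597 |
| g | 0.00219812 | t | 0.05385289 |
| h | 0.02497482 | u | 0.03579031 |
| i | 0.06686138 | v | 0.03952464 |
| j | 0.02305759 | w | 0.00054266 |
| k | 0.03528082 | x | 0.0003591 |
| l | 0.05720782 | y | 0.02857512 |
| m | 0.0360545 | z | 0.03302643 |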

Thus, the table says that approximately 9.59% of the letters in all the books were the letter "a". From this we can deduce that approximately 9.59% of the letters in a normal Czech text would be "a".

Every such survey will give slightly different values, but that is fine. Exact values cannot be determined anyway, and approximate ones are all we need.
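The relative frequencies are simply each count divided by the total number of letters. A minimal sketch, assuming the counts are held in a dictionary (the function name `relative_frequencies` is chosen here for illustration only):

```python
def relative_frequencies(counts: dict[str, int]) -> dict[str, float]:
    """Divide every letter count by the total number of letters."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no letters were counted")
    return {letter: n / total for letter, n in counts.items()}


# Figures taken from the tables above: occurrences of "a" and the grand total.
A_COUNT = 134_675_829
TOTAL_LETTERS = 1_404_443_151
print(f"{A_COUNT / TOTAL_LETTERS:.2%}")  # prints 9.59%
```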

The deviation of the frequency of letters from the frequency of letters in the language

If we know the expected relative frequencies of each letter in a given language, we can calculate how far a text deviates from the expected values. As an example, consider two texts, "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu" and "proc neni parez stromem protoze obsahuje kruznice" (roughly "why is a stump not a tree, because it contains circles"). At first glance we see that the first text is gibberish, while the second is a Czech sentence. Let's calculate the deviation of each; the first text should come out substantially higher than the second.

First, let's calculate the occurrences of letters in each text:

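| Letter | Number of letters (first text) | Number of letters (second text) | Relative frequency (first text) | Relative frequency (second text) | Relative frequency (Czech language) |
|---|---|---|---|---|---|
| a | 1 | 2 | 0.02326 | 0.04651 | 0.09589 |
| b | 0 | 1 | 0 | 0.02326 | 0.01776 |
| c | 2 | 2 | 0.04651 | 0.04651 | 0.02999 |
| d | 3 | 0 | 0.06977 | 0 | 0.03775 |
| e | 5 | 6 | 0.11628 | 0.13953 | 0.10904 |
| f | 3 | 0 | 0.06977 | 0 | 0.00175 |
| g | 0 | 0 | 0 | 0 | 0.0022 |
| h | 5 | 1 | 0.11628 | 0.02326 | 0.02497 |
| i | 2 | 2 | 0.04651 | 0.04651 | 0.06686 |
| j | 2 | 1 | 0.04651 | 0.02326 | 0.02306 |
| k | 2 | 1 | 0.04651 | 0.02326 | 0.03528 |
| l | 0 | 0 | 0 | 0 | 0.05721 |
| m | 0 | 2 | 0 | 0.04651 | 0.03605 |
| n | 0 | 3 | 0 | 0.06977 | 0.05917 |
| o | 0 | 5 | 0 | 0.11628 | 0.0803 |
| p | 3 | 3 | 0.06977 | 0.06977 | 0.03115 |
| q | 2 | 0 | 0.04651 | 0 | 0.00006 |
| r | 1 | 5 | 0.02326 | 0.11628 | 0.04397 |
| s | 2 | 2 | 0.04651 | 0.04651 | 0.05586 |
| t | 0 | 2 | 0 | 0.04651 | 0.05385 |
| u | 6 | 2 | 0.13953 | 0.04651 | 0.03579 |
| v | 0 | 0 | 0 | 0 | 0.03952 |
| w | 0 | 0 | 0 | 0 | 0.00054 |
| x | 1 | 0 | 0.02326 | 0 | 0.00036 |
| y | 2 | 0 | 0.04651 | 0 | 0.02858 |
| z | 1 | 3 | 0.02326 | 0.06977 | 0.03303 |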
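The counts and relative frequencies of the two texts can be reproduced with a short Python sketch. The function name `letter_table` is hypothetical, and the second string is written as the unaccented Czech sentence whose letter counts match the table (43 letters in each text), which is an assumption about the intended example.

```python
import string
from collections import Counter

FIRST_TEXT = "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu"
# Assumption: the Czech sentence whose counts match the table above.
SECOND_TEXT = "proc neni parez stromem protoze obsahuje kruznice"


def letter_table(text: str) -> dict[str, tuple[int, float]]:
    """Map each letter a-z to (count, relative frequency) in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return {ch: (counts[ch], counts[ch] / total) for ch in string.ascii_lowercase}


first = letter_table(FIRST_TEXT)
count_u, freq_u = first["u"]
print(count_u, round(freq_u, 5))  # 6 0.13953
```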

Now we calculate the deviations. For each letter, we subtract its relative frequency in the language from its relative frequency in the text and square the result. That is, for the letter "a" and the first text, we would get the deviation as

$$ (0.0232558 - 0.0958927)^2=0.0052761 $$

and similarly for all the other letters.
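Written generally, the same step looks as follows. The symbols are introduced here only for illustration: $f_{\text{text}}(\ell)$ is the relative frequency of the letter $\ell$ in the examined text, $f_{\text{lang}}(\ell)$ is its relative frequency in the language, and $d_\ell$ is the resulting deviation for that letter.

$$ d_\ell = \left( f_{\text{text}}(\ell) - f_{\text{lang}}(\ell) \right)^2 $$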

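| Letter | Relative frequency (1st text) | Relative frequency (2nd text) | Relative frequency (Czech language) | Deviation (1st text) | Deviation (2nd text) |
|---|---|---|---|---|---|
| a | 0.02326 | 0.04651 | 0.09589 | 0.0052761 | 0.0024385 |
| b | 0 | 0.0232558 | 0.0177612 | 0.0003155 | 0.0000302 |
| c | 0.0465116 | 0.0465116 | 0.0299908 | 0.0002729 | 0.0002729 |
| d | 0.0697674 | 0 | 0.0377484 | 0.0010252 | 0.0014249 |
| e | 0.1162791 | 0.1395349 | 0.1090408 | 0.0000524 | 0.0009299 |
| f | 0.0697674 | 0 | 0.0017506 | 0.0046263 | 0.0000031 |
| g | 0 | 0 | 0.0021981 | 0.0000048 | 0.0000048 |
| h | 0.1162791 | 0.0232558 | 0.0249748 | 0.0083365 | 0.000003 |
| i | 0.0465116 | 0.0465116 | 0.0668614 | 0.0004141 | 0.0004141 |
| j | 0.0465116 | 0.0232558 | 0.0230576 | 0.0005501 | 0 |
| k | 0.0465116 | 0.0232558 | 0.0352808 | 0.0001261 | 0.0001446 |
| l | 0 | 0 | 0.0572078 | 0.0032727 | 0.0032727 |
| m | 0 | 0.0465116 | 0.0360545 | 0.0012999 | 0.0001094 |
| n | 0 | 0.0697674 | 0.0591724 | 0.0035014 | 0.0001123 |
| o | 0 | 0.1162791 | 0.0803 | 0.0064481 | 0.0012945 |
| p | 0.0697674 | 0.0697674 | 0.0311496 | 0.0014913 | 0.0014913 |
| q | 0.0465116 | 0 | 0.0000593 | 0.0021578 | 0 |
| r | 0.0232558 | 0.1162791 | 0.0439683 | 0.000429 | 0.0052289 |
| s | 0.0465116 | 0.0465116 | 0.0558597 | 0.0000874 | 0.0000874 |
| t | 0 | 0.0465116 | 0.0538529 | 0.0029001 | 0.0000539 |
| u | 0.1395349 | 0.0465116 | 0.0357903 | 0.0107629 | 0.0001149 |
| v | 0 | 0 | 0.0395246 | 0.0015622 | 0.0015622 |
| w | 0 | 0 | 0.0005427 | 3e-7 | 3e-7 |
| x | 0.0232558 | 0 | 0.0003591 | 0.0005243 | 1e-7 |
| y | 0.0465116 | 0 | 0.0285751 | 0.0003217 | 0.0008165 |
| z | 0.0232558 | 0.0697674 | 0.0330264 | 0.0000955 | 0.0013499 |
| Sum | | | | 0.0558546 | 0.0211603 |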

Finally, we sum all the deviations. Strictly speaking, we should also divide this sum by the number of letters and then take the square root to obtain an actual deviation. But these operations do not change the fact that the second text has a significantly lower deviation, so it is much more likely to be a Czech text.
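The whole calculation can be sketched in Python as follows. The function names are hypothetical, the reference frequencies are copied from the relative-frequency table above, and the second string is the unaccented Czech sentence whose letter counts match the tables. Dividing by 26 (the size of the alphabet) before taking the square root is one reading of "the number of letters" and is an assumption here.

```python
import math
import string
from collections import Counter

# Relative letter frequencies of Czech, copied from the table above.
CZECH_FREQ = {
    "a": 0.09589269, "b": 0.0177612, "c": 0.02999077, "d": 0.03774841,
    "e": 0.10904081, "f": 0.0017506, "g": 0.00219812, "h": 0.02497482,
    "i": 0.06686138, "j": 0.02305759, "k": 0.03528082, "l": 0.05720782,
    "m": 0.0360545, "n": 0.05917244, "o": 0.08029999, "p": 0.03114961,
    "q": 0.00005933, "r": 0.04396827, "s": 0.0558597, "t": 0.05385289,
    "u": 0.03579031, "v": 0.03952464, "w": 0.00054266, "x": 0.0003591,
    "y": 0.02857512, "z": 0.03302643,
}

FIRST_TEXT = "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu"
# Assumption: the Czech sentence whose counts match the tables above.
SECOND_TEXT = "proc neni parez stromem protoze obsahuje kruznice"


def text_frequencies(text: str) -> dict[str, float]:
    """Relative frequency of each letter a-z in text."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    return {ch: counts[ch] / len(letters) for ch in string.ascii_lowercase}


def squared_deviation(text: str, reference: dict[str, float] = CZECH_FREQ) -> float:
    """Sum of squared differences between text and reference frequencies."""
    freqs = text_frequencies(text)
    return sum((freqs[ch] - reference[ch]) ** 2 for ch in string.ascii_lowercase)


def rms_deviation(text: str, reference: dict[str, float] = CZECH_FREQ) -> float:
    """Divide the sum by the alphabet size, then take the square root."""
    return math.sqrt(squared_deviation(text, reference) / len(reference))


print(round(squared_deviation(FIRST_TEXT), 4))   # 0.0559
print(round(squared_deviation(SECOND_TEXT), 4))  # 0.0212
```

Because dividing by a constant and taking a square root both preserve order, `rms_deviation` ranks the two texts the same way as the plain sum.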

The most common words of the language

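Sometimes a list of the most common words of a language can be useful. The words in the following table appear in English translation, so several distinct Czech words show up under the same English gloss.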

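| Word | Occurrences | Word | Occurrences | Word | Occurrences | Word | Occurrences |
|---|---|---|---|---|---|---|---|
| I am | 2 516 782 | Here | 286 606 | advertised | 203 656 | perhaps | 163 296 |
| as | 1 321 843 | well | 273 130 | asked | 201 509 | Hands | 161 382 |
| when | 1 237 411 | for a while | 269 045 | mela | 199 285 | Also | 161 174 |
| its | 974 131 | All | 268 277 | if | 197 778 | seen | 160 506 |
| was | 689 849 | if | 262 429 | so | 193 357 | after all | 160 228 |
| was | 657 760 | percent | 262 226 | Mr. | 193 123 | prilis | 156 839 |
| ad | 644 137 | never | 260 096 | all | 192 988 | Place | 156 221 |
| more | 608 371 | could | 259 919 | All | 185 965 | by | 155 250 |
| We are | 549 840 | were | 255 072 | then | 183 655 | again | 154 924 |
| Which | 522 587 | between | 253 466 | at all | 179 968 | enough | 154 784 |
| or | 489 592 | around | 251 146 | its | 178 390 | some | 154 165 |
| which | 483 800 | to | 243 646 | Quickly | 175 422 | head | 153 777 |
| before | 460 548 | re | 235 604 | a little | 174 553 | to | 153 655 |
| something | 448 947 | maybe | 235 161 | permanent | 174 451 | by | 151 450 |
| of this | 446 667 | just | 233 467 | with me | 173 820 | started by | 151 309 |
| going to | 412 128 | muse | 227 622 | only | 172 716 | man | 149 240 |
| You are | 394 867 | also | 224 705 | who | 172 342 | wanted | 148 978 |
| are | 376 431 | more | 221 056 | people | 169 569 | against | 148 603 |
| their | 331 235 | Same | 218 216 | this | 167 825 | was not | 147 829 |
| not | 318 782 | if | 217 917 | Someone | 165 736 | very | 145 921 |
| I would | 304 981 | myself | 211 148 | First | 164 858 | doors | 141 094 |
| because | 302 978 | them | 210 384 | therefore | 164 817 | things | 140 673 |
| each | 301 623 | one | 208 864 | So | 164 637 | knew | 140 656 |
| Which | 295 817 | were | 206 830 | no one | 164 442 | think | 139 935 |
| her | 288 095 | via | 205 040 | Hello | 163 638 | once upon a time | 136 314 |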
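A word list of this kind could be produced along the following lines. This is only a rough sketch with a hypothetical function name, `most_common_words`; it treats every run of letters as a word and does not strip diacritics, which may differ from how the table above was compiled.

```python
import re
from collections import Counter


def most_common_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent words in text with their counts."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(n)


sample = "To je pes a to je kocka"
print(dict(most_common_words(sample, 2)))  # {'to': 2, 'je': 2}
```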