Frequency analysis
Frequency analysis is a general technique that helps in breaking several different kinds of ciphers. We go through the ciphertext and look for statistical patterns that match the language in which the message was written.
The frequency of letters in the ciphertext
One of the basic tools of frequency analysis is the frequency of characters in a given language. If we analyse enough Czech text, we can compute various statistics, including how many times each letter occurs.
Since I bought about five thousand Czech-language books the other day, I did such an analysis on them. There were a total of 1,404,443,151 letters in those five thousand books. The following table shows the distribution of letters (after removing diacritics):
Letter | Number of occurrences | Letter | Number of occurrences |
---|---|---|---|
a | 134 675 829 | n | 83 104 322 |
b | 24 944 593 | o | 112 776 769 |
c | 42 120 335 | p | 43 747 863 |
d | 53 015 496 | q | 83 322 |
e | 153 141 622 | r | 61 750 942 |
f | 2 458 624 | s | 78 451 777 |
g | 3 087 128 | t | 75 633 324 |
h | 35 075 708 | u | 50 265 458 |
i | 93 903 002 | v | 55 510 103 |
j | 32 383 080 | w | 762 129 |
k | 49 549 907 | x | 504 334 |
l | 80 345 129 | y | 40 132 126 |
m | 50 636 489 | z | 46 383 740 |
The following table then shows the relative frequency of letters:
Letter | Relative frequency | Letter | Relative frequency |
---|---|---|---|
a | 0.09589269 | n | 0.05917244 |
b | 0.0177612 | o | 0.08029999 |
c | 0.02999077 | p | 0.03114961 |
d | 0.03774841 | q | 0.00005933 |
e | 0.10904081 | r | 0.04396827 |
f | 0.0017506 | s | 0.0558597 |
g | 0.00219812 | t | 0.05385289 |
h | 0.02497482 | u | 0.03579031 |
i | 0.06686138 | v | 0.03952464 |
j | 0.02305759 | w | 0.00054266 |
k | 0.03528082 | x | 0.0003591 |
l | 0.05720782 | y | 0.02857512 |
m | 0.0360545 | z | 0.03302643 |
Thus, the table says that approximately 9.59% of the letters in all the books were the letter "a". From this we can estimate that roughly 9.59% of the letters in an ordinary Czech text will be "a".
Every such statistic will give slightly different values, but that is fine: exact values cannot be determined anyway, and approximate ones are all we need.
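The counting described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `strip_diacritics` and `letter_frequencies` are made up for this example, diacritics are removed by Unicode decomposition, and anything that is not one of the letters a to z is ignored.

```python
import string
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_frequencies(text: str) -> dict[str, float]:
    """Return the relative frequency of each letter a-z in the text."""
    cleaned = strip_diacritics(text).lower()
    counts = Counter(ch for ch in cleaned if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: report zero for every letter.
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


# Hypothetical sample input; a real analysis would read a large corpus.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy"
freqs = letter_frequencies(sample)
for letter, value in sorted(freqs.items(), key=lambda item: -item[1])[:5]:
    print(letter, round(value, 4))
```

A short sample like this one gives values far from those in the table; the frequencies only settle down once the input is large.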
Deviation of letter frequencies from the frequencies in the language
If we know the expected relative frequencies of each letter in a given language, we can calculate how far a text deviates from the expected values. As an example, consider two texts, "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu" and "proc neni parez stromem protoze obsahuje kruznice". At first glance, the first text is gibberish, while the second is a Czech sentence (written without diacritics). Let's calculate the deviation for both; the first text should come out substantially higher than the second.
First, let's calculate the occurrences of letters in each text:
Letter | Count (first text) | Count (second text) | Relative frequency (first text) | Relative frequency (second text) | Relative frequency (Czech language) |
---|---|---|---|---|---|
a | 1 | 2 | 0.02326 | 0.04651 | 0.09589 |
b | 0 | 1 | 0 | 0.02326 | 0.01776 |
c | 2 | 2 | 0.04651 | 0.04651 | 0.02999 |
d | 3 | 0 | 0.06977 | 0 | 0.03775 |
e | 5 | 6 | 0.11628 | 0.13953 | 0.10904 |
f | 3 | 0 | 0.06977 | 0 | 0.00175 |
g | 0 | 0 | 0 | 0 | 0.0022 |
h | 5 | 1 | 0.11628 | 0.02326 | 0.02497 |
i | 2 | 2 | 0.04651 | 0.04651 | 0.06686 |
j | 2 | 1 | 0.04651 | 0.02326 | 0.02306 |
k | 2 | 1 | 0.04651 | 0.02326 | 0.03528 |
l | 0 | 0 | 0 | 0 | 0.05721 |
m | 0 | 2 | 0 | 0.04651 | 0.03605 |
n | 0 | 3 | 0 | 0.06977 | 0.05917 |
o | 0 | 5 | 0 | 0.11628 | 0.0803 |
p | 3 | 3 | 0.06977 | 0.06977 | 0.03115 |
q | 2 | 0 | 0.04651 | 0 | 0.00006 |
r | 1 | 5 | 0.02326 | 0.11628 | 0.04397 |
s | 2 | 2 | 0.04651 | 0.04651 | 0.05586 |
t | 0 | 2 | 0 | 0.04651 | 0.05385 |
u | 6 | 2 | 0.13953 | 0.04651 | 0.03579 |
v | 0 | 0 | 0 | 0 | 0.03952 |
w | 0 | 0 | 0 | 0 | 0.00054 |
x | 1 | 0 | 0.02326 | 0 | 0.00036 |
y | 2 | 0 | 0.04651 | 0 | 0.02858 |
z | 1 | 3 | 0.02326 | 0.06977 | 0.03303 |
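The rows of this table can be reproduced with a short Python sketch. The helper names `count_letters` and `table_rows` are hypothetical, and the second text is written out here as the unaccented Czech sentence whose letter counts match the table (43 letters in each text), which is an assumption of this example.

```python
import string
from collections import Counter

first_text = "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu"
# Assumed wording of the second text; its counts agree with the table.
second_text = "proc neni parez stromem protoze obsahuje kruznice"


def count_letters(text: str) -> Counter:
    """Count only the letters a-z, ignoring spaces and other characters."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)


def table_rows(first: str, second: str) -> list[tuple]:
    """Build (letter, count1, count2, rel_freq1, rel_freq2) for a-z."""
    c1, c2 = count_letters(first), count_letters(second)
    n1, n2 = sum(c1.values()), sum(c2.values())
    return [
        (letter, c1[letter], c2[letter],
         round(c1[letter] / n1, 5), round(c2[letter] / n2, 5))
        for letter in string.ascii_lowercase
    ]


rows = table_rows(first_text, second_text)
for row in rows:
    print(" | ".join(str(cell) for cell in row))
```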
Now we calculate the deviations. For each letter, we subtract its relative frequency in the language from its relative frequency in the text and square the result. For the letter "a" in the first text, the deviation is
$$ (0.0232558 - 0.0958927)^2=0.0052761 $$
and similarly for all the other letters.
Letter | Relative frequency (first text) | Relative frequency (second text) | Relative frequency (Czech language) | Deviation (first text) | Deviation (second text) |
---|---|---|---|---|---|
a | 0.0232558 | 0.0465116 | 0.0958927 | 0.0052761 | 0.0024385 |
b | 0 | 0.0232558 | 0.0177612 | 0.0003155 | 0.0000302 |
c | 0.0465116 | 0.0465116 | 0.0299908 | 0.0002729 | 0.0002729 |
d | 0.0697674 | 0 | 0.0377484 | 0.0010252 | 0.0014249 |
e | 0.1162791 | 0.1395349 | 0.1090408 | 0.0000524 | 0.0009299 |
f | 0.0697674 | 0 | 0.0017506 | 0.0046263 | 0.0000031 |
g | 0 | 0 | 0.0021981 | 0.0000048 | 0.0000048 |
h | 0.1162791 | 0.0232558 | 0.0249748 | 0.0083365 | 0.000003 |
i | 0.0465116 | 0.0465116 | 0.0668614 | 0.0004141 | 0.0004141 |
j | 0.0465116 | 0.0232558 | 0.0230576 | 0.0005501 | 0 |
k | 0.0465116 | 0.0232558 | 0.0352808 | 0.0001261 | 0.0001446 |
l | 0 | 0 | 0.0572078 | 0.0032727 | 0.0032727 |
m | 0 | 0.0465116 | 0.0360545 | 0.0012999 | 0.0001094 |
n | 0 | 0.0697674 | 0.0591724 | 0.0035014 | 0.0001123 |
o | 0 | 0.1162791 | 0.0803 | 0.0064481 | 0.0012945 |
p | 0.0697674 | 0.0697674 | 0.0311496 | 0.0014913 | 0.0014913 |
q | 0.0465116 | 0 | 0.0000593 | 0.0021578 | 0 |
r | 0.0232558 | 0.1162791 | 0.0439683 | 0.000429 | 0.0052289 |
s | 0.0465116 | 0.0465116 | 0.0558597 | 0.0000874 | 0.0000874 |
t | 0 | 0.0465116 | 0.0538529 | 0.0029001 | 0.0000539 |
u | 0.1395349 | 0.0465116 | 0.0357903 | 0.0107629 | 0.0001149 |
v | 0 | 0 | 0.0395246 | 0.0015622 | 0.0015622 |
w | 0 | 0 | 0.0005427 | 0.0000003 | 0.0000003 |
x | 0.0232558 | 0 | 0.0003591 | 0.0005243 | 0.0000001 |
y | 0.0465116 | 0 | 0.0285751 | 0.0003217 | 0.0008165 |
z | 0.0232558 | 0.0697674 | 0.0330264 | 0.0000955 | 0.0013499 |
Sum | | | | 0.0558546 | 0.0211603 |
Finally, we sum all the deviations. Strictly speaking, we should also divide this sum by the number of letters and then take the square root to obtain a true deviation. Both operations preserve the ordering, though, so the conclusion stands: the second text has a significantly lower deviation and is therefore much more likely to be Czech.
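The whole calculation can be sketched in Python as follows. This is an illustration, not a definitive implementation: the names `CZECH_FREQ`, `squared_deviation` and `rms_deviation` are made up for the example, the second text is assumed to be the unaccented Czech sentence whose letter counts match the tables, and "the number of letters" is read here as the 26 letters of the alphabet.

```python
import math
import string
from collections import Counter

# Relative letter frequencies for Czech, copied from the table above.
CZECH_FREQ = {
    "a": 0.09589269, "b": 0.0177612, "c": 0.02999077, "d": 0.03774841,
    "e": 0.10904081, "f": 0.0017506, "g": 0.00219812, "h": 0.02497482,
    "i": 0.06686138, "j": 0.02305759, "k": 0.03528082, "l": 0.05720782,
    "m": 0.0360545, "n": 0.05917244, "o": 0.08029999, "p": 0.03114961,
    "q": 0.00005933, "r": 0.04396827, "s": 0.0558597, "t": 0.05385289,
    "u": 0.03579031, "v": 0.03952464, "w": 0.00054266, "x": 0.0003591,
    "y": 0.02857512, "z": 0.03302643,
}


def relative_frequencies(text: str) -> dict[str, float]:
    """Relative frequency of each letter a-z; other characters are ignored."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters")
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


def squared_deviation(text: str, reference: dict[str, float] = CZECH_FREQ) -> float:
    """Sum of squared differences between text and reference frequencies."""
    freqs = relative_frequencies(text)
    return sum((freqs[l] - reference[l]) ** 2 for l in string.ascii_lowercase)


def rms_deviation(text: str, reference: dict[str, float] = CZECH_FREQ) -> float:
    """Divide the sum by the alphabet size (an assumption) and take the root."""
    return math.sqrt(squared_deviation(text, reference) / len(reference))


first_text = "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu"
# Assumed wording of the second text; its counts agree with the tables.
second_text = "proc neni parez stromem protoze obsahuje kruznice"

for name, text in (("first", first_text), ("second", second_text)):
    print(name, round(squared_deviation(text), 7), round(rms_deviation(text), 7))
```

Because dividing by a constant and taking the square root are both monotonic, `rms_deviation` ranks the two texts in the same order as `squared_deviation`.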
The most common words of the language
Sometimes a list of the most common words of a language can be useful.
Word | Occurrences | Word | Occurrences | Word | Occurrences | Word | Occurrences |
---|---|---|---|---|---|---|---|
I am | 2 516 782 | Here | 286 606 | advertised | 203 656 | perhaps | 163 296 |
as | 1 321 843 | well | 273 130 | asked | 201 509 | Hands | 161 382 |
when | 1 237 411 | for a while | 269 045 | mela | 199 285 | Also | 161 174 |
its | 974 131 | All | 268 277 | if | 197 778 | seen | 160 506 |
was | 689 849 | if | 262 429 | so | 193 357 | after all | 160 228 |
was | 657 760 | percent | 262 226 | Mr. | 193 123 | prilis | 156 839 |
ad | 644 137 | never | 260 096 | all | 192 988 | Place | 156 221 |
more | 608 371 | could | 259 919 | All | 185 965 | by | 155 250 |
We are | 549 840 | were | 255 072 | then | 183 655 | again | 154 924 |
Which | 522 587 | between | 253 466 | at all | 179 968 | enough | 154 784 |
or | 489 592 | around | 251 146 | its | 178 390 | some | 154 165 |
which | 483 800 | to | 243 646 | Quickly | 175 422 | head | 153 777 |
before | 460 548 | re | 235 604 | a little | 174 553 | to | 153 655 |
something | 448 947 | maybe | 235 161 | permanent | 174 451 | by | 151 450 |
of this | 446 667 | just | 233 467 | with me | 173 820 | started by | 151 309 |
going to | 412 128 | muse | 227 622 | only | 172 716 | man | 149 240 |
You are | 394 867 | also | 224 705 | who | 172 342 | wanted | 148 978 |
are | 376 431 | more | 221 056 | people | 169 569 | against | 148 603 |
their | 331 235 | Same | 218 216 | this | 167 825 | was not | 147 829 |
not | 318 782 | if | 217 917 | Someone | 165 736 | very | 145 921 |
I would | 304 981 | myself | 211 148 | First | 164 858 | doors | 141 094 |
because | 302 978 | them | 210 384 | therefore | 164 817 | things | 140 673 |
each | 301 623 | one | 208 864 | So | 164 637 | knew | 140 656 |
Which | 295 817 | were | 206 830 | no one | 164 442 | think | 139 935 |
her | 288 095 | via | 205 040 | Hello | 163 638 | once upon a time | 136 314 |