Frequency analysis

Frequency analysis is a general technique that can be used when breaking several different kinds of ciphers. We go through the ciphertext and look for statistical similarities with the language in which the message was written.

The frequency of letters in the ciphertext

One of the basic tools of frequency analysis is the frequency of individual characters in a given language. If we analyse enough Czech text, we can compute various statistics, including how many times each letter occurs.

Since I bought about five thousand Czech-language books the other day, I did such an analysis on them. There were a total of 1,404,443,151 letters in those five thousand books. The following table shows the distribution of letters (after removing diacritics):

Letter	Number of occurrences	Letter	Number of occurrences
a	134 675 829	n	83 104 322
b	24 944 593	o	112 776 769
c	42 120 335	p	43 747 863
d	53 015 496	q	83 322
e	153 141 622	r	61 750 942
f	2 458 624	s	78 451 777
g	3 087 128	t	75 633 324
h	35 075 708	u	50 265 458
i	93 903 002	v	55 510 103
j	32 383 080	w	762 129
k	49 549 907	x	504 334
l	80 345 129	y	40 132 126
m	50 636 489	z	46 383 740

The following table then shows the relative frequency of letters:

Letter	Relative frequency	Letter	Relative frequency
a	0.09589269	n	0.05917244
b	0.0177612	o	0.08029999
c	0.02999077	p	0.03114961
d	0.03774841	q	0.00005933
e	0.10904081	r	0.04396827
f	0.0017506	s	0.0558597
g	0.00219812	t	0.05385289
h	0.02497482	u	0.03579031
i	0.06686138	v	0.03952464
j	0.02305759	w	0.00054266
k	0.03528082	x	0.0003591
l	0.05720782	y	0.02857512
m	0.0360545	z	0.03302643

Thus, the table says that approximately 9.59% of the letters in all the books were the letter "a". From this we can infer that approximately 9.59% of the letters in an ordinary Czech text will be "a".

Every such count will give slightly different values, but that does not matter. Exact values cannot be determined anyway; approximate ones are enough.
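The counting described above can be sketched in a few lines of Python. This is only an illustration, not the script that produced the tables: the function names `strip_diacritics`, `letter_counts` and `relative_frequencies` are made up for this example, and diacritics are removed with Unicode decomposition, which is one possible approach.

```python
import string
import unicodedata
from collections import Counter


def strip_diacritics(text):
    """Replace accented letters with their base letters (e.g. 'č' -> 'c')."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD decomposition the accents are separate combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_counts(text):
    """Count the letters a-z, ignoring case, diacritics and all other characters."""
    cleaned = strip_diacritics(text).lower()
    return Counter(ch for ch in cleaned if ch in string.ascii_lowercase)


def relative_frequencies(text):
    """Return the share of each letter a-z among all letters of the text."""
    counts = letter_counts(text)
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


sample = "Příliš žluťoučký kůň"
print(letter_counts(sample).most_common(3))
print(round(relative_frequencies(sample)["u"], 4))
```

Run over a large enough collection of texts, `letter_counts` would give a table of occurrences and `relative_frequencies` a table of relative frequencies like the ones above.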

The deviation of the frequency of letters from the frequency of letters in the language

If we know the expected relative frequencies of each letter in a given language, we can calculate how far some text deviates from the expected values. As an example, consider two texts, "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu" and "proc neni parez stromem protoze obsahuje kruznice". At first glance we can see that the first text is gibberish, while the second is a Czech sentence. Let's calculate the deviation for both; the first text should come out with a substantially higher deviation than the second.

First, let's calculate the occurrences of letters in each text:

Letter	Count (first text)	Count (second text)	Relative frequency (first text)	Relative frequency (second text)	Relative frequency (Czech)
a	1	2	0.02326	0.04651	0.09589
b	0	1	0	0.02326	0.01776
c	2	2	0.04651	0.04651	0.02999
d	3	0	0.06977	0	0.03775
e	5	6	0.11628	0.13953	0.10904
f	3	0	0.06977	0	0.00175
g	0	0	0	0	0.0022
h	5	1	0.11628	0.02326	0.02497
i	2	2	0.04651	0.04651	0.06686
j	2	1	0.04651	0.02326	0.02306
k	2	1	0.04651	0.02326	0.03528
l	0	0	0	0	0.05721
m	0	2	0	0.04651	0.03605
n	0	3	0	0.06977	0.05917
o	0	5	0	0.11628	0.0803
p	3	3	0.06977	0.06977	0.03115
q	2	0	0.04651	0	0.00006
r	1	5	0.02326	0.11628	0.04397
s	2	2	0.04651	0.04651	0.05586
t	0	2	0	0.04651	0.05385
u	6	2	0.13953	0.04651	0.03579
v	0	0	0	0	0.03952
w	0	0	0	0	0.00054
x	1	0	0.02326	0	0.00036
y	2	0	0.04651	0	0.02858
z	1	3	0.02326	0.06977	0.03303
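The counts in this table can be reproduced with a short Python sketch. The variable and function names are illustrative, and the second text is written here as the Czech sentence whose letter counts match the table (43 letters in each text).

```python
import string
from collections import Counter

first_text = "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu"
second_text = "proc neni parez stromem protoze obsahuje kruznice"


def count_letters(text):
    """Count only the letters a-z; spaces are skipped."""
    return Counter(ch for ch in text if ch in string.ascii_lowercase)


first_counts = count_letters(first_text)
second_counts = count_letters(second_text)
first_total = sum(first_counts.values())
second_total = sum(second_counts.values())

# One row per letter: both counts, then both relative frequencies.
for letter in string.ascii_lowercase:
    print(
        letter,
        first_counts[letter],
        second_counts[letter],
        round(first_counts[letter] / first_total, 5),
        round(second_counts[letter] / second_total, 5),
        sep="\t",
    )
```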

Now we calculate the deviations. We calculate these by subtracting the relative frequency of a letter in the language from the relative frequency in the first text and squaring. That is, for the letter "a" and the first text, we would get the deviation as

$$ (0.0232558 - 0.0958927)^2=0.0052761 $$

and similarly for all the other letters.
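Written generally, the same step looks as follows. The symbols are introduced here only for illustration: $t_\ell$ is the relative frequency of the letter $\ell$ in the text, $c_\ell$ is its relative frequency in Czech, and $d_\ell$ is the resulting deviation.

$$ d_\ell = (t_\ell - c_\ell)^2 $$

For example, for the letter "e" and the second text this gives

$$ (0.1395349 - 0.1090408)^2=0.0009299 $$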

Letter	Relative frequency (first text)	Relative frequency (second text)	Relative frequency (Czech)	Deviation (first text)	Deviation (second text)
a	0.0232558	0.0465116	0.0958927	0.0052761	0.0024385
b	0	0.0232558	0.0177612	0.0003155	0.0000302
c	0.0465116	0.0465116	0.0299908	0.0002729	0.0002729
d	0.0697674	0	0.0377484	0.0010252	0.0014249
e	0.1162791	0.1395349	0.1090408	0.0000524	0.0009299
f	0.0697674	0	0.0017506	0.0046263	0.0000031
g	0	0	0.0021981	0.0000048	0.0000048
h	0.1162791	0.0232558	0.0249748	0.0083365	0.000003
i	0.0465116	0.0465116	0.0668614	0.0004141	0.0004141
j	0.0465116	0.0232558	0.0230576	0.0005501	0
k	0.0465116	0.0232558	0.0352808	0.0001261	0.0001446
l	0	0	0.0572078	0.0032727	0.0032727
m	0	0.0465116	0.0360545	0.0012999	0.0001094
n	0	0.0697674	0.0591724	0.0035014	0.0001123
o	0	0.1162791	0.0803	0.0064481	0.0012945
p	0.0697674	0.0697674	0.0311496	0.0014913	0.0014913
q	0.0465116	0	0.0000593	0.0021578	0
r	0.0232558	0.1162791	0.0439683	0.000429	0.0052289
s	0.0465116	0.0465116	0.0558597	0.0000874	0.0000874
t	0	0.0465116	0.0538529	0.0029001	0.0000539
u	0.1395349	0.0465116	0.0357903	0.0107629	0.0001149
v	0	0	0.0395246	0.0015622	0.0015622
w	0	0	0.0005427	0.0000003	0.0000003
x	0.0232558	0	0.0003591	0.0005243	0.0000001
y	0.0465116	0	0.0285751	0.0003217	0.0008165
z	0.0232558	0.0697674	0.0330264	0.0000955	0.0013499
Sum				0.0558546	0.0211603

Finally, we sum all the deviations (the last row of the table). Strictly speaking, we should also divide this sum by the number of letters and then take the square root to obtain a true deviation. These operations do not change the ordering, though: the second text has a significantly lower deviation, so it is much more likely to be Czech.
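The whole calculation can be sketched in Python as follows. `CZECH_FREQ` copies the relative frequencies from the table earlier in this article; the function names are illustrative. The final normalisation is read here as a root mean square over the 26 letters of the alphabet, which is an assumption about what "the number of letters" refers to.

```python
import math
import string
from collections import Counter

# Relative letter frequencies in Czech, copied from the table above.
CZECH_FREQ = {
    "a": 0.09589269, "b": 0.0177612, "c": 0.02999077, "d": 0.03774841,
    "e": 0.10904081, "f": 0.0017506, "g": 0.00219812, "h": 0.02497482,
    "i": 0.06686138, "j": 0.02305759, "k": 0.03528082, "l": 0.05720782,
    "m": 0.0360545, "n": 0.05917244, "o": 0.08029999, "p": 0.03114961,
    "q": 0.00005933, "r": 0.04396827, "s": 0.0558597, "t": 0.05385289,
    "u": 0.03579031, "v": 0.03952464, "w": 0.00054266, "x": 0.0003591,
    "y": 0.02857512, "z": 0.03302643,
}


def relative_frequencies(text):
    """Share of each letter a-z among the letters of the text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters a-z")
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


def squared_deviation(text, reference=CZECH_FREQ):
    """Sum of squared differences between the text and the reference frequencies."""
    freqs = relative_frequencies(text)
    return sum((freqs[l] - reference[l]) ** 2 for l in string.ascii_lowercase)


first_text = "fhes dudy fqhup ijhecuc fhejepu eriqxkzu ahkpdysu"
second_text = "proc neni parez stromem protoze obsahuje kruznice"

dev_first = squared_deviation(first_text)
dev_second = squared_deviation(second_text)
print(f"sum of squared deviations: {dev_first:.7f} vs {dev_second:.7f}")

# Assumed normalisation: divide by the 26 letters, then take the square root.
rms_first = math.sqrt(dev_first / len(string.ascii_lowercase))
rms_second = math.sqrt(dev_second / len(string.ascii_lowercase))
print(f"root mean square deviation: {rms_first:.5f} vs {rms_second:.5f}")
```

The two sums should land close to the totals in the last row of the table, and the text with the smaller value is the better candidate for being Czech. Because dividing by a constant and taking the square root are both monotonic, the normalised values rank the texts in the same order.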

The most common words of the language

Sometimes a list of the most common words of a language can be useful.

Word	Occurrences	Word	Occurrences	Word	Occurrences	Word	Occurrences
I am	2 516 782	Here	286 606	advertised	203 656	perhaps	163 296
as	1 321 843	well	273 130	asked	201 509	Hands	161 382
when	1 237 411	for a while	269 045	mela	199 285	Also	161 174
its	974 131	All	268 277	if	197 778	seen	160 506
was	689 849	if	262 429	so	193 357	after all	160 228
was	657 760	percent	262 226	Mr.	193 123	prilis	156 839
ad	644 137	never	260 096	all	192 988	Place	156 221
more	608 371	could	259 919	All	185 965	by	155 250
We are	549 840	were	255 072	then	183 655	again	154 924
Which	522 587	between	253 466	at all	179 968	enough	154 784
or	489 592	around	251 146	its	178 390	some	154 165
which	483 800	to	243 646	Quickly	175 422	head	153 777
before	460 548	re	235 604	a little	174 553	to	153 655
something	448 947	maybe	235 161	permanent	174 451	by	151 450
of this	446 667	just	233 467	with me	173 820	started by	151 309
going to	412 128	muse	227 622	only	172 716	man	149 240
You are	394 867	also	224 705	who	172 342	wanted	148 978
are	376 431	more	221 056	people	169 569	against	148 603
their	331 235	Same	218 216	this	167 825	was not	147 829
not	318 782	if	217 917	Someone	165 736	very	145 921
I would	304 981	myself	211 148	First	164 858	doors	141 094
because	302 978	them	210 384	therefore	164 817	things	140 673
each	301 623	one	208 864	So	164 637	knew	140 656
Which	295 817	were	206 830	no one	164 442	think	139 935
her	288 095	via	205 040	Hello	163 638	once upon a time	136 314
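A word list like this can be produced in the same way as the letter counts. The sketch below is illustrative only: the function name `most_common_words` and the rule for splitting text into words (runs of letters, lowercased) are assumptions, not a description of how the table above was made.

```python
import re
from collections import Counter


def most_common_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(n)


sample = "Byl jednou jeden král a ten král byl starý"
print(most_common_words(sample, 2))
```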
