Formal languages

Formal language is a kind of generalization of the notion of language we are used to from everyday life.

Basic concepts

To begin with, we can think of formal language as a common language, such as English. What does such a language consist of? It consists of words that we further combine into sentences. We are not interested in sentences, only in words. What do words consist of? The letters of the alphabet. We'll start with the alphabet.

Analphabet is any non-empty set. The elements of the alphabet are called symbols. If we're talking about the Czech language, then it would be the classical Czech alphabet, including both lower case and upper case letters, plus variations of letters with hooks and commas. If we did not want to describe Czech but, for example, numbers, we would have all ten digits in the alphabet. We often denote the alphabet with the symbol $\Sigma$.

By a word, or also by a string, we mean a finite sequence of symbols from an alphabet $\Sigma$. If we have, for example, the alphabet $\Sigma=\left\{a,b,c,d,e,\dots, z\right\}$, then a word is, for example, the sequence "ahoj" or "doberman", but not "večer" (we don't have a "č"), "Honza" (we don't have an "H") or "dobry den" (we don't have a space). By word length I mean the number of symbols that make up a word.

Word concatenation: if we have two words a = a1a2… an and b = b1b2… bm, then concatenating the words will produce the word ab = a1a2… anb1b2… bm. Example: concatenating the words "hello" and "doberman" produces the word "hello doberman". We either mark the concatenation with a circle, i.e. $a\circ b$, or we don't mark it at all and just write the words after each other without a special symbol for concatenation.

We mark anempty word with $\varepsilon$ to denote a word that has length zero. If we concatenate the word a = a1… an with the empty word $\varepsilon$, we get back the word a. That is, $a\circ\varepsilon=a$.

The closure of the alphabet $\Sigma$ is the set of all the words we can form from the alphabet $\Sigma$, including the empty word. We denote the closure by $\Sigma^\ast$. If we have the binary alphabet $\Sigma=\left\{0,1\right\}$, then

$$ \Sigma^\ast=\left\{\varepsilon,0{,}1,01{,}10,11{,}011,101{,}110,\dots\right\} $$

Sometimes we also use a positive alphabet closure, which is again all the words we are able to assemble from the alphabet, except the empty word. We denote the positive closure by $\Sigma^+$ and the validity is $\Sigma^+=\Sigma^\ast\setminus\left\{\varepsilon\right\}$.

Examples of alphabets

  • Let's go back to the Czech example. Each Czech word is found in the cap of an alphabet that contains upper and lower case letters and accented variants (plus maybe a hyphen "-"). If we add a space and punctuation marks (comma, period, exclamation mark, question mark, ...) to this alphabet, we get all the sentences we can form in the Czech language. Of course, this closure will include sentences like "sadflsf kasdf kagrjewichiarzcáýker kf fkjbsjbgvbadfgsa!!!:??: :::::::::sdf", but these are still more meaningful than Miloš Zeman's speeches.

  • If our alphabet contains all the digits $\Sigma=\left\{0, 1, \dots, 9\right\}$, then all the natural numbers will be in the cap - and then some. There will also be numbers starting with zero, e.g. 00054, or zero itself, and there will be a blank word, which of course is not a number.

  • The alphabet must be non-empty, so the smallest alphabet has at least one symbol. For example, for the alphabet $\Sigma=\left\{!\right\}$ we would get the closure $\Sigma^\ast=\left\{\varepsilon, !, !!, !!!, !!!!, \dots\right\}$, i.e., the set of words, and for every n∈ℕ0 in the closure there is a word that has n exclamation marks and no other word in the closure.

  • Let $\Sigma=\left\{qw,c\right\}$. This is a very strange alphabet because it contains the word qw. This notation is not usually used because it is confusing and strange, but we can imagine it by gluing the letters "q" and "w" together so that they form one character. Then we can see the word "qw" as the symbol "qw". If we make this symbol into the word "qwcqw", then this word would have a length of three - it would be made up of the symbols "qw", "c" and "qw". We don't come across alphabet cases like this very often, but there are situations where we need to use such symbols consisting of multiple symbols.

Formal language

The(formal) language over the alphabet $\Sigma$ is any subset of $\Sigma^\ast$. Thus, $L\subseteq\Sigma^\ast$ applies to the language L over the alphabet $\Sigma$. Language examples:

  • In the previous section, we defined the alphabet of digits $\Sigma=\left\{0, 1, \dots, 9\right\}$. The closure is all words consisting of digits only. If we add the rule that a word cannot start with zero and a word cannot be empty, we get the set of natural numbers . The relation $\mathbb{N}\subseteq\Sigma^\ast$ holds, so is a language over the alphabet $\Sigma$.

  • We can add the minus sign "-" to the previous set of digits, so $\Sigma=\left\{0, 1, \dots, 9, -\right\}$. The closure is all words that consist of digits or a minus sign. However, this means that words like "12-84-", "1-5-8" or "-" are also in the closure. We create the integer language by adding three rules:

    • The minus sign does not occur in the word at all or occurs at the beginning of the word.
    • The first digit in a word cannot be zero except for the word "0".
    • Each word must contain at least one digit.

    This set describes the set of integers and is a subset of the set $\Sigma^\ast$ - so it is the language above the alphabet $\Sigma$.

  • Let $\Sigma=\left\{0,1\right\}$. The closure is thus all words that consist of zeros and ones. The language L over this alphabet can be defined, for example, as the set of all words that have exactly three ones. We can also be more creative and say that the language L will be the set of all words that represent a valid docx file format (what comes out of Word).

  • We'll stick with the binary alphabet $\Sigma=\left\{0,1\right\}$. Any string of characters (the normal characters you have on your keyboard) can be converted to binary, i.e., zeros and ones, and back again using, for example, an ASCII table (or more accurately, an ASCII table converts characters to numbers, and numbers in the decimal system can be converted to numbers in the binary system). So we can define the language of all the words that match a valid email and it will still be a language over the binary alphabet.

Concatenating languages: we can concatenate alphabets as well. The definition will be similar to the Cartesian product of sets. Let us have two languages L1 and L2. Concatenating them $L_1\circ L_2$ will give us a new language, which we define as follows:

$$ L_1\circ L_2 = \left\{w_1\circ w_2,|,w_1\in L_1, w_2\in L_2\right\} $$

That is, we take all the words from the language L1 and concatenate them with all the words from the language L2. Example.

\begin{eqnarray} L_1&=&&\left\{0,1\right\}\ L_2&=&\left\{a,b,ahoj\right\}\\ \end{eqnarray}

Then:

\begin{eqnarray} L_1\circ L_2 &=& \left\{0a, 0b, 0ahoj, 1a, 1b, 1ahoj\right\}\\ L_2\circ L_1 &=& \left\{a0, a1, b0, b1, ahoj0, ahoj1\right\}\ L_1\circ L_1 &=& \left\{00, 01, 10, 11\right\} \end{eqnarray}

So far, we have used ordinary language to describe languages, simply describing verbally what the language should look like. But this is very impractical, so a more formal way of defining a language is being introduced. The first of these is grammar.