“
In 1913, on the eve of World War I, the Russian mathematician Andrei Markov published a paper applying probability to, of all things, poetry. In it, he modeled a classic of Russian literature, Pushkin’s Eugene Onegin, using what we now call a Markov chain. Rather than assume that each letter was generated at random independently of the rest, he introduced a bare minimum of sequential structure: he let the probability of each letter depend on the letter immediately preceding it. He showed that, for example, vowels and consonants tend to alternate, so if you see a consonant, the next letter (ignoring punctuation and white space) is much more likely to be a vowel than it would be if letters were independent. This may not seem like much, but in the days before computers, it required spending hours manually counting characters, and Markov’s idea was quite new. If Vowel_i is a Boolean variable that’s true if the ith letter of Eugene Onegin is a vowel and false if it’s a consonant, we can represent Markov’s model with a chain-like graph like this, with an arrow between two nodes indicating a direct dependency between the corresponding variables:

Vowel_1 → Vowel_2 → Vowel_3 → …

Markov assumed (wrongly but usefully) that the probabilities are the same at every position in the text. Thus we need to estimate only three probabilities: P(Vowel_1 = True), P(Vowel_{i+1} = True | Vowel_i = True), and P(Vowel_{i+1} = True | Vowel_i = False). (Since probabilities sum to one, from these we can immediately obtain P(Vowel_1 = False), etc.) As with Naïve Bayes, we can have as many variables as we want without the number of probabilities we need to estimate going through the roof, but now the variables actually depend on each other.
”
Pedro Domingos (The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World)
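
The counting described in the quoted passage can be sketched in a few lines of Python. This is only an illustration, not Markov's own procedure: the names `VOWELS`, `vowel_flags`, and `estimate_chain` are hypothetical, the vowel set is the English one rather than the Russian alphabet Markov worked with, and P(Vowel_1 = True) is approximated by the overall vowel fraction, which leans on the passage's assumption that the probabilities are the same at every position.

```python
from collections import Counter

# Assumption: English vowels only. Markov counted Russian letters, and any
# other alphabetic character (including "y" and accented letters) is treated
# as a consonant in this sketch.
VOWELS = set("aeiou")


def vowel_flags(text):
    """Return one Boolean per letter: True for a vowel, False for a consonant.

    Punctuation, digits, and white space are skipped, as in the passage.
    """
    return [ch in VOWELS for ch in text.lower() if ch.isalpha()]


def estimate_chain(text):
    """Estimate the three probabilities of the vowel/consonant chain."""
    flags = vowel_flags(text)
    if len(flags) < 2:
        raise ValueError("need at least two letters to count transitions")

    # Count adjacent (previous, next) pairs of flags.
    pairs = Counter(zip(flags, flags[1:]))
    after_vowel = pairs[(True, True)] + pairs[(True, False)]
    after_consonant = pairs[(False, True)] + pairs[(False, False)]

    return {
        # Stand-in for P(Vowel_1 = True): the overall fraction of vowels.
        "p_vowel": sum(flags) / len(flags),
        # P(Vowel_{i+1} = True | Vowel_i = True); None if no vowel has a successor.
        "p_vowel_after_vowel": (
            pairs[(True, True)] / after_vowel if after_vowel else None
        ),
        # P(Vowel_{i+1} = True | Vowel_i = False); None if no consonant has a successor.
        "p_vowel_after_consonant": (
            pairs[(False, True)] / after_consonant if after_consonant else None
        ),
    }


if __name__ == "__main__":
    sample = "Vowels and consonants tend to alternate in ordinary text."
    for name, value in estimate_chain(sample).items():
        print(name, value)
```

If the alternation claim holds for a given text, `p_vowel_after_consonant` should come out noticeably higher than `p_vowel`, and `p_vowel_after_vowel` noticeably lower.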