“
The breakthrough came in the early 1980s, when Judea Pearl, a professor of computer science at the University of California, Los Angeles, invented a new representation: Bayesian networks. Pearl is one of the most distinguished computer scientists in the world, his methods having swept through machine learning, AI, and many other fields. He won the Turing Award, the Nobel Prize of computer science, in 2012. Pearl realized that it’s OK to have a complex network of dependencies among random variables, provided each variable depends directly on only a few others. We can represent these dependencies with a graph like the ones we saw for Markov chains and HMMs, except now the graph can have any structure (as long as the arrows don’t form closed loops).

One of Pearl’s favorite examples is burglar alarms. The alarm at your house should go off if a burglar attempts to break in, but it could also be triggered by an earthquake. (In Los Angeles, where Pearl lives, earthquakes are almost as frequent as burglaries.) If you’re working late one night and your neighbor Bob calls to say he just heard your alarm go off, but your neighbor Claire doesn’t, should you call the police? Here’s the graph of dependencies:

    Burglary → Alarm ← Earthquake
    Alarm → Bob calls
    Alarm → Claire calls

If there’s an arrow from one node to another in the graph, we say that the first node is a parent of the second. So Alarm’s parents are Burglary and Earthquake, and Alarm is the sole parent of Bob calls and Claire calls. A Bayesian network is a graph of dependencies like this, together with a table for each variable, giving its probability for each combination of values of its parents. For Burglary and Earthquake we only need one probability each, since they have no parents. For Alarm we need four: the probability that it goes off even if there’s no burglary or earthquake, the probability that it goes off if there’s a burglary and no earthquake, and so on. For Bob calls we need two probabilities (given alarm and given no alarm), and similarly for Claire.

Here’s the crucial point: Bob calling depends on Burglary and Earthquake, but only through Alarm. Bob’s call is conditionally independent of Burglary and Earthquake given Alarm, and so is Claire’s. If the alarm doesn’t go off, your neighbors sleep soundly, and the burglar proceeds undisturbed. Also, Bob and Claire are independent given Alarm.

Without this independence structure, you’d need to learn 2⁵ = 32 probabilities, one for each possible state of the five variables. (Or 31, if you’re a stickler for details, since the last one can be left implicit.) With the conditional independencies, all you need is 1 + 1 + 4 + 2 + 2 = 10, a savings of 68 percent. And that’s just in this tiny example; with hundreds or thousands of variables, the savings would be very close to 100 percent.
”
Pedro Domingos (The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World)
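Below is a minimal Python sketch of the alarm network the passage describes. The graph structure and the 1 + 1 + 4 + 2 + 2 = 10 table entries follow Domingos’s counting, but the actual probability values are invented for illustration (the excerpt gives none), and the helper names joint and posterior_burglary are assumptions of mine, not anything from the book.

```python
# Sketch of the burglar-alarm Bayesian network from the passage.
# The graph follows the text; the probability values are made up
# for illustration -- the excerpt supplies no numbers.
from itertools import product

# Conditional probability tables: 1 + 1 + 4 + 2 + 2 = 10 numbers,
# versus 2**5 - 1 = 31 for an unrestricted joint distribution.
P_burglary = 0.001                      # P(Burglary)
P_earthquake = 0.002                    # P(Earthquake)
P_alarm = {                             # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_bob = {True: 0.90, False: 0.05}       # P(Bob calls | Alarm)
P_claire = {True: 0.70, False: 0.01}    # P(Claire calls | Alarm)


def joint(b, e, a, bob, claire):
    """Joint probability via the factorization the network encodes:
    P(B, E, A, Bob, Claire) = P(B) P(E) P(A|B,E) P(Bob|A) P(Claire|A)."""
    p = P_burglary if b else 1 - P_burglary
    p *= P_earthquake if e else 1 - P_earthquake
    p *= P_alarm[(b, e)] if a else 1 - P_alarm[(b, e)]
    p *= P_bob[a] if bob else 1 - P_bob[a]
    p *= P_claire[a] if claire else 1 - P_claire[a]
    return p


def posterior_burglary(bob, claire):
    """P(Burglary | Bob calls, Claire calls) by summing out the
    unobserved variables (brute-force enumeration)."""
    totals = {True: 0.0, False: 0.0}
    for b, e, a in product([True, False], repeat=3):
        totals[b] += joint(b, e, a, bob, claire)
    return totals[True] / (totals[True] + totals[False])


# Bob heard the alarm, Claire did not: how worried should you be?
print(posterior_burglary(bob=True, claire=False))
```

Brute-force enumeration like this only works for tiny networks, but it makes the factorization concrete: with these made-up numbers, a call from Bob alone while Claire stays silent leaves the probability of a burglary well under one percent.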