Statistical Model Quotes

We've searched our database for all the quotes and captions related to Statistical Model. Here they are! All 100 of them:

Essentially, all models are wrong, but some are useful
George E.P. Box (Empirical Model-Building and Response Surfaces (Wiley Series in Probability and Statistics))
Indeed, statistical modelling based on these results even suggests that one of the effects of the plague was a substantial improvement in life expectancy.
Peter Frankopan (The Silk Roads: A New History of the World)
It is critical to recognize the limitations of LLMs from a consumer perspective. LLMs only possess statistical knowledge about word patterns, not true comprehension of ideas, facts, or emotions. Their fluency can create an illusion of human-like understanding, but rigorous testing reveals brittleness. Just because a LLM can generate coherent text about medicine or law doesn’t mean it grasps those professional domains. It does not. Responsible evaluation is essential to avoid overestimating capabilities.
I. Almeida (Introduction to Large Language Models for Business Leaders: Responsible AI Strategy Beyond Fear and Hype (Byte-sized Learning Book 2))
Models are the mothers of invention.
Leland Wilkinson (The Grammar of Graphics. Statistics and Computing.)
Newer systems use statistical machine learning techniques that automatically build statistical models from observed usage patterns.
Nick Bostrom (Superintelligence: Paths, Dangers, Strategies)
Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more.
Pedro Domingos (The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World)
You end up with a machine which knows that by its mildest estimate it must have terrible enemies all around and within it, but it can't find them. It therefore deduces that they are well-concealed and expert, likely professional agitators and terrorists. Thus, more stringent and probing methods are called for. Those who transgress in the slightest, or of whom even small suspicions are harboured, must be treated as terrible foes. A lot of rather ordinary people will get repeatedly investigated with increasing severity until the Government Machine either finds enemies or someone very high up indeed personally turns the tide... And these people under the microscope are in fact just taking up space in the machine's numerical model. In short, innocent people are treated as hellish fiends of ingenuity and bile because there's a gap in the numbers.
Nick Harkaway (The Gone-Away World)
In high school algebra, someone had already worked out the formulas. The teacher knew them or could find them in the teacher’s manual for the textbook. Imagine a word problem where nobody knows how to turn it into a formula, where some of the information is redundant and should not be used, where crucial information is often missing, and where there is no similar example worked out earlier in the textbook. This is what happens when one tries to apply statistical models to real-life problems.
David Salsburg (The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century)
In this country, it is not the highest virtue, nor the heroic act, that achieves fame, but the uncommon nature of the least significant destiny. There is plenty for everyone, then, since the more conformist the system as a whole becomes, the more millions of individuals there are who are set apart by some tiny peculiarity. The slightest vibration in a statistical model, the tiniest whim of a computer are enough to bathe some piece of abnormal behaviour, however banal, in a fleeting glow of fame.
Jean Baudrillard (America)
Most statistical models are built on the notion that there are independent variables and dependent variables, inputs and outputs, and they can be kept pretty much separate from one another. When it comes to the economy, they are all lumped together in one hot mess.
Nate Silver (The Signal and the Noise: Why So Many Predictions Fail-but Some Don't)
One reason why many statistical models are incomplete is that they do not specify the sources of randomness generating variability among agents, i.e., they do not specify why otherwise observationally identical people make different choices and have different outcomes given the same choice.
James J. Heckman
All models are wrong, but some are useful.
David Spiegelhalter (The Art of Statistics: Learning from Data)
Sometimes the job of a data scientist is to know when you don't know enough.
Cathy O'Neil (Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy)
You no longer watch TV, it is TV that watches you (live),” or again: “You are no longer listening to Don’t Panic, it is Don’t Panic that is listening to you”—a switch from the panoptic mechanism of surveillance (Discipline and Punish [Surveiller et punir]) to a system of deterrence, in which the distinction between the passive and the active is abolished. There is no longer any imperative of submission to the model, or to the gaze “YOU are the model!” “YOU are the majority!” Such is the watershed of a hyperreal sociality, in which the real is confused with the model, as in the statistical operation, or with the medium. …Such is the last stage of the social relation, ours, which is no longer one of persuasion (the classical age of propaganda, of ideology, of publicity, etc.) but one of deterrence: “YOU are information, you are the social, you are the event, you are involved, you have the word, etc.” An about-face through which it becomes impossible to locate one instance of the model, of power, of the gaze, of the medium itself, because you are always already on the other side.
Jean Baudrillard (Simulacra and Simulation)
This book is an essay in what is derogatorily called "literary economics," as opposed to mathematical economics, econometrics, or (embracing them both) the "new economic history." A man does what he can, and in the more elegant - one is tempted to say "fancier" - techniques I am, as one who received his formation in the 1930s, untutored. A colleague has offered to provide a mathematical model to decorate the work. It might be useful to some readers, but not to me. Catastrophe mathematics, dealing with such events as falling off a height, is a new branch of the discipline, I am told, which has yet to demonstrate its rigor or usefulness. I had better wait. Econometricians among my friends tell me that rare events such as panics cannot be dealt with by the normal techniques of regression, but have to be introduced exogenously as "dummy variables." The real choice open to me was whether to follow relatively simple statistical procedures, with an abundance of charts and tables, or not. In the event, I decided against it. For those who yearn for numbers, standard series on bank reserves, foreign trade, commodity prices, money supply, security prices, rate of interest, and the like are fairly readily available in the historical statistics.
Charles P. Kindleberger (Manias, Panics, and Crashes: A History of Financial Crises)
Be wary, though, of the way news media use the word “significant,” because to statisticians it doesn’t mean “noteworthy.” In statistics, the word “significant” means that the results passed mathematical tests such as t-tests, chi-square tests, regression, and principal components analysis (there are hundreds). Statistical significance tests quantify how easily pure chance can explain the results. With a very large number of observations, even small differences that are trivial in magnitude can be beyond what our models of chance and randomness can explain. These tests don’t know what’s noteworthy and what’s not—that’s a human judgment.
Daniel J. Levitin (A Field Guide to Lies: Critical Thinking in the Information Age)
Thus, they do not need to understand the statistical and mathematical models in depth. However, marketers need to understand the fundamental ideas behind a predictive model so that they can guide the technical teams to select data to use and which patterns to find.
Philip Kotler (Marketing 5.0: Technology for Humanity)
These examples should be models for communication, precisely because they inspire curiosity. “How does money influence politics?” is not an especially engaging question, but “If I were running for president, how would I raise lots of money with few conditions and no scrutiny?” is much more intriguing.
Tim Harford (The Data Detective: Ten Easy Rules to Make Sense of Statistics)
The point that apocalyptic makes is not only that people who wear crowns and who claim to foster justice by the sword are not as strong as they think--true as that is: we still sing, 'O where are Kings and Empires now of old that went and came?' It is that people who bear crosses are working with the grain of the universe. One does not come to that belief by reducing social processes to mechanical and statistical models, nor by winning some of one's battles for the control of one's own corner of the fallen world. One comes to it by sharing the life of those who sing about the Resurrection of the slain Lamb.
John Howard Yoder
Avoid succumbing to the gambler’s fallacy or the base rate fallacy. Anecdotal evidence and correlations you see in data are good hypothesis generators, but correlation does not imply causation—you still need to rely on well-designed experiments to draw strong conclusions. Look for tried-and-true experimental designs, such as randomized controlled experiments or A/B testing, that show statistical significance. The normal distribution is particularly useful in experimental analysis due to the central limit theorem. Recall that in a normal distribution, about 68 percent of values fall within one standard deviation, and 95 percent within two. Any isolated experiment can result in a false positive or a false negative and can also be biased by myriad factors, most commonly selection bias, response bias, and survivorship bias. Replication increases confidence in results, so start by looking for a systematic review and/or meta-analysis when researching an area.
Gabriel Weinberg (Super Thinking: The Big Book of Mental Models)
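A quick numerical check of the 68 and 95 percent figures cited above, as a minimal sketch assuming SciPy is available; the percentages are properties of the standard normal distribution itself rather than of any particular dataset:

```python
# Sketch: verify the 68/95 rule using the standard normal CDF.
# norm.cdf(z) gives P(Z <= z) for a standard normal variable Z.
from scipy.stats import norm

within_one_sd = norm.cdf(1) - norm.cdf(-1)   # P(-1 < Z < 1)
within_two_sd = norm.cdf(2) - norm.cdf(-2)   # P(-2 < Z < 2)

print(f"Within 1 SD: {within_one_sd:.3f}")   # ~0.683
print(f"Within 2 SD: {within_two_sd:.3f}")   # ~0.954
```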
There are two primary strains in the Conservative Party: grocers, and grandees. … By ‘grandees’ and ‘grocers’, I am not referring to social class or any of that; nor do I refer to the Worshipful Company of Grocers, all cloves and camels. I refer rather to two fundamental positions within the Conservative Party, regardless of one’s antecedents. … A grandee Conservative sees the country as a village: a village of which he and his party, when in government, act the Squire. As the Squire, the grandee moves jovially amongst his tenants in their tied cottages, dispensing largesse and reproof…. There are two problems with this model. The first is that HMG is not the Squire and the subjects of the Crown are not the smocked tenantry of the government of the day. The second is that these principles – or instincts, as one can hardly call them principles – however different they may be to the fiercely held maxims of Labour old and new, lead in the end to the same statist solutions as those the Left proposes, and to accepting and ‘managing’ statism when a Conservative government succeeds a Labour one. It is the grocers who will always and rightly attempt to roll back the State and its reach in favour of liberty.
G.M.W. Wemyss
A general challenge for the models we have written here, but for theory more generally in biology, is to be ahead of the experiments. Ultimately, we want to suggest exciting and revealing experiments that have not yet been conceived or undertaken. One of the critical frontiers in this area is to design experiments that showcase the uniquely nonequilibrium features of living systems, providing an impetus for new kinds of statistical physics.
Rob Phillips (The Molecular Switch: Signaling and Allostery)
I don’t mean to compare myself to a couple of artists I unreservedly admire—Miles Davis and Ray Charles—but I would like to think that some of the people who liked my book responded to it in a way similar to the way they respond when Miles and Ray are blowing. These artists, in their very different ways, sing a kind of universal blues, they speak of something far beyond their charts, graphs, statistics, they are telling us something about what it is like to be alive. It is not self-pity which one hears in them, but compassion. And perhaps this is the place for me to say that I really do not, at the very bottom of my own mind, compare myself to other writers. I think I really helplessly model myself on jazz musicians and try to write the way they sound. I am not an intellectual, not in the dreary sense that word is used today, and do not want to be: I am aiming at what Henry James called “perception at the pitch of passion.”
James Baldwin (The Cross of Redemption: Uncollected Writings)
There is no freedom or justice in exchanging the female role for the male role. There is, no doubt about it, equality. There is no freedom or justice in using male language, the language of your oppressor, to describe sexuality. There is no freedom or justice or even common sense in developing a male sexual sensibility—a sexual sensibility which is aggressive, competitive, objectifying, quantity oriented. There is only equality. To believe that freedom or justice for women, or for any individual woman, can be found in mimicry of male sexuality is to delude oneself and to contribute to the oppression of one’s sisters. Many of us would like to think that in the last four years, or ten years, we have reversed, or at least impeded, those habits and customs of the thousands of years which went before—the habits and customs of male dominance. There is no fact or figure to bear that out. You may feel better, or you may not, but statistics show that women are poorer than ever, that women are raped more and murdered more. I want to suggest to you that a commitment to sexual equality with males, that is, to uniform character as of motion or surface, is a commitment to becoming the rich instead of the poor, the rapist instead of the raped, the murderer instead of the murdered. I want to ask you to make a different commitment—a commitment to the abolition of poverty, rape, and murder; that is, a commitment to ending the system of oppression called patriarchy; to ending the male sexual model itself.
Andrea Dworkin (Last Days at Hot Slit: The Radical Feminism of Andrea Dworkin)
Because the decimation of the second, reborn Greenwood can also be laid at the feet of men and women who sat in air-conditioned offices and did their work with pencils and calculators, blue-line maps, real estate estimates, and government statistics. For the efforts to carve up the city's historic African American district had not ended with the attempted land grab for a new railroad terminal back in 1921. Now they had new names. Urban renewal. Redlining. Slum clearance. Model Cities. Opportunity. Progress.
Scott Ellsworth (The Ground Breaking: An American City and Its Search for Justice)
System 1 is generally very good at what it does: its models of familiar situations are accurate, its short-term predictions are usually accurate as well, and its initial reactions to challenges are swift and generally appropriate. System 1 has biases, however, systematic errors that it is prone to make in specified circumstances. As we shall see, it sometimes answers easier questions than the one it was asked, and it has little understanding of logic and statistics. One further limitation of System 1 is that it cannot be turned off.
Daniel Kahneman (Thinking, Fast and Slow)
Buckminster Fuller often urged his audiences to try this simple experiment: stand, at "sunset," facing the sun for several minutes. As you watch the spectacular technicolor effects, keep reminding yourself, "The sun is not 'going down.’ The earth is rotating on its axis." If you are statistically normal, you will feel, after a few minutes, that, even though you understand the Copernican model intellectually, part of you — a large part — never felt it before. Part of you, hypnotized by metaphor, has always felt the pre-Copernican model of a stationary Earth.
Robert Anton Wilson (The New Inquisition: Irrational Rationalism and the Citadel of Science)
Computational model: history is the on-chain population; all the rest is editorialization. There’s a great book by Franco Moretti called Graphs, Maps, and Trees. It’s a computational study of literature. Moretti’s argument is that every other study of literature is inherently biased. The selection of which books to discuss is itself an implicit editorialization. He instead makes this completely explicit by creating a dataset of full texts, and writing code to produce graphs. The argument here is that only a computational history can represent the full population in a statistical sense; anything else is just a biased sample.
Balaji S. Srinivasan (The Network State: How To Start a New Country)
It is a positive sign that a growing number of social movements are recognizing that indigenous self-determination must become the foundation for all our broader social justice mobilizing. Indigenous peoples are the most impacted by the pillage of lands, experience disproportionate poverty and homelessness, are overrepresented in statistics of missing and murdered women, and are the primary targets of repressive policing and prosecutions in the criminal injustice system. Rather than being treated as a single issue within a laundry list of demands, indigenous self-determination is increasingly understood as intertwined with struggles against racism, poverty, police violence, war and occupation, violence against women, and environmental justice. ... We have to be cautious to avoid replicating the state's assimilationist model of liberal pluralism, whereby indigenous identities are forced to fit within our existing groups and narratives. ... Indigenous struggle cannot simply be accommodated within other struggles; it demands solidarity on its own terms. Original blog post: Unsettling America: Decolonization in Theory and Practice. Quoted In: Decolonize Together: Moving beyond a Politics of Solidarity toward a Practice of Decolonization. Taking Sides.
Harsha Walia
Price mostly meanders around recent price until a big shift in opinion occurs, causing price to jump up or down. This is crudely modeled by quants using something called a jump-diffusion process model. Again, what does this have to do with an asset’s true intrinsic value? Not much. Fortunately, the value-focused investor doesn’t have to worry about these statistical methods and jargon. Stochastic calculus, information theory, GARCH variants, statistics, or time-series analysis is interesting if you’re into it, but for the value investor, it is mostly noise and not worth pursuing. The value investor needs to accept that often price can be wrong for long periods and occasionally offers interesting discounts to value.
Nick Gogerty (The Nature of Value: How to Invest in the Adaptive Economy (Columbia Business School Publishing))
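For readers curious what the "jump-diffusion process model" mentioned above looks like, here is a minimal simulation sketch of a Merton-style jump-diffusion price path; every parameter value (drift, volatility, jump rate, jump sizes, starting price) is an illustrative assumption, not calibrated to any real asset:

```python
# Sketch: simulate one Merton-style jump-diffusion price path.
# The log-price follows geometric Brownian motion plus occasional Poisson-timed jumps.
# All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_steps, dt = 252, 1 / 252                      # one year of daily steps
mu, sigma = 0.05, 0.2                           # drift and diffusion volatility
jump_rate, jump_mu, jump_sigma = 3, 0.0, 0.05   # ~3 jumps/year, jump sizes in log space

log_price = np.zeros(n_steps + 1)
for t in range(1, n_steps + 1):
    diffusion = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    n_jumps = rng.poisson(jump_rate * dt)
    jumps = rng.normal(jump_mu, jump_sigma, n_jumps).sum() if n_jumps else 0.0
    log_price[t] = log_price[t - 1] + diffusion + jumps

price = 100 * np.exp(log_price)                 # start the path at an arbitrary 100
print(price[-1])
```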
If you can’t make a good prediction, it is very often harmful to pretend that you can. I suspect that epidemiologists, and others in the medical community, understand this because of their adherence to the Hippocratic oath. Primum non nocere: First, do no harm. Much of the most thoughtful work on the use and abuse of statistical models and the proper role of prediction comes from people in the medical profession. That is not to say there is nothing on the line when an economist makes a prediction, or a seismologist does. But because of medicine’s intimate connection with life and death, doctors tend to be appropriately cautious. In their field, stupid models kill people. It has a sobering effect. There is something more to be said, however, about Chip Macal’s idea of “modeling for insights.” The philosophy of this book is that prediction is as much a means as an end. Prediction serves a very central role in hypothesis testing, for instance, and therefore in all of science. As the statistician George E. P. Box wrote, “All models are wrong, but some models are useful.” What he meant by that is that all models are simplifications of the universe, as they must necessarily be. As another mathematician said, “The best model of a cat is a cat.” Everything else is leaving out some sort of detail. How pertinent that detail might be will depend on exactly what problem we’re trying to solve and on how precise an answer we require.
Nate Silver (The Signal and the Noise: Why So Many Predictions Fail-but Some Don't)
Marriage is inefficient!” she proclaims. “The whole construct is a model of wasted resources. The wife often stays home to care for the children, or even a single child, abandoning the career she worked so hard for, losing years of creative output. Beyond the wasting of talent, think of the physical waste. For every home, there are so many redundancies. How many toasters do you think there are in the world?” “I have no idea.” “Seriously, just guess.” “Ten million?” I say impatiently. “More than two hundred million! And how often do you think the average household uses its toaster?” Once again, she doesn’t wait for my answer. “Just 2.6 hours per year. Two hundred million toasters are sitting unused, statistically speaking, more than 99.97 percent of their active lives.
Michelle Richmond (The Marriage Pact)
VaR has been called “potentially catastrophic,” “a fraud,” and many other things not fit for a family book about statistics like this one. In particular, the model has been blamed for the onset and severity of the financial crisis. The primary critique of VaR is that the underlying risks associated with financial markets are not as predictable as a coin flip or even a blind taste test between two beers. The false precision embedded in the models created a false sense of security. The VaR was like a faulty speedometer, which is arguably worse than no speedometer at all. If you place too much faith in the broken speedometer, you will be oblivious to other signs that your speed is unsafe. In contrast, if there is no speedometer at all, you have no choice but to look around for clues as to how fast you are really going.
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
Equally important, statistical systems require feedback—something to tell them when they’re off track. Without feedback, however, a statistical engine can continue spinning out faulty and damaging analysis while never learning from its mistakes. Many of the WMDs I’ll be discussing in this book, including the Washington school district’s value-added model, behave like that. They define their own reality and use it to justify their results. This type of model is self-perpetuating, highly destructive—and very common. If the people being evaluated are kept in the dark, the thinking goes, they’ll be less likely to attempt to game the system. Instead, they’ll simply have to work hard, follow the rules, and pray that the model registers and appreciates their efforts. But if the details are hidden, it’s also harder to question the score or to protest against it.
Cathy O'Neil (Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy)
4. They can cause a lot of damage to your body and your life. Because they’re frozen in dreadful scenes in the past and carry burdens from those times, they will do whatever they need to do to get your attention when you won’t listen: punish you or others, convince others to take care of them, sabotage your plans, or eliminate people in your life they see as a threat. To do these things and more, they can exacerbate or give you physical symptoms or diseases, nightmares and strange dreams, emotional outbursts, and chronic emotional states. Indeed, most of the syndromes that make up the Diagnostic and Statistical Manual are simply descriptions of the different clusters of protectors that dominate people after they’ve been traumatized. When you think of those diagnoses that way, you feel a lot less defective and a lot more empowered to help those protectors out of those roles.
Richard C. Schwartz (No Bad Parts: Healing Trauma and Restoring Wholeness with the Internal Family Systems Model)
Modern statistics is built on the idea of models — probability models in particular. [...] The standard approach to any new problem is to identify the sources of variation, to describe those sources by probability distributions and then to use the model thus created to estimate, predict or test hypotheses about the undetermined parts of that model. […] A statistical model involves the identification of those elements of our problem which are subject to uncontrolled variation and a specification of that variation in terms of probability distributions. Therein lies the strength of the statistical approach and the source of many misunderstandings. Paradoxically, misunderstandings arise both from the lack of an adequate model and from over reliance on a model. […] At one level is the failure to recognise that there are many aspects of a model which cannot be tested empirically. At a higher level is the failure to recognise that any model is, necessarily, an assumption in itself. The model is not the real world itself but a representation of that world as perceived by ourselves. This point is emphasised when, as may easily happen, two or more models make exactly the same predictions about the data. Even worse, two models may make predictions which are so close that no data we are ever likely to have can ever distinguish between them. […] All model-dependent inference is necessarily conditional on the model. This stricture needs, especially, to be borne in mind when using Bayesian methods. Such methods are totally model-dependent and thus all are vulnerable to this criticism. The problem can apparently be circumvented, of course, by embedding the model in a larger model in which any uncertainties are, themselves, expressed in probability distributions. However, in doing this we are embarking on a potentially infinite regress which quickly gets lost in a fog of uncertainty.
David J. Bartholomew (Unobserved Variables: Models and Misunderstandings (SpringerBriefs in Statistics))
In 1963, the chaos theorist Edward Lorenz presented an often-referenced lecture entitled “Does the Flap of a Butterfly’s Wings in Brazil Set Off a Tornado in Texas?” Lorenz’s main point was that chaotic mathematical functions are very sensitive to initial conditions. Slight differences in initial conditions can lead to dramatically different results after many iterations. Lorenz believed that this sensitivity to slight differences in the beginning made it impossible to determine an answer to his question. Underlying Lorenz’s lecture was the assumption of determinism, that each initial condition can theoretically be traced as a cause of a final effect. This idea, called the “Butterfly Effect,” has been taken by the popularizers of chaos theory as a deep and wise truth. However, there is no scientific proof that such a cause and effect exists. There are no well-established mathematical models of reality that suggest such an effect. It is a statement of faith. It has as much scientific validity as statements about demons or God.
David Salsburg (The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century)
Our exploration into advertising and media is at its root a critique of the exploitative nature of capitalism and consumerism. Our economic systems shape how we see our bodies and the bodies of others, and they ultimately inform what we are compelled to do and buy based on that reflection. Profit-greedy industries work with media outlets to offer us a distorted perception of ourselves and then use that distorted self-image to sell us remedies for the distortion. Consider that the female body type portrayed in advertising as the “ideal” is possessed naturally by only 5 percent of American women. Whereas the average U.S. woman is five feet four inches tall and weighs 140 pounds, the average U.S. model is five feet eleven and weighs 117. Now consider a People magazine survey which reported that 80 percent of women respondents said images of women on television and in the movies made them feel insecure. Together, those statistics and those survey results illustrate a regenerative market of people who feel deficient based on the images they encounter every day, seemingly perfectly matched with advertisers and manufacturers who have just the products to sell them (us) to fix those imagined deficiencies.
Sonya Renee Taylor (The Body Is Not an Apology: The Power of Radical Self-Love)
This happens because data scientists all too often lose sight of the folks on the receiving end of the transaction. They certainly understand that a data-crunching program is bound to misinterpret people a certain percentage of the time, putting them in the wrong groups and denying them a job or a chance at their dream house. But as a rule, the people running the WMDs don’t dwell on those errors. Their feedback is money, which is also their incentive. Their systems are engineered to gobble up more data and fine-tune their analytics so that more money will pour in. Investors, of course, feast on these returns and shower WMD companies with more money. And the victims? Well, an internal data scientist might say, no statistical system can be perfect. Those folks are collateral damage. And often, like Sarah Wysocki, they are deemed unworthy and expendable. Big Data has plenty of evangelists, but I’m not one of them. This book will focus sharply in the other direction, on the damage inflicted by WMDs and the injustice they perpetuate. We will explore harmful examples that affect people at critical life moments: going to college, borrowing money, getting sentenced to prison, or finding and holding a job. All of these life domains are increasingly controlled by secret models wielding arbitrary punishments.
Cathy O'Neil (Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy)
In Bohr’s model of the atom, electrons could change their orbits (or, more precisely, their stable standing wave patterns) only by certain quantum leaps. De Broglie’s thesis helped explain this by conceiving of electrons not just as particles but also as waves. Those waves are strung out over the circular path around the nucleus. This works only if the circle accommodates a whole number—such as 2 or 3 or 4—of the particle’s wavelengths; it won’t neatly fit in the prescribed circle if there’s a fraction of a wavelength left over. De Broglie made three typed copies of his thesis and sent one to his adviser, Paul Langevin, who was Einstein’s friend (and Madame Curie’s). Langevin, somewhat baffled, asked for another copy to send along to Einstein, who praised the work effusively. It had, Einstein said, “lifted a corner of the great veil.” As de Broglie proudly noted, “This made Langevin accept my work.” Einstein made his own contribution when he received in June of that year a paper in English from a young physicist from India named Satyendra Nath Bose. It derived Planck’s blackbody radiation law by treating radiation as if it were a cloud of gas and then applying a statistical method of analyzing it. But there was a twist: Bose said that any two photons that had the same energy state were absolutely indistinguishable, in theory as well as fact, and should not be treated separately in the statistical calculations.
Walter Isaacson (Einstein: His Life and Universe)
How I Got That Name Marilyn Chin an essay on assimilation I am Marilyn Mei Ling Chin Oh, how I love the resoluteness of that first person singular followed by that stalwart indicative of “be," without the uncertain i-n-g of “becoming.” Of course, the name had been changed somewhere between Angel Island and the sea, when my father the paperson in the late 1950s obsessed with a bombshell blond transliterated “Mei Ling” to “Marilyn.” And nobody dared question his initial impulse—for we all know lust drove men to greatness, not goodness, not decency. And there I was, a wayward pink baby, named after some tragic white woman swollen with gin and Nembutal. My mother couldn’t pronounce the “r.” She dubbed me “Numba one female offshoot” for brevity: henceforth, she will live and die in sublime ignorance, flanked by loving children and the “kitchen deity.” While my father dithers, a tomcat in Hong Kong trash— a gambler, a petty thug, who bought a chain of chopsuey joints in Piss River, Oregon, with bootlegged Gucci cash. Nobody dared question his integrity given his nice, devout daughters and his bright, industrious sons as if filial piety were the standard by which all earthly men are measured. * Oh, how trustworthy our daughters, how thrifty our sons! How we’ve managed to fool the experts in education, statistic and demography— We’re not very creative but not adverse to rote-learning. Indeed, they can use us. But the “Model Minority” is a tease. We know you are watching now, so we refuse to give you any! Oh, bamboo shoots, bamboo shoots! The further west we go, we’ll hit east; the deeper down we dig, we’ll find China. History has turned its stomach on a black polluted beach— where life doesn’t hinge on that red, red wheelbarrow, but whether or not our new lover in the final episode of “Santa Barbara” will lean over a scented candle and call us a “bitch.” Oh God, where have we gone wrong? We have no inner resources! * Then, one redolent spring morning the Great Patriarch Chin peered down from his kiosk in heaven and saw that his descendants were ugly. One had a squarish head and a nose without a bridge Another’s profile—long and knobbed as a gourd. A third, the sad, brutish one may never, never marry. And I, his least favorite— “not quite boiled, not quite cooked," a plump pomfret simmering in my juices— too listless to fight for my people’s destiny. “To kill without resistance is not slaughter” says the proverb. So, I wait for imminent death. The fact that this death is also metaphorical is testament to my lethargy. * So here lies Marilyn Mei Ling Chin, married once, twice to so-and-so, a Lee and a Wong, granddaughter of Jack “the patriarch” and the brooding Suilin Fong, daughter of the virtuous Yuet Kuen Wong and G.G. Chin the infamous, sister of a dozen, cousin of a million, survived by everbody and forgotten by all. She was neither black nor white, neither cherished nor vanquished, just another squatter in her own bamboo grove minding her poetry— when one day heaven was unmerciful, and a chasm opened where she stood. Like the jowls of a mighty white whale, or the jaws of a metaphysical Godzilla, it swallowed her whole. She did not flinch nor writhe, nor fret about the afterlife, but stayed! Solid as wood, happily a little gnawed, tattered, mesmerized by all that was lavished upon her and all that was taken away!
Marilyn Chin
Though Hoover conceded that some might deem him a “fanatic,” he reacted with fury to any violations of the rules. In the spring of 1925, when White was still based in Houston, Hoover expressed outrage to him that several agents in the San Francisco field office were drinking liquor. He immediately fired these agents and ordered White—who, unlike his brother Doc and many of the other Cowboys, wasn’t much of a drinker—to inform all of his personnel that they would meet a similar fate if caught using intoxicants. He told White, “I believe that when a man becomes a part of the forces of this Bureau he must so conduct himself as to remove the slightest possibility of causing criticism or attack upon the Bureau.” The new policies, which were collected into a thick manual, the bible of Hoover’s bureau, went beyond codes of conduct. They dictated how agents gathered and processed information. In the past, agents had filed reports by phone or telegram, or by briefing a superior in person. As a result, critical information, including entire case files, was often lost. Before joining the Justice Department, Hoover had been a clerk at the Library of Congress—“ I’m sure he would be the Chief Librarian if he’d stayed with us,” a co-worker said—and Hoover had mastered how to classify reams of data using its Dewey decimal–like system. Hoover adopted a similar model, with its classifications and numbered subdivisions, to organize the bureau’s Central Files and General Indices. (Hoover’s “Personal File,” which included information that could be used to blackmail politicians, would be stored separately, in his secretary’s office.) Agents were now expected to standardize the way they filed their case reports, on single sheets of paper. This cut down not only on paperwork—another statistical measurement of efficiency—but also on the time it took for a prosecutor to assess whether a case should be pursued.
David Grann (Killers of the Flower Moon: The Osage Murders and the Birth of the FBI)
Was this luck, or was it more than that? Proving skill is difficult in venture investing because, as we have seen, it hinges on subjective judgment calls rather than objective or quantifiable metrics. If a distressed-debt hedge fund hires analysts and lawyers to scrutinize a bankrupt firm, it can learn precisely which bond is backed by which piece of collateral, and it can foresee how the bankruptcy judge is likely to rule; its profits are not lucky. Likewise, if an algorithmic hedge fund hires astrophysicists to look for patterns in markets, it may discover statistical signals that are reliably profitable. But when Perkins backed Tandem and Genentech, or when Valentine backed Atari, they could not muster the same certainty. They were investing in human founders with human combinations of brilliance and weakness. They were dealing with products and manufacturing processes that were untested and complex; they faced competitors whose behaviors could not be forecast; they were investing over long horizons. In consequence, quantifiable risks were multiplied by unquantifiable uncertainties; there were known unknowns and unknown unknowns; the bracing unpredictability of life could not be masked by neat financial models. Of course, in this environment, luck played its part. Kleiner Perkins lost money on six of the fourteen investments in its first fund. Its methods were not as fail-safe as Tandem’s computers. But Perkins and Valentine were not merely lucky. Just as Arthur Rock embraced methods and attitudes that put him ahead of ARD and the Small Business Investment Companies in the 1960s, so the leading figures of the 1970s had an edge over their competitors. Perkins and Valentine had been managers at leading Valley companies; they knew how to be hands-on; and their contributions to the success of their portfolio companies were obvious. It was Perkins who brought in the early consultants to eliminate the white-hot risks at Tandem, and Perkins who pressed Swanson to contract Genentech’s research out to existing laboratories. Similarly, it was Valentine who drove Atari to focus on Home Pong and to ally itself with Sears, and Valentine who arranged for Warner Communications to buy the company. Early risk elimination plus stage-by-stage financing worked wonders for all three companies. Skeptical observers have sometimes asked whether venture capitalists create innovation or whether they merely show up for it. In the case of Don Valentine and Tom Perkins, there was not much passive showing up. By force of character and intellect, they stamped their will on their portfolio companies.
Sebastian Mallaby (The Power Law: Venture Capital and the Making of the New Future)
The first thing to note about Korean industrial structure is the sheer concentration of Korean industry. Like other Asian economies, there are two levels of organization: individual firms and larger network organizations that unite disparate corporate entities. The Korean network organization is known as the chaebol, represented by the same two Chinese characters as the Japanese zaibatsu and patterned deliberately on the Japanese model. The size of individual Korean companies is not large by international standards. As of the mid-1980s, the Hyundai Motor Company, Korea’s largest automobile manufacturer, was only a thirtieth the size of General Motors, and the Samsung Electric Company was only a tenth the size of Japan’s Hitachi. However, these statistics understate their true economic clout because these businesses are linked to one another in very large network organizations. Virtually the whole of the large-business sector in Korea is part of a chaebol network: in 1988, forty-three chaebol (defined as conglomerates with assets in excess of 400 billion won, or US$500 million) brought together some 672 companies. If we measure industrial concentration by chaebol rather than individual firm, the figures are staggering: in 1984, the three largest chaebol alone (Samsung, Hyundai, and Lucky-Goldstar) produced 36 percent of Korea’s gross domestic product. Korean industry is more concentrated than that of Japan, particularly in the manufacturing sector; the three-firm concentration ratio for Korea in 1980 was 62.0 percent of all manufactured goods, compared to 56.3 percent for Japan. The degree of concentration of Korean industry grew throughout the postwar period, moreover, as the rate of chaebol growth substantially exceeded the rate of growth for the economy as a whole. For example, the twenty largest chaebol produced 21.8 percent of Korean gross domestic product in 1973, 28.9 percent in 1975, and 33.2 percent in 1978. The Japanese influence on Korean business organization has been enormous. Korea was an almost wholly agricultural society at the beginning of Japan’s colonial occupation in 1910, and the latter was responsible for creating much of the country’s early industrial infrastructure. Nearly 700,000 Japanese lived in Korea in 1940, and a similarly large number of Koreans lived in Japan as forced laborers. Some of the early Korean businesses got their start as colonial enterprises in the period of Japanese occupation. A good part of the two countries’ émigré populations were repatriated after the war, leading to a considerable exchange of knowledge and experience of business practices. The highly state-centered development strategies of President Park Chung Hee and others like him were formed as a result of his observation of Japanese industrial policy in Korea in the prewar period.
Francis Fukuyama (Trust: The Social Virtues and the Creation of Prosperity)
the idea of reverse inference is really not very different from the concept of decoding that was seen in the work of Jim Haxby, and you would be correct: in each case we are using neuroimaging data to try to infer the mental state of an individual. The main difference is that the reverse inference that I ridiculed from the New York Times was based not on a formal statistical model but rather on the researcher’s own judgment. However, it is possible to develop statistical models that can let us quantify exactly how well we can decode what a person is thinking about from fMRI data,
Russell A. Poldrack (The New Mind Readers: What Neuroimaging Can and Cannot Reveal about Our Thoughts)
Econometrics is the application of classical statistical methods to economic and financial series. The essential tool of econometrics is multivariate linear regression, an 18th-century technology that was already mastered by Gauss before 1794. Standard econometric models do not learn. It is hard to believe that something as complex as 21st-century finance could be grasped by something as simple as inverting a covariance matrix.
Marcos López de Prado (Advances in Financial Machine Learning)
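The "inverting a covariance matrix" remark refers to the closed-form least-squares solution beta = (X'X)^(-1) X'y, where X'X is closely related to the regressors' covariance matrix. A minimal sketch on synthetic data (the coefficients, noise level, and sample size are invented for illustration):

```python
# Sketch: multivariate linear regression solved by matrix inversion,
# the closed-form OLS estimator beta = (X'X)^-1 X'y, on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
true_beta = np.array([1.0, 2.0, -0.5])                      # assumed true coefficients
y = X @ true_beta + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y                 # the matrix inversion in question
print(beta_hat)                                             # close to [1.0, 2.0, -0.5]
```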
No scientist is as model minded as is the statistician; in no other branch of science is the word model as often and consciously used as in statistics.
Hans Freudenthal
There is an unhelpful tendency to regard superspreaders – and events where superspreading has occurred – as anomalies out of the ordinary. This contributes relatively little to our understanding of infectious dynamics and is bound to exacerbate the stigmatisation of individuals, as it has e.g. during the early years of AIDS, when much sensationalistic and unjustified blame was laid at the feet of early HIV patient Gaetan Dugas (on which see McKay, 2014). Rather, superspreading is one 'tail' of a distribution prominent mainly because it is noticeable – statistical models predict that there are generally an equal number of 'greatly inferior spreaders' who are particularly ineffective in spreading the illness.
Chris von Csefalvay (Computational Modeling of Infectious Disease)
Most models are statistical outliers;
Julie Holland (Moody Bitches: The Truth about the Drugs You’re Taking, the Sleep You’re Missing, the Sex You’re Not Having and What’s Really Making You Crazy...)
In statistics, correlation is a quantitative assessment that measures both the direction and the strength of this tendency to vary together.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
Pearson’s correlation takes all of the data points on this graph and represents them with a single summary statistic. In this case, the statistical output below indicates that the correlation is 0.705.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
Pearson’s correlation coefficient is represented by the Greek letter rho (ρ) for the population parameter and r for a sample statistic. This coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
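As a concrete illustration of this definition, Pearson's r can be computed as the covariance of the two variables divided by the product of their standard deviations. A minimal sketch using made-up height and weight values (not the data from Frost's book):

```python
# Sketch: compute Pearson's r as covariance / (sd_x * sd_y) and cross-check
# against numpy's built-in corrcoef. The data below are made up for illustration.
import numpy as np

height = np.array([54.0, 56.5, 58.0, 59.5, 61.0, 62.5, 64.0])      # inches
weight = np.array([85.0, 95.0, 92.0, 110.0, 112.0, 125.0, 122.0])  # pounds

r_manual = np.cov(height, weight)[0, 1] / (height.std(ddof=1) * weight.std(ddof=1))
r_numpy = np.corrcoef(height, weight)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))   # identical values, always in [-1, +1]
```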
Pearson’s correlation coefficient is unaffected by scaling issues. Consequently, a statistical assessment is better for determining the precise strength of the relationship.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
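This invariance is easy to demonstrate: rescaling either variable by a constant (say, converting inches to centimetres and pounds to kilograms) leaves r unchanged. A minimal sketch with made-up numbers:

```python
# Sketch: Pearson's r is invariant to linear rescaling of either variable.
import numpy as np

x = np.array([54.0, 56.5, 58.0, 59.5, 61.0, 62.5, 64.0])      # heights in inches
y = np.array([85.0, 95.0, 92.0, 110.0, 112.0, 125.0, 122.0])  # weights in pounds

r_original = np.corrcoef(x, y)[0, 1]
r_rescaled = np.corrcoef(x * 2.54, y * 0.4536)[0, 1]          # cm and kg instead

print(np.isclose(r_original, r_rescaled))                     # True
```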
What is a good correlation? How high should it be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak. However, there is only one correct answer. The correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.705. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
R-squared is a primary measure of how well a regression model fits the data. This statistic represents the percentage of variation in one variable that other variables explain. For a pair of variables, R-squared is simply the square of the Pearson’s correlation coefficient. For example, squaring the height-weight correlation coefficient of 0.705 produces an R-squared of 0.497, or 49.7%. In other words, height explains about half the variability of weight in preteen girls.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
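The relationship is easy to verify numerically: square the correlation coefficient and compare it with the R-squared of a simple linear regression. A minimal sketch assuming SciPy, reusing the same made-up height–weight values (so it will not reproduce the book's 0.705 / 49.7% figures, which come from its own dataset):

```python
# Sketch: for a single predictor, R-squared equals the square of Pearson's r.
# Illustrative made-up data, not the book's preteen-girls dataset.
import numpy as np
from scipy import stats

height = np.array([54.0, 56.5, 58.0, 59.5, 61.0, 62.5, 64.0])
weight = np.array([85.0, 95.0, 92.0, 110.0, 112.0, 125.0, 122.0])

r = np.corrcoef(height, weight)[0, 1]
fit = stats.linregress(height, weight)            # simple linear regression

print(round(r**2, 3), round(fit.rvalue**2, 3))    # the two values match
```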
P-values and coefficients are the key regression output. Collectively, these statistics indicate whether the variables are statistically significant and describe the relationships between the independent variables and the dependent variable. Low p-values (typically < 0.05) indicate that the independent variable is statistically significant. Regression analysis is a form of inferential statistics. Consequently, the p-values help determine whether the relationships that you observe in your sample also exist in the larger population. The coefficients for the independent variables represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ (4.796) indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, the education coefficient (24.215) indicates that an additional year of education increases average earnings by $24.22 while holding the other variables constant.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
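A sketch of how output like this is produced in practice, using statsmodels on synthetic data whose variables mirror the income, IQ, and education example; the generated coefficients and p-values are illustrative, not the $4.80 and $24.22 figures quoted from the book:

```python
# Sketch: multiple regression with coefficients and p-values via statsmodels OLS.
# Synthetic data that mirrors the income ~ IQ + education example; all numbers
# produced here are illustrative, not those reported in the book.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
iq = rng.normal(100, 15, n)
education = rng.normal(14, 2, n)                            # years of schooling
income = 5 * iq + 25 * education + rng.normal(0, 100, n)    # assumed relationship + noise

X = sm.add_constant(np.column_stack([iq, education]))
model = sm.OLS(income, X).fit()

print(model.params)    # intercept, IQ coefficient, education coefficient
print(model.pvalues)   # low p-values -> statistically significant predictors
```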
For a good model, the residuals should be relatively small and unbiased. In statistics, bias indicates that estimates are systematically too high or too low. Unbiased estimates are correct on average.
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
Additionally, if you take RSS / TSS, you’ll obtain the percentage of the variability of the dependent variable around its mean that your model explains. This statistic is R-squared!
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
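This decomposition can be checked numerically. One naming caveat: under the common convention where RSS means the residual sum of squares, R-squared equals 1 − RSS/TSS; the ratio in the quote holds when RSS denotes the regression (explained) sum of squares. A minimal sketch on synthetic data showing that both routes give the same number:

```python
# Sketch: R-squared from sums of squares on synthetic data.
# explained SS / total SS == 1 - residual SS / total SS == squared correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=1.5, size=100)     # assumed linear relationship + noise

fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x

tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)           # explained (regression) sum of squares
rss = np.sum((y - y_hat) ** 2)                  # residual sum of squares

print(round(ess / tss, 4), round(1 - rss / tss, 4), round(fit.rvalue**2, 4))  # all equal
```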
In anthropology the debate about whether humanity consists of three races or five wasted incredible time and energy earlier in this century, and the Eugenicists continue to waste energy on the question of the superiority or inferiority of one race over or under all others. In E-Prime, we can only ask, “What heuristic advantages do we obtain from a three-race model? A five-race model? What heuristic advantages might we find in Buckminster Fuller’s one-human-race model? What kind of evidence indicates statistical superiority, and in what areas? What kind of evidence indicates that those inferior in one area score as superior in other areas? Do we have any tests yet that approach these questions without any cultural bias?
Robert Anton Wilson (Cosmic Trigger III: My Life After Death)
lesson is clear: once we enter a specific range of strenuous exercise, the body kicks in to lose fat no matter what our genes want. Bouchard’s groundbreaking work tracing genetic aspects of fat in the 1980s and 1990s depended on observations of physical traits within families; statistical modeling to account for variables such as gender, age, energy intake and expenditure; and human experimentation confined to pairs of identical twins. But new technological advances are now allowing for more specific investigation of
Sylvia Tara (The Secret Life of Fat: The Groundbreaking Science On Why Weight Loss Is So Difficult)
The credit card companies are at the forefront of this kind of analysis, both because they are privy to so much data on our spending habits and because their business model depends so heavily on finding customers who are just barely a good credit risk.
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
It was the income-determination model, based on the multiplier, together with the consequent development of national income statistics, which made Keynesian economics acceptable to policy-makers, since it offered them a seemingly secure method of forecasting and controlling the movement of such ‘real’ variables as investment, consumption, and employment.
Robert Skidelsky (Keynes: A Very Short Introduction (Very Short Introductions))
The AI brain model is derived from the quad abstract golden ratio, sΦrt, trigonometry, algebra, geometry, statistics, and built by adding aspects and/or characteristics from the diablo videogame. The 1111>11>1 was then abstracted from the ground up in knowing useful terminology in coding, knowledge management, and an ancient romantic dungeon crawler hack and slash game with both male and female classes and Items. I found the runes and certain items in the game to be very useful in this derivation, and I had an Ice orb from an Oculus of a blast in time doing it through my continued studies on decimal to hexadecimal to binary conversions and/or bit shifts and rotations from little to big endian. I chose to derive from diablo for two major reasons. The names or references to the class's abilities with unique, set, and rare items were out of this world, and I sort of found it hard to believe that they had the time and money to build it from in USA companies. Finally, I realized my objective was complete that I created the perfect AI brain with Cognitive, Affective, and Psychomotor skills...So this is It? I'm thinking wow!
Jonathan Roy Mckinney Gero EagleO2
Once we account for Christian nationalism in our statistical models, white Americans who attend church more often, pray more often and consider religion more important are less likely to prioritize the economy or liberty over the vulnerable.
Philip S. Gorski (The Flag and the Cross: White Christian Nationalism and the Threat to American Democracy)
In 1997, money manager David Leinweber wondered which statistics would have best predicted the performance of the U.S. stock market from 1981 through 1993. He sifted through thousands of publicly available numbers until he found one that had forecast U.S. stock returns with 75% accuracy: the total volume of butter produced each year in Bangladesh. Leinweber was able to improve the accuracy of his forecasting “model” by adding a couple of other variables, including the number of sheep in the United States. Abracadabra! He could now predict past stock returns with 99% accuracy. Leinweber meant his exercise as satire, but his point was serious: Financial marketers have such an immense volume of data to slice and dice that they can “prove” anything.
Jason Zweig (Your Money and Your Brain)
Excellence in Statistics: Rigor Statisticians are specialists in coming to conclusions beyond your data safely—they are your best protection against fooling yourself in an uncertain world. To them, inferring something sloppily is a greater sin than leaving your mind a blank slate, so expect a good statistician to put the brakes on your exuberance. They care deeply about whether the methods applied are right for the problem and they agonize over which inferences are valid from the information at hand. The result? A perspective that helps leaders make important decisions in a risk-controlled manner. In other words, they use data to minimize the chance that you’ll come to an unwise conclusion. Excellence in Machine Learning: Performance You might be an applied machine-learning/AI engineer if your response to “I bet you couldn’t build a model that passes testing at 99.99999% accuracy” is “Watch me.” With the coding chops to build both prototypes and production systems that work and the stubborn resilience to fail every hour for several years if that’s what it takes, machine-learning specialists know that they won’t find the perfect solution in a textbook. Instead, they’ll be engaged in a marathon of trial and error. Having great intuition for how long it’ll take them to try each new option is a huge plus and is more valuable than an intimate knowledge of how the algorithms work (though it’s nice to have both). Performance means more than clearing a metric—it also means reliable, scalable, and easy-to-maintain models that perform well in production. Engineering excellence is a must. The result? A system that automates a tricky task well enough to pass your statistician’s strict testing bar and deliver the audacious performance a business leader demands. Wide Versus Deep What the previous two roles have in common is that they both provide high-effort solutions to specific problems. If the problems they tackle aren’t worth solving, you end up wasting their time and your money. A frequent lament among business leaders is, “Our data science group is useless.” And the problem usually lies in an absence of analytics expertise. Statisticians and machine-learning engineers are narrow-and-deep workers—the shape of a rabbit hole, incidentally—so it’s really important to point them at problems that deserve the effort. If your experts are carefully solving the wrong problems, your investment in data science will suffer low returns. To ensure that you can make good use of narrow-and-deep experts, you either need to be sure you already have the right problem or you need a wide-and-shallow approach to finding one.
Harvard Business Review (Strategic Analytics: The Insights You Need from Harvard Business Review (HBR Insights Series))
Everything we think and know about the world is a model. Every word and every language is a model. All maps and statistics, books and databases, equations and computer programs are models. None of these is or ever will be the real world.
Donella H. Meadows (Thinking In Systems: A Primer)
More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills — skills that are also necessary for understanding biases in the data, and for debugging logging output from code. Once she gets the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense. She’ll find patterns, build models, and algorithms — some with the intention of understanding product usage and the overall health of the product, and others to serve as prototypes that ultimately get baked back into the product. She may design experiments, and she is a critical part of data-driven decision making. She’ll communicate with team members, engineers, and leadership in clear language and with data visualizations so that even if her colleagues are not immersed in the data themselves, they will understand the implications.
Rachel Schutt (Doing Data Science: Straight Talk from the Frontline)
Can statistical models be used to make decisions? What is the meaning of probability when applied to real life? Do people really understand probability? Is probability really necessary? What will happen next?
David Salsburg
variable. In social science, this is called a nomothetic mode of explanation—the isolation of the most important factors. This approach is consistent with the philosophy of seeking complete but parsimonious explanations in science.1 The second part involves addressing those variables that were not considered as being of most relevance. Regarding the first part, the specification of the “most important” independent variables is a judicious undertaking. The use of a nomothetic strategy implies that a range of plausible models exists—different analysts may identify different sets of “most important” independent variables. Analysts should ask which different factors are most likely to affect or cause their dependent variable, and they are likely to justify, identify, and operationalize their choices differently. Thus, the term full model specification does not imply that only one model or even a best model exists, but rather it refers to a family of plausible models. Most researchers agree that specification should (1) be driven by theory, that is, by persuasive arguments and perspectives that identify and justify which factors are most important, and (2) inform why the set of such variables is regarded as complete and parsimonious. In practice, the search for complete, parsimonious, and theory-driven explanations usually results in multiple regression models with about 5–12 independent variables; theory seldom results in less than 5 variables, and parsimony and problems of statistical estimation, discussed further, seldom result in models with more than 12. Key Point We cannot examine the effect of all possible variables. Rather, we focus on the most relevant ones. The search for parsimonious explanations often leads analysts to first identify different categories of factors that most affect their dependent variable. Then, after these categories of factors have been identified, analysts turn to the task of trying to measure each, through either single or index variables. As an example,
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
single or index variables. As an example, consider the dependent variable “high school violence,” discussed in Chapter 2. We ask: “What are the most important, distinct factors affecting or causing high school violence?” Some plausible factors are (1) student access to weapons, (2) student isolation from others, (3) peer groups that are prone to violence, (4) lack of enforcement of school nonviolence policies, (5) participation in anger management programs, and (6) familiarity with warning signals (among teachers and staff). Perhaps you can think of other factors. Then, following the strategies discussed in Chapter 3—conceptualization, operationalization, and index variable construction—we use either single variables or index measures as independent variables to measure each of these factors. This approach provides for the inclusion of programs or policies as independent variables, as well as variables that measure salient rival hypotheses. The strategy of full model specification requires that analysts not overlook important factors. Thus, analysts do well to carefully justify their model and to consult past studies and interview those who have direct experience with, or other opinions about, the research subject. Doing so might lead analysts to include additional variables, such as the socioeconomic status of students’ parents. Then, after a fully specified model has been identified, analysts often include additional variables of interest. These may be variables of lesser relevance, speculative consequences, or variables that analysts want to test for their lack of impact, such as rival hypotheses. Demographic variables, such as the age of students, might be added. When additional variables are included, analysts should identify which independent variables constitute the nomothetic explanation, and which serve some other purpose. Remember, all variables included in models must be theoretically justified. Analysts must argue how each variable could plausibly affect their dependent variable. The second part of “all of the variables that affect the dependent variable” acknowledges all of the other variables that are not identified (or included) in the model. They are omitted; these variables are not among “the most important factors” that affect the dependent variable. The cumulative effect of these other variables is, by definition, contained in the error term, described later in this chapter. The assumption of full model specification is that these other variables are justifiably omitted only when their cumulative effect on the dependent variable is zero. This approach is plausible because each of these many unknown variables may have a different magnitude, thus making it possible that their effects cancel each other out. The argument, quite clearly, is not that each of these other factors has no impact on the dependent variable—but only that their cumulative effect is zero. The validity of multiple regression models centers on examining the behavior of the error term in this regard. If the cumulative effect of all the other variables is not zero, then additional independent variables may have to be considered. The specification of the multiple regression model is as follows:
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
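The excerpt breaks off before showing the specification it refers to. In the book's own notation this is the standard multiple regression form (a reconstruction, since the equation itself is not reproduced above):

y = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + e

where y is the dependent variable, x_1 through x_k are the independent variables judged most important, and the error term e carries the cumulative effect of all omitted variables.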
Thus, multiple regression requires two important tasks: (1) specification of independent variables and (2) testing of the error term. An important difference between simple regression and multiple regression is the interpretation of the regression coefficients in multiple regression (b1, b2, b3, …) in the preceding multiple regression model. Although multiple regression produces the same basic statistics discussed in Chapter 14 (see Table 14.1), each of the regression coefficients is interpreted as its effect on the dependent variable, controlled for the effects of all of the other independent variables included in the regression. This phrase is used frequently when explaining multiple regression results. In our example, the regression coefficient b1 shows the effect of x1 on y, controlled for all other variables included in the model. Regression coefficient b2 shows the effect of x2 on y, also controlled for all other variables in the model, including x1. Multiple regression is indeed an important and relatively simple way of taking control variables into account (and much easier than the approach shown in Appendix 10.1). Key Point The regression coefficient is the effect on the dependent variable, controlled for all other independent variables in the model. Note also that the model given here is very different from estimating separate simple regression models for each of the independent variables. The regression coefficients in simple regression do not control for other independent variables, because they are not in the model. The word independent also means that each independent variable should be relatively unaffected by other independent variables in the model. To ensure that independent variables are indeed independent, it is useful to think of the distinctively different types (or categories) of factors that affect a dependent variable. This was the approach taken in the preceding example. There is also a statistical reason for ensuring that independent variables are as independent as possible. When two independent variables are highly correlated with each other (r2 > .60), it sometimes becomes statistically impossible to distinguish the effect of each independent variable on the dependent variable, controlled for the other. The variables are statistically too similar to discern disparate effects. This problem is called multicollinearity and is discussed later in this chapter. This problem is avoided by choosing independent variables that are not highly correlated with each other. A WORKING EXAMPLE Previously (see Chapter 14), the management analyst with the Department of Defense found a statistically significant relationship between teamwork and perceived facility productivity (p <.01). The analyst now wishes to examine whether the impact of teamwork on productivity is robust when controlled for other factors that also affect productivity. This interest is heightened by the low R-square (R2 = 0.074) in Table 14.1, suggesting a weak relationship between teamwork and perceived productivity. A multiple regression model is specified to include the effects of other factors that affect perceived productivity. 
Thinking about other categories of variables that could affect productivity, the analyst hypothesizes the following: (1) the extent to which employees have adequate technical knowledge to do their jobs, (2) perceptions of having adequate authority to do one’s job well (for example, decision-making flexibility), (3) perceptions that rewards and recognition are distributed fairly (always important for motivation), and (4) the number of sick days. Various items from the employee survey are used to measure these concepts (as discussed in the workbook documentation for the Productivity dataset). After including these factors as additional independent variables, the result shown in Table 15.1 is
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
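A hedged sketch of the kind of model described above, using statsmodels. The column names (teamwork, tech_knowledge, authority, fair_rewards, sick_days, productivity) are stand-ins for the survey items in the example, and the data are simulated, not the Department of Defense dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "teamwork":       rng.normal(5, 1, n),
    "tech_knowledge": rng.normal(5, 1, n),
    "authority":      rng.normal(5, 1, n),
    "fair_rewards":   rng.normal(5, 1, n),
    "sick_days":      rng.poisson(4, n),
})
# Simulated outcome: productivity depends on a couple of predictors plus noise.
df["productivity"] = (4.0 + 0.2 * df["teamwork"] + 0.3 * df["tech_knowledge"]
                      + rng.normal(0, 0.8, n))

# Each coefficient below is interpreted "controlled for" the other predictors.
model = smf.ols("productivity ~ teamwork + tech_knowledge + authority"
                " + fair_rewards + sick_days", data=df).fit()
print(model.summary())

# Rough multicollinearity screen: flag predictor pairs with r^2 > .60.
predictors = ["teamwork", "tech_knowledge", "authority", "fair_rewards", "sick_days"]
print((df[predictors].corr() ** 2).round(2))
```

The correlation table at the end is only a coarse screen; squared correlations above .60 would signal the multicollinearity problem the text warns about.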
violations of regression assumptions, and strategies for examining and remedying such assumptions. Then we extend the preceding discussion and will be able to conclude whether the above results are valid. Again, this model is not the only model that can be constructed but rather is one among a family of plausible models. Indeed, from a theoretical perspective, other variables might have been included, too. From an empirical perspective, perhaps other variables might explain more variance. Model specification is a judicious effort, requiring a balance between theoretical and statistical integrity. Statistical software programs can also automatically select independent variables based on their statistical significance, hence, adding to R-square.2 However, models with high R-square values are not necessarily better; theoretical reasons must exist for selecting independent variables, explaining why and how they might be related to the dependent variable. Knowing which variables are related empirically to the dependent variable can help narrow the selection, but such knowledge should not wholly determine it. We now turn to a discussion of the other statistics shown in Table 15.1. Getting Started: Find examples of multiple regression in the research literature. [Figure 15.1, Dependent Variable: Productivity, not shown.] Further Statistics: Goodness of Fit for Multiple Regression. The model R-square in Table 15.1 is greatly increased over that shown in Table 14.1: R-square has gone from 0.074 in the simple regression model to 0.274. However, R-square has the undesirable mathematical property of increasing with the number of independent variables in the model. R-square increases regardless of whether an additional independent variable adds further explanation of the dependent variable. The adjusted R-square controls for the number of independent variables; adjusted R-square is always equal to or less than R2. The above increase in explanation of the dependent variable is due to variables identified as statistically significant in Table 15.1. Key Point R-square is the variation in the dependent variable that is explained by all the independent variables. Adjusted R-square is often used to evaluate model explanation (or fit). Analogous with simple regression, values of adjusted R-square below 0.20 are considered to suggest weak model fit, those between 0.20 and 0.40 indicate moderate fit, those above 0.40 indicate strong fit, and those above 0.65 indicate very strong model fit. Analysts should remember that choices of model specification are driven foremost by theory, not statistical model fit; strong model fit is desirable only when the variables, and their relationships, are meaningful in some real-life sense. Adjusted R-square can assist in the variable selection process. Low values of adjusted R-square prompt analysts to ask whether they inadvertently excluded important variables from their models; if included, these variables might affect the statistical significance of those already in a model.3 Adjusted R-square also helps analysts to choose among alternative variable specifications (for example, different measures of student isolation), when such choices are no longer meaningfully informed by theory. Empirical issues of model fit then usefully guide the selection process further. Researchers typically report adjusted R-square with their
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
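The adjustment for the number of predictors mentioned above has a simple closed form. With n observations and k independent variables, the usual definition is:

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}

Adding a variable always raises R2, but it raises adjusted R2 only when the variable explains more than would be expected by chance; with R2 = 0.274, five predictors, and a few hundred observations, the two values would differ only slightly.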
SUMMARY A vast array of additional statistical methods exists. In this concluding chapter, we summarized some of these methods (path analysis, survival analysis, and factor analysis) and briefly mentioned other related techniques. This chapter can help managers and analysts become familiar with these additional techniques and increase their access to research literature in which these techniques are used. Managers and analysts who would like more information about these techniques will likely consult other texts or on-line sources. In many instances, managers will need only simple approaches to calculate the means of their variables, produce a few good graphs that tell the story, make simple forecasts, and test for significant differences among a few groups. Why, then, bother with these more advanced techniques? They are part of the analytical world in which managers operate. Through research and consulting, managers cannot help but come in contact with them. It is hoped that this chapter whets the appetite and provides a useful reference for managers and students alike. KEY TERMS   Endogenous variables Exogenous variables Factor analysis Indirect effects Loading Path analysis Recursive models Survival analysis Notes 1. Two types of feedback loops are illustrated as follows: [diagram not shown] 2. When feedback loops are present, error terms for the different models will be correlated with exogenous variables, violating an error term assumption for such models. Then, alternative estimation methodologies are necessary, such as two-stage least squares and others discussed later in this chapter. 3. Some models may show double-headed arrows among error terms. These show the correlation between error terms, which is of no importance in estimating the beta coefficients. 4. In SPSS, survival analysis is available through the add-on module in SPSS Advanced Models. 5. The functions used to estimate probabilities are rather complex. They are so-called Weibull distributions, which are defined as h(t) = αλ(λt)^(α–1), where α and λ are chosen to best fit the data. 6. Hence, the SSL is greater than the squared loadings reported. For example, because the loadings of variables in groups B and C are not shown for factor 1, the SSL of shown loadings is 3.27 rather than the reported 4.084. If one assumes the other loadings are each .25, then the SSL of the not reported loadings is [12*.25² =] .75, bringing the SSL of factor 1 to [3.27 + .75 =] 4.02, which is very close to the 4.084 value reported in the table. 7. Readers who are interested in multinomial logistic regression can consult on-line sources or the SPSS manual, Regression Models 10.0 or higher. The statistics of discriminant analysis are very dissimilar from those of logistic regression, and readers are advised to consult a separate text on that topic. Discriminant analysis is not often used in public
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
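A tiny Python sketch of the Weibull hazard named in note 5, h(t) = αλ(λt)^(α−1), evaluated with illustrative, made-up parameters (not values estimated from any dataset):

```python
import numpy as np

def weibull_hazard(t, alpha, lam):
    """Weibull hazard rate h(t) = alpha * lam * (lam * t) ** (alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

t = np.linspace(0.5, 10, 5)
print(weibull_hazard(t, alpha=1.5, lam=0.3))   # alpha > 1: risk rises over time
```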
Beyond One-Way ANOVA The approach described in the preceding section is called one-way ANOVA. This scenario is easily generalized to accommodate more than one independent variable. These independent variables are either discrete (called factors) or continuous (called covariates). These approaches are called n-way ANOVA or ANCOVA (the “C” indicates the presence of covariates). Two way ANOVA, for example, allows for testing of the effect of two different independent variables on the dependent variable, as well as the interaction of these two independent variables. An interaction effect between two variables describes the way that variables “work together” to have an effect on the dependent variable. This is perhaps best illustrated by an example. Suppose that an analyst wants to know whether the number of health care information workshops attended, as well as a person’s education, are associated with healthy lifestyle behaviors. Although we can surely theorize how attending health care information workshops and a person’s education can each affect an individual’s healthy lifestyle behaviors, it is also easy to see that the level of education can affect a person’s propensity for attending health care information workshops, as well. Hence, an interaction effect could also exist between these two independent variables (factors). The effects of each independent variable on the dependent variable are called main effects (as distinct from interaction effects). To continue the earlier example, suppose that in addition to population, an analyst also wants to consider a measure of the watershed’s preexisting condition, such as the number of plant and animal species at risk in the watershed. Two-way ANOVA produces the results shown in Table 13.4, using the transformed variable mentioned earlier. The first row, labeled “model,” refers to the combined effects of all main and interaction effects in the model on the dependent variable. This is the global F-test. The “model” row shows that the two main effects and the single interaction effect, when considered together, are significantly associated with changes in the dependent variable (p < .000). However, the results also show a reduced significance level of “population” (now, p = .064), which seems related to the interaction effect (p = .076). Although neither effect is significant at conventional levels, the results do suggest that an interaction effect is present between population and watershed condition (of which the number of at-risk species is an indicator) on watershed wetland loss. Post-hoc tests are only provided separately for each of the independent variables (factors), and the results show the same homogeneous grouping for both of the independent variables. Table 13.4 Two-Way ANOVA Results As we noted earlier, ANOVA is a family of statistical techniques that allow for a broad range of rather complex experimental designs. Complete coverage of these techniques is well beyond the scope of this book, but in general, many of these techniques aim to discern the effect of variables in the presence of other (control) variables. ANOVA is but one approach for addressing control variables. A far more common approach in public policy, economics, political science, and public administration (as well as in many others fields) is multiple regression (see Chapter 15). Many analysts feel that ANOVA and regression are largely equivalent. 
Historically, the preference for ANOVA stems from its uses in medical and agricultural research, with applications in education and psychology. Finally, the ANOVA approach can be generalized to allow for testing on two or more dependent variables. This approach is called multiple analysis of variance, or MANOVA. Regression-based analysis can also be used for dealing with multiple dependent variables, as mentioned in Chapter 17.
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
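A hedged sketch of a two-way ANOVA with an interaction term, using statsmodels. The factor names (workshops, education) and the simulated data are invented for illustration; they are not the book's health or watershed examples:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "workshops": rng.choice(["none", "some", "many"], n),
    "education": rng.choice(["high_school", "college"], n),
})
df["healthy_behaviors"] = 50 + rng.normal(0, 5, n)   # pure-noise outcome

# C(...) marks discrete factors; '*' expands to both main effects plus interaction.
model = smf.ols("healthy_behaviors ~ C(workshops) * C(education)", data=df).fit()
print(anova_lm(model, typ=2))   # rows: each main effect, the interaction, residual
```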
As an example, suppose we want to translate a text from English to French. The noisy channel model for translation assumes that the true text is in French, but that, unfortunately, when it was transmitted to us, it went through a noisy communication channel and came out as English. So the word cow we see in the text was really vache, garbled by the noisy channel to cow. All we need to do in order to translate is to recover the original French – or to decode the English to get the French.
Christopher Manning (Foundations of Statistical Natural Language Processing (The MIT Press))
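The decoding step Manning describes is usually written with Bayes' rule: pick the French sentence most probable given the observed English. Schematically,

\hat{f} = \arg\max_{f} P(f \mid e) = \arg\max_{f} P(e \mid f)\,P(f)

where P(f) is a language model over French and P(e | f) is the channel (translation) model that describes how French tends to get "garbled" into English.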
The world `out there' is an exceedingly complicated mass of sensations, events, and turmoil. With Thomas Kuhn, I do not believe that the human mind is capable of organizing a structure of ideas that can come even close to describing what is really out there. Any attempt to do so contains fundamental faults. Eventually, those faults will become so obvious that the scientific model must be continuously modified and eventually discarded in favor of a more subtle one. We can expect the statistical revolution will eventually run its course and be replaced by something else.
David Salsburg (The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century)
And the rest of us? We should grasp the basics of math and statistics-certainly better than most of us do today-but still follow what we love. The world doesn't need millions of mediocre mathematicians, and there's plenty of opportunity for specialists in other fields. Even in the heart of the math economy, at IBM Research, geometers and engineers work on teams with linguists and anthropologists and cognitive psychologists. They detail the behavior of humans to those who are trying to build mathematical models of it. All of these ventures, from Samer Takriti's gang at IBM to the secretive researchers laboring behind the barricades at the National Security Agency, feed from the knowledge and smarts of diverse groups. The key to finding a place on such world-class teams is not necessarily to become a math whiz but to become a whiz at something. And that something should be in an area that sparks the most enthusiasm and creativity within each of us. Somewhere on those teams, of course, whether it's in advertising, publishing, counterterrorism, or medical research, there will be at least a few Numerati. They'll be the ones distilling this knowledge into numbers and symbols and feeding them to their powerful tools.
Stephen Baker (The Numerati)
Simple Regression   CHAPTER OBJECTIVES After reading this chapter, you should be able to Use simple regression to test the statistical significance of a bivariate relationship involving one dependent and one independent variable Use Pearson's correlation coefficient as a measure of association between two continuous variables Interpret statistics associated with regression analysis Write up the model of simple regression Assess assumptions of simple regression This chapter completes our discussion of statistical techniques for studying relationships between two variables by focusing on those that are continuous. Several approaches are examined: simple regression; the Pearson's correlation coefficient; and a nonparametric alternative, Spearman's rank correlation coefficient. Although all three techniques can be used, we focus particularly on simple regression. Regression allows us to predict outcomes based on knowledge of an independent variable. It is also the foundation for studying relationships among three or more variables, including control variables mentioned in Chapter 2 on research design (and also in Appendix 10.1). Regression can also be used in time series analysis, discussed in Chapter 17. We begin with simple regression. SIMPLE REGRESSION Let's first look at an example. Say that you are a manager or analyst involved with a regional consortium of 15 local public agencies (in cities and counties) that provide low-income adults with health education about cardiovascular diseases, in an effort to reduce such diseases. The funding for this health education comes from a federal grant that requires annual analysis and performance outcome reporting. In Chapter 4, we used a logic model to specify that a performance outcome is the result of inputs, activities, and outputs. Following the development of such a model, you decide to conduct a survey among participants who attend such training events to collect data about the number of events they attended, their knowledge of cardiovascular disease, and a variety of habits such as smoking that are linked to cardiovascular disease. Some things that you might want to know are whether attending workshops increases
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
Table 14.1 also shows R-square (R2), which is called the coefficient of determination. R-square is of great interest: its value is interpreted as the percentage of variation in the dependent variable that is explained by the independent variable. R-square varies from zero to one, and is called a goodness-of-fit measure.5 In our example, teamwork explains only 7.4 percent of the variation in productivity. Although teamwork is significantly associated with productivity, it is quite likely that other factors also affect it. It is conceivable that other factors might be more strongly associated with productivity and that, when controlled for other factors, teamwork is no longer significant. Typically, values of R2 below 0.20 are considered to indicate weak relationships, those between 0.20 and 0.40 indicate moderate relationships, and those above 0.40 indicate strong relationships. Values of R2 above 0.65 are considered to indicate very strong relationships. R is called the multiple correlation coefficient and is always 0 ≤ R ≤ 1. To summarize up to this point, simple regression provides three critically important pieces of information about bivariate relationships involving two continuous variables: (1) the level of significance at which two variables are associated, if at all (t-statistic), (2) whether the relationship between the two variables is positive or negative (b), and (3) the strength of the relationship (R2). Key Point R-square is a measure of the strength of the relationship. Its value goes from 0 to 1. The primary purpose of regression analysis is hypothesis testing, not prediction. In our example, the regression model is used to test the hypothesis that teamwork is related to productivity. However, if the analyst wants to predict the variable "productivity," the regression output also shows the SEE, or the standard error of the estimate (see Table 14.1). This is a measure of the spread of y values around the regression line as calculated for the mean value of the independent variable, only, and assuming a large sample. The standard error of the estimate has an interpretation in terms of the normal curve, that is, 68 percent of y values lie within one standard error from the calculated value of y, as calculated for the mean value of x using the preceding regression model. Thus, if the mean index value of the variable "teamwork" is 5.0, then the calculated (or predicted) value of "productivity" is [4.026 + 0.223*5 =] 5.141. Because SEE = 0.825, it follows that 68 percent of productivity values will lie ±0.825 from 5.141 when "teamwork" = 5. Predictions of y for other values of x have larger standard errors.6 Assumptions and Notation There are three simple regression assumptions. First, simple regression assumes that the relationship between two variables is linear. The linearity of bivariate relationships is easily determined through visual inspection, as shown in Figure 14.2. In fact, all analysis of relationships involving continuous variables should begin with a scatterplot. When variable
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
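The prediction arithmetic in the passage can be checked in a few lines, using the quoted values a = 4.026, b = 0.223, and SEE = 0.825:

```python
a, b, see = 4.026, 0.223, 0.825   # intercept, slope, standard error of estimate
teamwork = 5.0                    # mean index value used in the example

y_hat = a + b * teamwork
print(f"Predicted productivity: {y_hat:.3f}")                       # 5.141
print(f"~68% of values between {y_hat - see:.3f} and {y_hat + see:.3f}")
```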
relationships are nonlinear (parabolic or otherwise heavily curved), it is not appropriate to use linear regression. Then, one or both variables must be transformed, as discussed in Chapter 12. Second, simple regression assumes that the linear relationship is constant over the range of observations. This assumption is violated when the relationship is “broken,” for example, by having an upward slope for the first half of independent variable values and a downward slope over the remaining values. Then, analysts should consider using two regression models each for these different, linear relationships. The linearity assumption is also violated when no relationship is present in part of the independent variable values. This is particularly problematic because regression analysis will calculate a regression slope based on all observations. In this case, analysts may be misled into believing that the linear pattern holds for all observations. Hence, regression results always should be verified through visual inspection. Third, simple regression assumes that the variables are continuous. In Chapter 15, we will see that regression can also be used for nominal and dichotomous independent variables. The dependent variable, however, must be continuous. When the dependent variable is dichotomous, logistic regression should be used (Chapter 16). Figure 14.2 Three Examples of r The following notations are commonly used in regression analysis. The predicted value of y (defined, based on the regression model, as y = a + bx) is typically different from the observed value of y. The predicted value of the dependent variable y is sometimes indicated as ŷ (pronounced “y-hat”). Only when R2 = 1 are the observed and predicted values identical for each observation. The difference between y and ŷ is called the regression error or error term
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
(e). Hence the expressions are equivalent, as is y = ŷ + e. Certain assumptions about e are important, such as that it is normally distributed. When error term assumptions are violated, incorrect conclusions may be made about the statistical significance of relationships. This important issue is discussed in greater detail in Chapter 15 and, for time series data, in Chapter 17. Hence, the above is a pertinent but incomplete list of assumptions. Getting Started Conduct a simple regression, and practice writing up your results. PEARSON’S CORRELATION COEFFICIENT Pearson’s correlation coefficient, r, measures the association (significance, direction, and strength) between two continuous variables; it is a measure of association for two continuous variables. Also called the Pearson’s product-moment correlation coefficient, it does not assume a causal relationship, as does simple regression. The correlation coefficient indicates the extent to which the observations lie closely or loosely clustered around the regression line. The coefficient r ranges from –1 to +1. The sign indicates the direction of the relationship, which, in simple regression, is always the same as the slope coefficient. A “–1” indicates a perfect negative relationship, that is, that all observations lie exactly on a downward-sloping regression line; a “+1” indicates a perfect positive relationship, whereby all observations lie exactly on an upward-sloping regression line. Of course, such values are rarely obtained in practice because observations seldom lie exactly on a line. An r value of zero indicates that observations are so widely scattered that it is impossible to draw any well-fitting line. Figure 14.2 illustrates some values of r. Key Point Pearson’s correlation coefficient, r, ranges from –1 to +1. It is important to avoid confusion between Pearson’s correlation coefficient and the coefficient of determination. For the two-variable, simple regression model, r2 = R2, but whereas 0 ≤ R ≤ 1, r ranges from –1 to +1. Hence, the sign of r tells us whether a relationship is positive or negative, but the sign of R, in regression output tables such as Table 14.1, is always positive and cannot inform us about the direction of the relationship. In simple regression, the regression coefficient, b, informs us about the direction of the relationship. Statistical software programs usually show r rather than r2. Note also that the Pearson’s correlation coefficient can be used only to assess the association between two continuous variables, whereas regression can be extended to deal with more than two variables, as discussed in Chapter 15. Pearson’s correlation coefficient assumes that both variables are normally distributed. When Pearson’s correlation coefficients are calculated, a standard error of r can be determined, which then allows us to test the statistical significance of the bivariate correlation. For bivariate relationships, this is the same level of significance as shown for the slope of the regression coefficient. For the variables given earlier in this chapter, the value of r is .272 and the statistical significance of r is p ≤ .01. Use of the Pearson’s correlation coefficient assumes that the variables are normally distributed and that there are no significant departures from linearity.7 It is important not to confuse the correlation coefficient, r, with the regression coefficient, b. Comparing the measures r and b (the slope) sometimes causes confusion. 
The key point is that r does not indicate the regression slope but rather the extent to which observations lie close to it. A steep regression line (large b) can have observations scattered loosely or closely around it, as can a shallow (more horizontal) regression line. The purposes of these two statistics are very different.8 SPEARMAN’S RANK CORRELATION
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
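A small simulation of the distinction drawn above: the two samples below share the same underlying slope b, but the observations cluster around the line much more tightly in one than the other, so r differs sharply. (Synthetic data, for illustration only.)

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)

for noise in (0.5, 5.0):
    y = 2.0 * x + rng.normal(0, noise, x.size)   # same true slope, b = 2
    b_hat = np.polyfit(x, y, 1)[0]               # fitted slope
    r = np.corrcoef(x, y)[0, 1]                  # Pearson's correlation
    print(f"noise sd {noise:>3}:  b = {b_hat:.2f},  r = {r:.2f}")
```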
COEFFICIENT The nonparametric alternative, Spearman’s rank correlation coefficient (r, or “rho”), looks at correlation among the ranks of the data rather than among the values. The ranks of data are determined as shown in Table 14.2 (adapted from Table 11.8): Table 14.2 Ranks of Two Variables In Greater Depth … Box 14.1 Crime and Poverty An analyst wants to examine empirically the relationship between crime and income in cities across the United States. The CD that accompanies the workbook Exercising Essential Statistics includes a Community Indicators dataset with assorted indicators of conditions in 98 cities such as Akron, Ohio; Phoenix, Arizona; New Orleans, Louisiana; and Seattle, Washington. The measures include median household income, total population (both from the 2000 U.S. Census), and total violent crimes (FBI, Uniform Crime Reporting, 2004). In the sample, household income ranges from $26,309 (Newark, New Jersey) to $71,765 (San Jose, California), and the median household income is $42,316. Per-capita violent crime ranges from 0.15 percent (Glendale, California) to 2.04 percent (Las Vegas, Nevada), and the median violent crime rate per capita is 0.78 percent. There are four types of violent crimes: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. A measure of total violent crime per capita is calculated because larger cities are apt to have more crime. The analyst wants to examine whether income is associated with per-capita violent crime. The scatterplot of these two continuous variables shows that a negative relationship appears to be present: The Pearson’s correlation coefficient is –.532 (p < .01), and the Spearman’s correlation coefficient is –.552 (p < .01). The simple regression model shows R2 = .283. The regression model is as follows (t-test statistic in parentheses): The regression line is shown on the scatterplot. Interpreting these results, we see that the R-square value of .283 indicates a moderate relationship between these two variables. Clearly, some cities with modest median household incomes have a high crime rate. However, removing these cities does not greatly alter the findings. Also, an assumption of regression is that the error term is normally distributed, and further examination of the error shows that it is somewhat skewed. The techniques for examining the distribution of the error term are discussed in Chapter 15, but again, addressing this problem does not significantly alter the finding that the two variables are significantly related to each other, and that the relationship is of moderate strength. With this result in hand, further analysis shows, for example, by how much violent crime decreases for each increase in household income. For each increase of $10,000 in average household income, the violent crime rate drops 0.25 percent. For a city experiencing the median 0.78 percent crime rate, this would be a considerable improvement, indeed. Note also that the scatterplot shows considerable variation in the crime rate for cities at or below the median household income, in contrast to those well above it. Policy analysts may well wish to examine conditions that give rise to variation in crime rates among cities with lower incomes. 
Because Spearman’s rank correlation coefficient examines correlation among the ranks of variables, it can also be used with ordinal-level data.9 For the data in Table 14.2, Spearman’s rank correlation coefficient is .900 (p = .035).10 Spearman’s rho-squared coefficient has a “percent variation explained” interpretation, similar
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
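Spearman's coefficient is computed on ranks, so it works for ordinal data as well. A minimal SciPy example on a tiny made-up dataset (not the values from Table 14.2):

```python
from scipy import stats

x = [2, 5, 1, 4, 3]   # e.g., ordinal scores for one variable
y = [1, 4, 2, 5, 3]   # ordinal scores for another

rho, p_value = stats.spearmanr(x, y)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")
```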
Multiple Regression   CHAPTER OBJECTIVES After reading this chapter, you should be able to Understand multiple regression as a full model specification technique Interpret standardized and unstandardized regression coefficients of multiple regression Know how to use nominal variables in
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
regression as dummy variables Explain the importance of the error term plot Identify assumptions of regression, and know how to test and correct assumption violations Multiple regression is one of the most widely used multivariate statistical techniques for analyzing three or more variables. This chapter uses multiple regression to examine such relationships, and thereby extends the discussion in Chapter 14. The popularity of multiple regression is due largely to the ease with which it takes control variables (or rival hypotheses) into account. In Chapter 10, we discussed briefly how contingency tables can be used for this purpose, but doing so is often a cumbersome and sometimes inconclusive effort. By contrast, multiple regression easily incorporates multiple independent variables. Another reason for its popularity is that it also takes into account nominal independent variables. However, multiple regression is no substitute for bivariate analysis. Indeed, managers or analysts with an interest in a specific bivariate relationship will conduct a bivariate analysis first, before examining whether the relationship is robust in the presence of numerous control variables. And before conducting bivariate analysis, analysts need to conduct univariate analysis to better understand their variables. Thus, multiple regression is usually one of the last steps of analysis. Indeed, multiple regression is often used to test the robustness of bivariate relationships when control variables are taken into account. The flexibility with which multiple regression takes control variables into account comes at a price, though. Regression, like the t-test, is based on numerous assumptions. Regression results cannot be assumed to be robust in the face of assumption violations. Testing of assumptions is always part of multiple regression analysis. Multiple regression is carried out in the following sequence: (1) model specification (that is, identification of dependent and independent variables), (2) testing of regression assumptions, (3) correction of assumption violations, if any, and (4) reporting of the results of the final regression model. This chapter examines these four steps and discusses essential concepts related to simple and multiple regression. Chapters 16 and 17 extend this discussion by examining the use of logistic regression and time series analysis. MODEL SPECIFICATION Multiple regression is an extension of simple regression, but an important difference exists between the two methods: multiple regression aims for full model specification. This means that analysts seek to account for all of the variables that affect the dependent variable; by contrast, simple regression examines the effect of only one independent variable. Philosophically, the phrase identifying the key difference—“all of the variables that affect the dependent variable”—is divided into two parts. The first part involves identifying the variables that are of most (theoretical and practical) relevance in explaining the dependent
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
Housing prices had never before fallen as far and as fast as they did beginning in 2007. But that’s what happened. Former Federal Reserve chairman Alan Greenspan explained to a congressional committee after the fact, “The whole intellectual edifice, however, collapsed in the summer of [2007] because the data input into the risk management models generally covered only the past two decades, a period of euphoria. Had instead the models been fitted more appropriately to historic periods of stress, capital requirements would have been much higher and the financial world would be in far better shape, in my judgment.”3
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
There are also books that contain collections of papers or chapters on particular aspects of knowledge discovery—for example, Relational Data Mining edited by Dzeroski and Lavrac [De01]; Mining Graph Data edited by Cook and Holder [CH07]; Data Streams: Models and Algorithms edited by Aggarwal [Agg06]; Next Generation of Data Mining edited by Kargupta, Han, Yu, et al. [KHY+08]; Multimedia Data Mining: A Systematic Introduction to Concepts and Theory edited by Z. Zhang and R. Zhang [ZZ09]; Geographic Data Mining and Knowledge Discovery edited by Miller and Han [MH09]; and Link Mining: Models, Algorithms and Applications edited by Yu, Han, and Faloutsos [YHF10]. There are many tutorial notes on data mining in major databases, data mining, machine learning, statistics, and Web technology conferences.
Vipin Kumar (Introduction to Data Mining)
The social-personality approach to studying creativity focuses on personality and motivational variables as well as the socio-cultural environment as sources of creativity. Sternberg (2000) states that numerous studies conducted at the societal level indicate that “eminent levels of creativity over large spans of time are statistically linked to variables such as cultural diversity, war, availability of role models, availability of financial support, and competitors in a domain” (p. 9).
Bharath Sriraman (The Characteristics of Mathematical Creativity)
This was fueled by a return to a cardinal tenet of the Protestant faith, Sola Scriptura, which argues that God’s Word alone is sufficient for faith and practice.[3] This principle makes the Bible the exclusive foundation for all that we do. It is rooted in the belief that man’s notions for how to live must be set aside for God’s clear directives as found in His inspired, written revelation, and that God’s people are to limit themselves to obedience to His revealed will.[4] I progressively realized that modern youth ministry had largely developed from traditions, cultural preferences, statistical surveys, and the opinions of creative leaders, rather than biblical principles. If All I Had Was Scripture It finally occurred to me that if I began with Scripture alone, I would have no reason for age-segregated Christianity. In other words, if all I had was the Bible, it would be difficult (if not impossible) to establish the credibility of this practice. I was humbled to learn that God’s vision for training young people is powerful, profound, and comprehensive, standing in sharp contrast to the man-centered, culture-bound model I once advocated.
Scott T. Brown (A Weed in the Church)
One of the initial steps to understand or improve a process is to gather information about the important activities so that a ‘dynamic model
John S. Oakland (Statistical Process Control)
As a result, quantum processes can be regarded as being made up of many individual subprocesses taking place independently of one another. In Figure 43 the total process of the wave function 'swinging round' and becoming entangled is represented symbolically by the arrows as six individual subprocesses (or branches, to use Everett's terminology). In all of them, the pointer starts in the same position but ends in a different position. Everett makes the key assumption that conscious awareness is always associated with the branches, not the process as a whole. Each subprocess is, so to speak, aware only of itself. There is a beautiful logic to this, since each subprocess is fully described by the quantum laws. There is nothing within the branch as such to indicate that it alone does not constitute the entire history of the universe. It carries on in blithe ignorance of the other branches, which are 'parallel worlds' of which it sees nothing. The branches can nevertheless be very complicated. An impressive part of Everett's paper demonstrates how an observer (modelled by an inanimate computer) within one such branch could well have the experience of being all alone in such a multiworld, doing quantum experiments and finding that the quantum statistical predictions are verified.
Julian Barbour (The End of Time: The Next Revolution in Our Understanding of the Universe)
statistician William Sanders in Tennessee, who began his career advising agricultural and manufacturing industries. Sanders claimed that his statistical modeling could determine how much “value” a teacher added to her students’ testing performance.
Diane Ravitch (Reign of Error: The Hoax of the Privatization Movement and the Danger to America's Public Schools)
This is a very important distinction between weather and climate models: for climate forecasts, the initial conditions in the atmosphere are not as important as the external forcings that have the ability to alter the character and types of weather (i.e., the statistics or what scientists would call the “distribution” of the weather) that make up the climate.
Heidi Cullen (The Weather of the Future: Heat Waves, Extreme Storms, and Other Scenes from a Climate-Changed Planet)
Today we aren’t quite to the place that H. G. Wells predicted years ago, but society is getting closer out of necessity. Global businesses and organizations are being forced to use statistical analysis and data mining applications in a format that combines art and science–intuition and expertise in collecting and understanding data in order to make accurate models that realistically predict the future that lead to informed strategic decisions thus allowing correct actions ensuring success, before it is too late . . . today, numeracy is as essential as literacy. As John Elder likes to say: ‘Go data mining!’ It really does save enormous time and money. For those
Anonymous
David Viniar, CFO of Goldman Sachs, claimed as the global financial crisis broke in August 2007 that his bank had experienced ‘25 standard deviation events’ several days in a row. But anyone with a knowledge of statistics (a group that must be presumed to include Viniar) knows that the occurrence of several ‘25 standard-deviation events’ within a short time is impossible. What he meant to say was that the company’s risk models failed to describe what had happened. Extreme observations are generally the product of ‘off-model’ events. If you toss a coin a hundred times and all the tosses are heads, you may have encountered a once in a lifetime statistical freak; but look first for a simpler explanation. For all their superficial sophistication, the masters of the universe had no real understanding of what was going on before them.
John Kay (Other People's Money: The Real Business of Finance)
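Kay's point can be made concrete: under a normal model, the probability of even a single 25 standard deviation loss is so small that observing several in a week means the model, not the world, has failed. A one-line check:

```python
from scipy import stats

p = stats.norm.sf(25)   # upper-tail probability beyond 25 standard deviations
print(p)                # ~3e-138: effectively impossible under the assumed model
```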
We need a proper statistical model that lets each person have his own momentum effect and each person have his own checkout attraction and to see if we can pull him out from the data.
Herb Sorensen (Inside the Mind of the Shopper: The Science of Retailing)
Machine learning tends to be more focused on developing efficient algorithms that scale to large data in order to optimize the predictive model. Statistics generally pays more attention to the probabilistic theory and underlying structure of the model.
Peter Bruce (Practical Statistics for Data Scientists: 50 Essential Concepts)