“
What nature hath joined together, multiple regression analysis cannot put asunder.
”
Richard E. Nisbett (Mindware: Tools for Smart Thinking)
“
As many critics of religion have pointed out, the notion of a creator poses an immediate problem of an infinite regress. If God created the universe, what created God? To say that God, by definition, is uncreated simply begs the question. Any being capable of creating a complex world promises to be very complex himself. As the biologist Richard Dawkins has observed repeatedly, the only natural process we know of that could produce a being capable of designing things is evolution.
”
Sam Harris (Letter to a Christian Nation)
“
Regression analysis is the hydrogen bomb of the statistics arsenal.
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
Here is one of the most important things to remember when doing research that involves regression analysis: Try not to kill anyone. You can even put a little Post-it note on your computer monitor: “Do not kill people with your research.” Because some very smart people have inadvertently violated that rule.
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
Be wary, though, of the way news media use the word “significant,” because to statisticians it doesn’t mean “noteworthy.” In statistics, the word “significant” means that the results passed mathematical tests such as t-tests, chi-square tests, regression, and principal components analysis (there are hundreds). Statistical significance tests quantify how easily pure chance can explain the results. With a very large number of observations, even small differences that are trivial in magnitude can be beyond what our models of chance and randomness can explain. These tests don’t know what’s noteworthy and what’s not—that’s a human judgment.
”
Daniel J. Levitin (A Field Guide to Lies: Critical Thinking in the Information Age)
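Levitin's point is easy to demonstrate numerically. A minimal Python sketch with simulated data (the group means, spread, and sample sizes are invented for illustration): with enough observations, a difference too small to matter still comes out "statistically significant."

```python
# Simulated example: a trivially small difference becomes "significant" at scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=100.0, scale=10, size=1_000_000)  # group means differ by
b = rng.normal(loc=100.1, scale=10, size=1_000_000)  # only 0.01 SD units

t, p = stats.ttest_ind(a, b)
print(f"mean difference: {b.mean() - a.mean():.3f}")  # trivial in magnitude
print(f"p-value: {p:.2e}")  # far below 0.05: "significant", but not noteworthy
```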
“
In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard analysis, such as multiple linear regression. In practice, though, A/B testing is widely used, because A/B tests are easy to deploy, easy to understand, and easy to explain to management.
”
Christopher D. Manning (Introduction to Information Retrieval)
“
This book claims that secularization has accelerated, but we do not view religion as the product of ignorance or the opium of the people. Quite the contrary, evolutionary modernization theory implies that anything that became as pervasive and survived as long as religion is probably conducive to individual or societal survival. One reason religion spread and endured was because it encouraged norms of sharing, which were crucial to survival in an environment where there was no social security system. In bad times, one’s survival might depend on how strongly these norms were inculcated in the people around you. Religion also helped control violence. Experimental studies have examined the impact of religiosity and church attendance on violence, controlling for the effects of sociodemographic variables. Logistic regression analysis indicated that religiosity (though not church
”
Ronald Inglehart (Religion's Sudden Decline: What's Causing it, and What Comes Next?)
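Inglehart's analysis itself is not reproduced here, but a logistic regression with sociodemographic controls generally has the following shape. Every variable name and all data in this sketch are invented; only the modeling structure echoes the quote, and it assumes the statsmodels package.

```python
# Hypothetical sketch: a binary violence outcome regressed on religiosity,
# controlling for sociodemographic variables. All data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "religiosity": rng.normal(0, 1, n),
    "attendance": rng.integers(0, 5, n),   # church attendance, 0-4 scale
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50, 15, n),
})
# Simulate a binary outcome loosely tied to religiosity.
true_logit = -1.0 - 0.4 * df["religiosity"]
df["violent"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

fit = smf.logit("violent ~ religiosity + attendance + age + income", data=df).fit()
print(fit.summary())  # a coefficient and p-value per predictor, controls included
```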
“
Thought Control
* Require members to internalize the group’s doctrine as truth
* Adopt the group’s “map of reality” as reality
* Instill black and white thinking
* Decide between good versus evil
* Organize people into us versus them (insiders versus outsiders)
* Change a person’s name and identity
* Use loaded language and clichés to constrict knowledge, stop critical thoughts, and reduce complexities into platitudinous buzzwords
* Encourage only “good and proper” thoughts
* Use hypnotic techniques to alter mental states, undermine critical thinking, and even to age-regress the member to childhood states
* Manipulate memories to create false ones
* Teach thought stopping techniques that shut down reality testing by stopping negative thoughts and allowing only positive thoughts. These techniques include:
* Denial, rationalization, justification, wishful thinking
* Chanting
* Meditating
* Praying
* Speaking in tongues
* Singing or humming
* Reject rational analysis, critical thinking, constructive criticism
* Forbid critical questions about leader, doctrine, or policy
* Label alternative belief systems as illegitimate, evil, or not useful
* Instill new “map of reality”
Emotional Control
* Manipulate and narrow the range of feelings—some emotions and/or needs are deemed as evil, wrong, or selfish
* Teach emotion stopping techniques to block feelings of hopelessness, anger, or doubt
* Make the person feel that problems are always their own fault, never the leader’s or the group’s fault
* Promote feelings of guilt or unworthiness, such as:
* Identity guilt
* You are not living up to your potential
* Your family is deficient
* Your past is suspect
* Your affiliations are unwise
* Your thoughts, feelings, actions are irrelevant or selfish
* Social guilt
* Historical guilt
* Instill fear, such as fear of:
* Thinking independently
* The outside world
* Enemies
* Losing one’s salvation
* Leaving
* Orchestrate emotional highs and lows through love bombing and by offering praise one moment, and then declaring a person is a horrible sinner
* Ritualistic and sometimes public confession of sins
* Phobia indoctrination: inculcate irrational fears about leaving the group or questioning the leader’s authority
* No happiness or fulfillment possible outside the group
* Terrible consequences if you leave: hell, demon possession, incurable diseases, accidents, suicide, insanity, 10,000 reincarnations, etc.
* Shun those who leave and inspire fear of being rejected by friends and family
* Never a legitimate reason to leave; those who leave are weak, undisciplined, unspiritual, worldly, brainwashed by family or counselor, or seduced by money, sex, or rock and roll
* Threaten harm to ex-member and family (threats of cutting off friends/family)
”
Steven Hassan
“
Fast-forward nearly a hundred years, and Prufrock’s protest is enshrined in high school syllabi, where it’s dutifully memorized, then quickly forgotten, by teens increasingly skilled at shaping their own online and offline personae. These students inhabit a world in which status, income, and self-esteem depend more than ever on the ability to meet the demands of the Culture of Personality. The pressure to entertain, to sell ourselves, and never to be visibly anxious keeps ratcheting up. The number of Americans who considered themselves shy increased from 40 percent in the 1970s to 50 percent in the 1990s, probably because we measured ourselves against ever higher standards of fearless self-presentation. “Social anxiety disorder”—which essentially means pathological shyness—is now thought to afflict nearly one in five of us. The most recent version of the Diagnostic and Statistical Manual (DSM-IV), the psychiatrist’s bible of mental disorders, considers the fear of public speaking to be a pathology—not an annoyance, not a disadvantage, but a disease—if it interferes with the sufferer’s job performance. “It’s not enough,” one senior manager at Eastman Kodak told the author Daniel Goleman, “to be able to sit at your computer excited about a fantastic regression analysis if you’re squeamish about presenting those results to an executive group.” (Apparently it’s OK to be squeamish about doing a regression analysis if you’re excited about giving speeches.)
”
Susan Cain (Quiet: The Power of Introverts in a World That Can't Stop Talking)
“
We have learned in the course of this investigation that the libido which builds up religious structures regresses in the last analysis to the mother, and thus represents the real bond through which we are connected with our origins. When the Church Fathers derive the word religio from religare (to reconnect, link back), they could at least have appealed to this psychological fact in support of their view.71 As we have seen, this regressive libido conceals itself in countless symbols of the most heterogeneous nature, some masculine and some feminine—differences of sex are at bottom secondary and not nearly so important psychologically as would appear at first sight. The essence and motive force of the sacrificial drama consist in an unconscious transformation of energy, of which the ego becomes aware in much the same way as sailors are made aware of a volcanic upheaval under the sea. Of course, when we consider the beauty and sublimity of the whole conception of sacrifice and its solemn ritual, it must be admitted that a psychological formulation has a shockingly sobering effect. The dramatic concreteness of the sacrificial act is reduced to a barren abstraction, and the flourishing life of the figures is flattened into two-dimensionality. Scientific understanding is bound, unfortunately, to have regrettable effects—on one side; on the other side abstraction makes for a deepened understanding of the phenomena in question. Thus we come to realize that the figures in the mythical drama possess qualities that are interchangeable, because they do not have the same “existential” meaning as the concrete figures of the physical world. The latter suffer tragedy, perhaps, in the real sense, whereas the others merely enact it against the subjective backcloth of introspective consciousness. The boldest speculations of the human mind concerning the nature of the phenomenal world, namely that the wheeling stars and the whole course of human history are but the phantasmagoria of a divine dream, become, when applied to the inner drama, a scientific probability. The essential thing in the mythical drama is not the concreteness of the figures, nor is it important what sort of an animal is sacrificed or what sort of god it represents; what alone is important is that an act of sacrifice takes place, that a process of transformation is going on in the unconscious whose dynamism, whose contents and whose subject are themselves unknown but become visible indirectly to the conscious mind by stimulating the imaginative material at its disposal, clothing themselves in it like the dancers who clothe themselves in the skins of animals or the priests in the skins of their human victims.
”
C.G. Jung (Collected Works of C. G. Jung, Volume 5: Symbols of Transformation (The Collected Works of C. G. Jung))
“
Every hour I spent doing that while at Oxford, I wasn’t studying applied econometrics. At the time, I was conflicted about whether I could really afford to take that time away from my studies, but I stuck with it. Had I instead spent that hour each day learning the latest techniques for mastering the problems of autocorrelation in regression analysis, I would have badly misspent my life. I apply the tools of econometrics a few times a year, but I apply my knowledge of the purpose of my life every day. This is the most valuable, useful piece of knowledge that I have ever gained.
”
Clayton M. Christensen (How Will You Measure Your Life?)
“
Wikipedia: Pygmalion effect
The Pygmalion effect, or Rosenthal effect, is a psychological phenomenon in which high expectations lead to improved performance in a given area. …
… According to the Pygmalion effect, the targets of the expectations internalize their positive labels, and those with positive labels succeed accordingly; a similar process works in the opposite direction in the case of low expectations. The idea behind the Pygmalion effect is that increasing the leader's expectation of the follower's performance will result in better follower performance.
… The educational psychologist Robert L. Thorndike described the poor quality of the Pygmalion study. The problem with the study was that the instrument used to assess the children's IQ scores was seriously flawed. The average reasoning IQ score for the children in one regular class was in the mentally disabled range, a highly unlikely outcome in a regular class in a garden variety school. In the end, Thorndike concluded that the Pygmalion findings were worthless. It is more likely that the rise in IQ scores from the mentally disabled range was the result of regression toward the mean, not teacher expectations. Moreover, a meta-analysis conducted by Raudenbush showed that when teachers had gotten to know their students for two weeks, the effect of a prior expectancy induction was reduced to virtually zero.
”
Wikipedia Contributors
“
The p-value for each independent variable tests the null hypothesis that the variable has no relationship with the dependent variable.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
In statistics, correlation is a quantitative assessment that measures both the direction and the strength of this tendency to vary together.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Pearson’s correlation takes all of the data points on this graph and represents them with a single summary statistic. In this case, the statistical output below indicates that the correlation is 0.705.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Pearson’s correlation coefficient is represented by the Greek letter rho (ρ) for the population parameter and r for a sample statistic. This coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
When the value is in-between 0 and +1/-1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a line.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
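These properties of r are easy to see numerically. A small sketch with simulated data (NumPy assumed): as the noise shrinks, the points hug the line and r moves toward 1.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
for noise_sd in (0.0, 0.5, 2.0):
    y = 2 * x + rng.normal(scale=noise_sd, size=500)
    r = np.corrcoef(x, y)[0, 1]
    print(f"noise sd {noise_sd}: r = {r:+.3f}")  # r = 1 only when every point is on the line
```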
“
Pearson’s correlation coefficient is unaffected by scaling issues. Consequently, a statistical assessment is better for determining the precise strength of the relationship.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Pearson’s correlation measures only linear relationships. Consequently, if your data contain a curvilinear relationship, the correlation coefficient will not detect it.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
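A worked illustration of that limitation, with an invented, perfectly curvilinear dataset: the relationship is exact, yet r comes out near zero.

```python
import numpy as np

x = np.linspace(-3, 3, 200)
y = x ** 2                     # a perfect, purely curvilinear relationship
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")          # ~0: Pearson's r misses the nonlinear pattern
```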
“
Correlations have a hypothesis test. As with any hypothesis test, this test takes sample data and evaluates two mutually exclusive statements about the population from which the sample was drawn. For Pearson correlations, the two hypotheses are the following: Null hypothesis: There is no linear relationship between the two variables. ρ = 0. Alternative hypothesis: There is a linear relationship between the two variables. ρ ≠ 0. A correlation of zero indicates that no linear relationship exists. If your p-value is less than your significance level, the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation does not equal zero. In other words, the sample data support the notion that the relationship exists in the population.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
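A sketch of that hypothesis test using SciPy (sample data simulated for illustration); pearsonr returns both the sample r and the p-value for testing ρ = 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
# p < 0.05 -> reject H0 (rho = 0): evidence of a linear relationship
# in the population the sample was drawn from.
```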
“
correlation does not mean that the changes in one variable actually cause the changes in the other variable.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
What is a good correlation? How high should it be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak. However, there is only one correct answer. The correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.705. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
R-squared is a primary measure of how well a regression model fits the data. This statistic represents the percentage of variation in one variable that other variables explain. For a pair of variables, R-squared is simply the square of the Pearson’s correlation coefficient. For example, squaring the height-weight correlation coefficient of 0.705 produces an R-squared of 0.497, or 49.7%. In other words, height explains about half the variability of weight in preteen girls.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
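The arithmetic in the quote checks out directly:

```python
r = 0.705                 # height-weight correlation from the example
r_squared = r ** 2
print(f"R-squared = {r_squared:.3f} ({r_squared:.1%})")  # 0.497, i.e. 49.7%
```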
“
The dependent variable is a variable that you want to explain or predict using the model. The values of this variable depend on other variables. It’s also known as the response variable or outcome variable, and it is commonly denoted using a Y. Traditionally, analysts graph dependent variables on the vertical, or Y, axis.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Independent variables are the variables that you include in the model to explain or predict changes in the dependent variable.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Regression analysis mathematically describes the relationships between independent variables and a dependent variable. Use regression for two primary goals: To understand the relationships between these variables. How do changes in the independent variables relate to changes in the dependent variable? To predict the dependent variable by entering values for the independent variables into the regression equation.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model!
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
P-values and coefficients are the key regression output. Collectively, these statistics indicate whether the variables are statistically significant and describe the relationships between the independent variables and the dependent variable. Low p-values (typically < 0.05) indicate that the independent variable is statistically significant. Regression analysis is a form of inferential statistics. Consequently, the p-values help determine whether the relationships that you observe in your sample also exist in the larger population. The coefficients for the independent variables represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ (4.796) indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, the education coefficient (24.215) indicates that an additional year of education increases average earnings by $24.22 while holding the other variables constant.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
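A sketch of the kind of model behind that output, assuming statsmodels; the data are simulated, with "true" effects invented to loosely echo the quoted coefficients, so this is not Frost's actual dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({"education": rng.integers(8, 21, n),
                   "iq": rng.normal(100, 15, n)})
df["income"] = 24 * df["education"] + 4.8 * df["iq"] + rng.normal(0, 100, n)

fit = smf.ols("income ~ education + iq", data=df).fit()
print(fit.params)   # each coefficient: average change in income per one-unit
print(fit.pvalues)  # change in that IV, holding the other IV constant
```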
“
ordinary least squares (OLS).
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Continuous variables can take on almost any numeric value and can be meaningfully divided into smaller increments, including fractional and decimal values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data. Categorical variables have values that you can put into a countable number of distinct groups based on a characteristic. Categorical variables are also called qualitative variables or attribute variables. For example, college major is a categorical variable that can have values such as psychology, political science, engineering, biology, etc.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Observed values of the dependent variable are the values of the dependent variable that you record during your study or experiment along with the values of the independent variables. These values are denoted using Y. Fitted values are the values that the model predicts for the dependent variable using the independent variables. If you input values for the independent variables into the regression equation, you obtain the fitted value. Predicted values and fitted values are synonyms.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
A residual is the distance between an observed value and the corresponding fitted value.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Graphically, residuals are the vertical distances between the observed values and the fitted values. On the graph, the line represents the fitted values from the regression model. We call this line . . . the fitted line! The lines that connect the data points to the fitted line represent the residuals. The length of the line is the value of the residual.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
For a good model, the residuals should be relatively small and unbiased. In statistics, bias indicates that estimates are systematically too high or too low. Unbiased estimates are correct on average.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
OLS regression squares those residuals so they’re always positive. In this manner, the process can add them up without canceling each other out.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
This process produces squared residuals, which statisticians call squared errors.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
OLS draws the line that minimizes the sum of squared errors (SSE). Hopefully, you’re gaining an appreciation for why the procedure is named ordinary least squares!
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
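The least-squares idea running through these last few quotes can be verified in a few lines (data simulated; NumPy's polyfit used as the OLS fitter): any line other than the OLS line yields a larger sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 50)
y = 3 + 2 * x + rng.normal(0, 1, 50)

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line

def sse(b0, b1):
    residuals = y - (b0 + b1 * x)            # observed minus fitted
    return np.sum(residuals ** 2)            # squaring keeps them positive

print(f"SSE at the OLS line:  {sse(intercept, slope):.2f}")
print(f"SSE at a nudged line: {sse(intercept + 0.5, slope):.2f}")  # always larger
```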
“
SSE is a measure of variability. As the points spread out further from the fitted line, SSE increases. Because the calculations use squared differences, the variance is in squared units rather than the original units of the data. While higher values indicate greater variability, there is no intuitive interpretation of specific values. However, for a given data set, smaller SSE values signal that the observations fall closer to the fitted values. OLS minimizes this value, which means you’re getting the best possible line.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
The result is that an individual outlier can exert a strong influence over the entire model and, by itself, dramatically change the results.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
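That sensitivity to outliers follows from the squaring; a tiny simulated demonstration:

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1                            # a perfect line, slope 2
slope_clean = np.polyfit(x, y, 1)[0]

y[-1] += 50                              # one wild observation
slope_outlier = np.polyfit(x, y, 1)[0]
print(f"slope without outlier: {slope_clean:.2f}")     # 2.00
print(f"slope with one outlier: {slope_outlier:.2f}")  # dragged far from 2
```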
“
These three sums of squares have the following mathematical relationship: RSS + SSE = TSS
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Understanding this relationship is fairly straightforward. RSS represents the variability that your model explains. Higher is usually good. SSE represents the variability that your model does not explain. Smaller is usually good. TSS represents the variability inherent in your dependent variable. Or, Explained Variability + Unexplained Variability = Total Variability
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Additionally, if you take RSS / TSS, you’ll obtain the percentage of the variability of the dependent variable around its mean that your model explains. This statistic is R-squared!
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Based on the mathematical relationship shown above, you know that R-squared can range from 0 – 100%. Zero indicates that the model accounts for none of the variability in the dependent variable around its mean. 100% signifies that the model explains all of that variability.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
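The identity RSS + SSE = TSS, and R-squared as RSS/TSS, can be checked numerically on simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 1 + 0.8 * x + rng.normal(0, 2, 100)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

tss = np.sum((y - y.mean()) ** 2)        # total variability
rss = np.sum((fitted - y.mean()) ** 2)   # explained variability
sse = np.sum((y - fitted) ** 2)          # unexplained variability

print(np.isclose(rss + sse, tss))        # True: RSS + SSE = TSS
print(f"R-squared = RSS / TSS = {rss / tss:.3f}")
```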
“
This graph shows all the observations together with a line that represents the fitted relationship. As is traditional, the Y-axis displays the dependent variable, which is weight. The X-axis shows the independent variable, which is height. The line is the fitted line. If you enter the full range of height values that are on the X-axis into the regression equation that the chart displays, you will obtain the line shown on the graph. This line produces a smaller SSE than any other line you can draw through these observations. Visually, we see that the fitted line has a positive slope that corresponds to the positive correlation we obtained earlier. The line follows the data points, which indicates that the model fits the data. The slope of the line equals the coefficient that I circled. This coefficient indicates how much mean weight tends to increase as we increase height. We can also enter a height value into the equation and obtain a prediction for the mean weight. Each point on the fitted line represents the mean weight for a given height. However, like any mean, there is variability around the mean. Notice how there is a spread of data points around the line. You can assess this variability by picking a spot on the line and observing the range of data points above and below that point. Finally, the vertical distance between each data point and the line is the residual for that observation.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
“
Therefore all the things that appear in CLS—network analysis, digital mapping, linear and nonlinear regressions, topic modeling, topology, entropy—are just fancier ways of talking about word frequency changes.
”
Nan Z. Da
“
Examples of common algorithms used in supervised learning include regression analysis (e.g., linear regression, logistic regression, non-linear regression), decision trees, k-nearest neighbors, neural networks, and support vector machines, each of which is examined in later chapters.
”
Oliver Theobald (Machine Learning for Absolute Beginners: A Plain English Introduction)
“
regression results.
Standardized Coefficients
The question arises as to which independent variable has the greatest impact on explaining the dependent variable. The slope of the coefficients (b) does not answer this question because each slope is measured in different units (recall from Chapter 14 that b = ∆y/∆x). Comparing different slope coefficients is tantamount to comparing apples and oranges. However, based on the regression coefficient (or slope), it is possible to calculate the standardized coefficient, β (beta). Beta is defined as the change produced in the dependent variable by a unit of change in the independent variable when both variables are measured in terms of standard deviation units. Beta is unit-less and thus allows for comparison of the impact of different independent variables on explaining the dependent variable. Analysts compare the relative values of beta coefficients; beta has no inherent meaning. It is appropriate to compare betas across independent variables in the same regression, not across different regressions. Based on Table 15.1, we conclude that the impact of having adequate authority on explaining productivity is [(0.288 – 0.202)/0.202 =] 42.6 percent greater than teamwork, and about equal to that of knowledge. The impact of having adequate authority is two-and-a-half times greater than that of perceptions of fair rewards and recognition.4
F-Test
Table 15.1 also features an analysis of variance (ANOVA) table. The global F-test examines the overall effect of all independent variables jointly on the dependent variable. The null hypothesis is that the overall effect of all independent variables jointly on the dependent variable is statistically insignificant. The alternate hypothesis is that this overall effect is statistically significant. The null hypothesis implies that none of the regression coefficients is statistically significant; the alternate hypothesis implies that at least one of the regression coefficients is statistically significant. The
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
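One common way to obtain the beta coefficients described above is to z-score every variable and refit; a sketch with simulated data (variable names invented, statsmodels assumed):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({"teamwork": rng.normal(5, 1.0, n),
                   "authority": rng.normal(5, 1.5, n)})
df["productivity"] = (0.2 * df["teamwork"] + 0.3 * df["authority"]
                      + rng.normal(0, 1, n))

z = (df - df.mean()) / df.std()          # put every variable in SD units
betas = smf.ols("productivity ~ teamwork + authority", data=z).fit().params
print(betas)  # unit-less betas; relative sizes show relative impact
```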
“
Assume that a welfare manager in our earlier example (see discussion of path analysis) takes a snapshot of the status of the welfare clients. Some clients may have obtained employment and others not yet. Clients will also vary as to the amount of time that they have been receiving welfare. Examine the data in Table 18.2. It shows that neither of the two clients, who have yet to complete their first week on welfare, has found employment; one of the three clients who have completed one week of welfare has found employment. Censored observations are observations for which the specified outcome has yet to occur. It is assumed that all clients who have not yet found employment are still waiting for this event to occur. Thus, the sample should not include clients who are not seeking employment. Note, however, that a censored observation is very different from one that has missing data, which might occur because the manager does not know whether the client has found employment. As with regression, records with missing data are excluded from analysis. A censored
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
“
Note: The median survival time is 5.19.
Survival analysis can also examine survival rates for different “treatments” or conditions. Assume that data are available about the number of dependents that each client has. Table 18.3 is readily produced for each subset of this condition. For example, by comparing the survival rates of those with and those without dependents, the probability density figure, which shows the likelihood of an event occurring, can be obtained (Figure 18.5). This figure suggests that having dependents is associated with clients’ finding employment somewhat faster.
Beyond Life Tables
Life tables require that the interval (time) variable be measured on a discrete scale. When the time variable is continuous, Kaplan-Meier survival analysis is used. This procedure is quite analogous to life tables analysis. Cox regression is similar to Kaplan-Meier but allows for consideration of a larger number of independent variables (called covariates). In all instances, the purpose is to examine the effect of treatment on the survival of observations, that is, the occurrence of a dichotomous event.
Figure 18.5 Probability Density
FACTOR ANALYSIS
A variety of statistical techniques help analysts to explore relationships in their data. These exploratory techniques typically aim to create groups of variables (or observations) that are related to each
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
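A Kaplan-Meier sketch for the continuous-time case the quote mentions, assuming the third-party lifelines package; the durations and censoring flags below are invented, not the book's Table 18.2 data.

```python
from lifelines import KaplanMeierFitter

weeks_on_welfare = [1, 2, 2, 3, 5, 5, 6, 8, 9, 12]  # time until employment
found_job = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]          # 0 = censored (still waiting)

kmf = KaplanMeierFitter()
kmf.fit(weeks_on_welfare, event_observed=found_job)
print(kmf.median_survival_time_)      # analogous to the quoted median of 5.19
print(kmf.survival_function_.head())  # estimated survival curve
```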
“
other and distinct from other groups. These techniques usually precede regression and other analyses. Factor analysis is a well-established technique that often aids in creating index variables. Earlier, Chapter 3 discussed the use of Cronbach alpha to empirically justify the selection of variables that make up an index. However, in that approach analysts must still justify that variables used in different index variables are indeed distinct. By contrast, factor analysis analyzes a large number of variables (often 20 to 30) and classifies them into groups based on empirical similarities and dissimilarities. This empirical assessment can aid analysts’ judgments regarding variables that might be grouped together. Factor analysis uses correlations among variables to identify subgroups. These subgroups (called factors) are characterized by relatively high within-group correlation among variables and low between-group correlation among variables. Most factor analysis consists of roughly four steps: (1) determining that the group of variables has enough correlation to allow for factor analysis, (2) determining how many factors should be used for classifying (or grouping) the variables, (3) improving the interpretation of correlations and factors (through a process called rotation), and (4) naming the factors and, possibly, creating index variables for subsequent analysis. Most factor analysis is used for grouping of variables (R-type factor analysis) rather than observations (Q-type). Often, discriminant analysis is used for grouping of observations, mentioned later in this chapter. The terminology of factor analysis differs greatly from that used elsewhere in this book, and the discussion that follows is offered as an aid in understanding tables that might be encountered in research that uses this technique. An important task in factor analysis is determining how many common factors should be identified. Theoretically, there are as many factors as variables, but only a few factors account for most of the variance in the data. The percentage of variation explained by each factor is defined as the eigenvalue divided by the number of variables, whereby the
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
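An R-type factor analysis sketch using scikit-learn, one of several tools for this (rotation options and output format differ by package); the six variables are simulated so that they load on two latent factors.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(11)
n = 500
f1, f2 = rng.normal(size=(2, n))      # two hidden factors
X = np.column_stack(
    [f1 + 0.3 * rng.normal(size=n) for _ in range(3)]
    + [f2 + 0.3 * rng.normal(size=n) for _ in range(3)])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(fa.components_.round(2))  # loadings: high within a group, low across groups
```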
“
others seek to create and predict classifications through independent variables.
Table 18.4 Factor Analysis
Note: Factor analysis with Varimax rotation. Source: E. Berman and J. West. (2003). “What Is Managerial Mediocrity? Definition, Prevalence and Negative Impact (Part 1).” Public Performance & Management Review, 27 (December): 7–27.
Multidimensional scaling and cluster analysis aim to identify key dimensions along which observations (rather than variables) differ. These techniques differ from factor analysis in that they allow for a hierarchy of classification dimensions. Some also use graphics to aid in visualizing the extent of differences and to help in identifying the similarity or dissimilarity of observations. Network analysis is a descriptive technique used to portray relationships among actors. A graphic representation can be made of the frequency with which actors interact with each other, distinguishing frequent interactions from those that are infrequent. Discriminant analysis is used when the dependent variable is nominal with two or more categories. For example, we might want to know how parents choose among three types of school vouchers. Discriminant analysis calculates regression lines that distinguish (discriminate) among the nominal groups (the categories of the dependent variable), as well as other
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
“
regression lines that describe the relationship of the independent variables for each group (called classification functions). The emphasis in discriminant analysis is the ability of the independent variables to correctly predict values of the nominal variable (for example, group membership). Discriminant analysis is one strategy for dealing with dependent variables that are nominal with three or more categories. Multinomial logistic regression and ordinal regression have been developed in recent years to address nominal and ordinal dependent variables in logistic regression. Multinomial logistic regression calculates functions that compare the probability of a nominal value occurring relative to a base reference group. The calculation of such probabilities makes this technique an interesting alternative to discriminant analysis. When the nominal dependent variable has three values (say, 1, 2, and 3), one logistic regression predicts the likelihood of 2 versus 1 occurring, and the other logistic regression predicts the likelihood of 3 versus 1 occurring, assuming that “1” is the base reference group.7 When the dependent variable is ordinal, ordinal regression can be used. Like multinomial logistic regression, ordinal regression often is used to predict event probability or group membership. Ordinal regression assumes that the slope coefficients are identical for each value of the dependent variable; when this assumption is not met, multinomial logistic regression should be considered. Both multinomial logistic regression and ordinal regression are relatively recent developments and are not yet widely used. Statistics, like other fields of science, continues to push its frontiers forward and thereby develop new techniques for managers and analysts.
Key Point
Advanced statistical tools are available. Understanding the proper circumstances under which these tools apply is a prerequisite for using them.
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
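A multinomial logistic regression sketch in the spirit of the quote, using statsmodels MNLogit with simulated data; category 0 serves as the base reference group.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 300
X = sm.add_constant(rng.normal(size=(n, 2)))  # two independent variables
y = rng.integers(0, 3, size=n)                # nominal outcome: 0, 1, or 2

fit = sm.MNLogit(y, X).fit(disp=False)
print(fit.params)  # one coefficient column per non-reference category,
                   # each comparing that category against the base group (0)
```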
“
SUMMARY
A vast array of additional statistical methods exists. In this concluding chapter, we summarized some of these methods (path analysis, survival analysis, and factor analysis) and briefly mentioned other related techniques. This chapter can help managers and analysts become familiar with these additional techniques and increase their access to research literature in which these techniques are used. Managers and analysts who would like more information about these techniques will likely consult other texts or on-line sources. In many instances, managers will need only simple approaches to calculate the means of their variables, produce a few good graphs that tell the story, make simple forecasts, and test for significant differences among a few groups. Why, then, bother with these more advanced techniques? They are part of the analytical world in which managers operate. Through research and consulting, managers cannot help but come in contact with them. It is hoped that this chapter whets the appetite and provides a useful reference for managers and students alike.
KEY TERMS
* Endogenous variables
* Exogenous variables
* Factor analysis
* Indirect effects
* Loading
* Path analysis
* Recursive models
* Survival analysis
Notes
1. Two types of feedback loops are illustrated as follows:
2. When feedback loops are present, error terms for the different models will be correlated with exogenous variables, violating an error term assumption for such models. Then, alternative estimation methodologies are necessary, such as two-stage least squares and others discussed later in this chapter.
3. Some models may show double-headed arrows among error terms. These show the correlation between error terms, which is of no importance in estimating the beta coefficients.
4. In SPSS, survival analysis is available through the add-on module in SPSS Advanced Models.
5. The functions used to estimate probabilities are rather complex. They are so-called Weibull distributions, which are defined as h(t) = αλ(λt)^(α−1), where α and λ are chosen to best fit the data.
6. Hence, the SSL is greater than the squared loadings reported. For example, because the loadings of variables in groups B and C are not shown for factor 1, the SSL of shown loadings is 3.27 rather than the reported 4.084. If one assumes the other loadings are each .25, then the SSL of the not reported loadings is [12 × .25² =] .75, bringing the SSL of factor 1 to [3.27 + .75 =] 4.02, which is very close to the 4.084 value reported in the table.
7. Readers who are interested in multinomial logistic regression can consult on-line sources or the SPSS manual, Regression Models 10.0 or higher. The statistics of discriminant analysis are very dissimilar from those of logistic regression, and readers are advised to consult a separate text on that topic. Discriminant analysis is not often used in public
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
“
Beyond One-Way ANOVA
The approach described in the preceding section is called one-way ANOVA. This scenario is easily generalized to accommodate more than one independent variable. These independent variables are either discrete (called factors) or continuous (called covariates). These approaches are called n-way ANOVA or ANCOVA (the “C” indicates the presence of covariates). Two-way ANOVA, for example, allows for testing of the effect of two different independent variables on the dependent variable, as well as the interaction of these two independent variables. An interaction effect between two variables describes the way that variables “work together” to have an effect on the dependent variable. This is perhaps best illustrated by an example. Suppose that an analyst wants to know whether the number of health care information workshops attended, as well as a person’s education, are associated with healthy lifestyle behaviors. Although we can surely theorize how attending health care information workshops and a person’s education can each affect an individual’s healthy lifestyle behaviors, it is also easy to see that the level of education can affect a person’s propensity for attending health care information workshops, as well. Hence, an interaction effect could also exist between these two independent variables (factors). The effects of each independent variable on the dependent variable are called main effects (as distinct from interaction effects). To continue the earlier example, suppose that in addition to population, an analyst also wants to consider a measure of the watershed’s preexisting condition, such as the number of plant and animal species at risk in the watershed. Two-way ANOVA produces the results shown in Table 13.4, using the transformed variable mentioned earlier. The first row, labeled “model,” refers to the combined effects of all main and interaction effects in the model on the dependent variable. This is the global F-test. The “model” row shows that the two main effects and the single interaction effect, when considered together, are significantly associated with changes in the dependent variable (p < .000). However, the results also show a reduced significance level of “population” (now, p = .064), which seems related to the interaction effect (p = .076). Although neither effect is significant at conventional levels, the results do suggest that an interaction effect is present between population and watershed condition (of which the number of at-risk species is an indicator) on watershed wetland loss. Post-hoc tests are only provided separately for each of the independent variables (factors), and the results show the same homogeneous grouping for both of the independent variables.
Table 13.4 Two-Way ANOVA Results
As we noted earlier, ANOVA is a family of statistical techniques that allow for a broad range of rather complex experimental designs. Complete coverage of these techniques is well beyond the scope of this book, but in general, many of these techniques aim to discern the effect of variables in the presence of other (control) variables. ANOVA is but one approach for addressing control variables. A far more common approach in public policy, economics, political science, and public administration (as well as in many other fields) is multiple regression (see Chapter 15). Many analysts feel that ANOVA and regression are largely equivalent. Historically, the preference for ANOVA stems from its uses in medical and agricultural research, with applications in education and psychology. Finally, the ANOVA approach can be generalized to allow for testing on two or more dependent variables. This approach is called multiple analysis of variance, or MANOVA. Regression-based analysis can also be used for dealing with multiple dependent variables, as mentioned in Chapter 17.
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
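A two-way ANOVA with an interaction term can indeed be run through the regression machinery the quote calls largely equivalent; the factor names and data below are invented, and statsmodels is assumed.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 200
df = pd.DataFrame({"workshops": rng.integers(0, 3, n),   # factor 1
                   "education": rng.integers(0, 2, n)})  # factor 2
df["healthy"] = (df["workshops"] + df["education"]
                 + 0.5 * df["workshops"] * df["education"]
                 + rng.normal(0, 1, n))

model = smf.ols("healthy ~ C(workshops) * C(education)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects + interaction, F and p
```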
“
Simple Regression
CHAPTER OBJECTIVES
After reading this chapter, you should be able to
* Use simple regression to test the statistical significance of a bivariate relationship involving one dependent and one independent variable
* Use Pearson’s correlation coefficient as a measure of association between two continuous variables
* Interpret statistics associated with regression analysis
* Write up the model of simple regression
* Assess assumptions of simple regression
This chapter completes our discussion of statistical techniques for studying relationships between two variables by focusing on those that are continuous. Several approaches are examined: simple regression; the Pearson’s correlation coefficient; and a nonparametric alternative, Spearman’s rank correlation coefficient. Although all three techniques can be used, we focus particularly on simple regression. Regression allows us to predict outcomes based on knowledge of an independent variable. It is also the foundation for studying relationships among three or more variables, including control variables mentioned in Chapter 2 on research design (and also in Appendix 10.1). Regression can also be used in time series analysis, discussed in Chapter 17. We begin with simple regression.
SIMPLE REGRESSION
Let’s first look at an example. Say that you are a manager or analyst involved with a regional consortium of 15 local public agencies (in cities and counties) that provide low-income adults with health education about cardiovascular diseases, in an effort to reduce such diseases. The funding for this health education comes from a federal grant that requires annual analysis and performance outcome reporting. In Chapter 4, we used a logic model to specify that a performance outcome is the result of inputs, activities, and outputs. Following the development of such a model, you decide to conduct a survey among participants who attend such training events to collect data about the number of events they attended, their knowledge of cardiovascular disease, and a variety of habits such as smoking that are linked to cardiovascular disease. Some things that you might want to know are whether attending workshops increases
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
“
Table 14.1 also shows R-square (R2), which is called the coefficient of determination. R-square is of great interest: its value is interpreted as the percentage of variation in the dependent variable that is explained by the independent variable. R-square varies from zero to one, and is called a goodness-of-fit measure.5 In our example, teamwork explains only 7.4 percent of the variation in productivity. Although teamwork is significantly associated with productivity, it is quite likely that other factors also affect it. It is conceivable that other factors might be more strongly associated with productivity and that, when controlled for other factors, teamwork is no longer significant. Typically, values of R2 below 0.20 are considered to indicate weak relationships, those between 0.20 and 0.40 indicate moderate relationships, and those above 0.40 indicate strong relationships. Values of R2 above 0.65 are considered to indicate very strong relationships. R is called the multiple correlation coefficient and is always 0 ≤ R ≤ 1. To summarize up to this point, simple regression provides three critically important pieces of information about bivariate relationships involving two continuous variables: (1) the level of significance at which two variables are associated, if at all (t-statistic), (2) whether the relationship between the two variables is positive or negative (b), and (3) the strength of the relationship (R2).
Key Point
R-square is a measure of the strength of the relationship. Its value goes from 0 to 1.
The primary purpose of regression analysis is hypothesis testing, not prediction. In our example, the regression model is used to test the hypothesis that teamwork is related to productivity. However, if the analyst wants to predict the variable “productivity,” the regression output also shows the SEE, or the standard error of the estimate (see Table 14.1). This is a measure of the spread of y values around the regression line as calculated for the mean value of the independent variable, only, and assuming a large sample. The standard error of the estimate has an interpretation in terms of the normal curve, that is, 68 percent of y values lie within one standard error from the calculated value of y, as calculated for the mean value of x using the preceding regression model. Thus, if the mean index value of the variable “teamwork” is 5.0, then the calculated (or predicted) value of “productivity” is [4.026 + 0.223*5 =] 5.141. Because SEE = 0.825, it follows that 68 percent of productivity values will lie ±0.825 from 5.141 when “teamwork” = 5. Predictions of y for other values of x have larger standard errors.6
Assumptions and Notation
There are three simple regression assumptions. First, simple regression assumes that the relationship between two variables is linear. The linearity of bivariate relationships is easily determined through visual inspection, as shown in Figure 14.2. In fact, all analysis of relationships involving continuous variables should begin with a scatterplot. When variable
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
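The quoted prediction and SEE arithmetic can be checked directly:

```python
a, b = 4.026, 0.223       # intercept and slope from the quoted output
see = 0.825               # standard error of the estimate

teamwork = 5.0
predicted = a + b * teamwork
print(f"predicted productivity: {predicted:.3f}")                   # 5.141
print(f"~68% band: {predicted - see:.3f} to {predicted + see:.3f}")
```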
“
relationships are nonlinear (parabolic or otherwise heavily curved), it is not appropriate to use linear regression. Then, one or both variables must be transformed, as discussed in Chapter 12. Second, simple regression assumes that the linear relationship is constant over the range of observations. This assumption is violated when the relationship is “broken,” for example, by having an upward slope for the first half of independent variable values and a downward slope over the remaining values. Then, analysts should consider using two regression models each for these different, linear relationships. The linearity assumption is also violated when no relationship is present in part of the independent variable values. This is particularly problematic because regression analysis will calculate a regression slope based on all observations. In this case, analysts may be misled into believing that the linear pattern holds for all observations. Hence, regression results always should be verified through visual inspection. Third, simple regression assumes that the variables are continuous. In Chapter 15, we will see that regression can also be used for nominal and dichotomous independent variables. The dependent variable, however, must be continuous. When the dependent variable is dichotomous, logistic regression should be used (Chapter 16).
Figure 14.2 Three Examples of r
The following notations are commonly used in regression analysis. The predicted value of y (defined, based on the regression model, as y = a + bx) is typically different from the observed value of y. The predicted value of the dependent variable y is sometimes indicated as ŷ (pronounced “y-hat”). Only when R2 = 1 are the observed and predicted values identical for each observation. The difference between y and ŷ is called the regression error or error term
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
“
COEFFICIENT
The nonparametric alternative, Spearman’s rank correlation coefficient (ρ, or “rho”), looks at correlation among the ranks of the data rather than among the values. The ranks of data are determined as shown in Table 14.2 (adapted from Table 11.8):
Table 14.2 Ranks of Two Variables
In Greater Depth … Box 14.1 Crime and Poverty
An analyst wants to examine empirically the relationship between crime and income in cities across the United States. The CD that accompanies the workbook Exercising Essential Statistics includes a Community Indicators dataset with assorted indicators of conditions in 98 cities such as Akron, Ohio; Phoenix, Arizona; New Orleans, Louisiana; and Seattle, Washington. The measures include median household income, total population (both from the 2000 U.S. Census), and total violent crimes (FBI, Uniform Crime Reporting, 2004). In the sample, household income ranges from $26,309 (Newark, New Jersey) to $71,765 (San Jose, California), and the median household income is $42,316. Per-capita violent crime ranges from 0.15 percent (Glendale, California) to 2.04 percent (Las Vegas, Nevada), and the median violent crime rate per capita is 0.78 percent. There are four types of violent crimes: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. A measure of total violent crime per capita is calculated because larger cities are apt to have more crime. The analyst wants to examine whether income is associated with per-capita violent crime. The scatterplot of these two continuous variables shows that a negative relationship appears to be present: The Pearson’s correlation coefficient is –.532 (p < .01), and the Spearman’s correlation coefficient is –.552 (p < .01). The simple regression model shows R2 = .283. The regression model is as follows (t-test statistic in parentheses): The regression line is shown on the scatterplot. Interpreting these results, we see that the R-square value of .283 indicates a moderate relationship between these two variables. Clearly, some cities with modest median household incomes have a high crime rate. However, removing these cities does not greatly alter the findings. Also, an assumption of regression is that the error term is normally distributed, and further examination of the error shows that it is somewhat skewed. The techniques for examining the distribution of the error term are discussed in Chapter 15, but again, addressing this problem does not significantly alter the finding that the two variables are significantly related to each other, and that the relationship is of moderate strength. With this result in hand, further analysis shows, for example, by how much violent crime decreases for each increase in household income. For each increase of $10,000 in average household income, the violent crime rate drops 0.25 percent. For a city experiencing the median 0.78 percent crime rate, this would be a considerable improvement, indeed. Note also that the scatterplot shows considerable variation in the crime rate for cities at or below the median household income, in contrast to those well above it. Policy analysts may well wish to examine conditions that give rise to variation in crime rates among cities with lower incomes.
Because Spearman’s rank correlation coefficient examines correlation among the ranks of variables, it can also be used with ordinal-level data.9 For the data in Table 14.2, Spearman’s rank correlation coefficient is .900 (p = .035).10 Spearman’s rho-squared coefficient has a “percent variation explained” interpretation, similar
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
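A sketch contrasting Spearman's rank correlation with Pearson's, via SciPy; the income and crime figures are simulated, not the Community Indicators data from the box above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
income = rng.normal(42_316, 8_000, 98)
crime = -0.3 * income / 10_000 + rng.normal(0, 1, 98)  # invented relationship

rho, p_s = stats.spearmanr(income, crime)   # correlation of ranks
r, p_p = stats.pearsonr(income, crime)      # correlation of raw values
print(f"Spearman rho = {rho:.3f} (p = {p_s:.3g}); Pearson r = {r:.3f}")
```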
“
regression as dummy variables
* Explain the importance of the error term plot
* Identify assumptions of regression, and know how to test and correct assumption violations
Multiple regression is one of the most widely used multivariate statistical techniques for analyzing three or more variables. This chapter uses multiple regression to examine such relationships, and thereby extends the discussion in Chapter 14. The popularity of multiple regression is due largely to the ease with which it takes control variables (or rival hypotheses) into account. In Chapter 10, we discussed briefly how contingency tables can be used for this purpose, but doing so is often a cumbersome and sometimes inconclusive effort. By contrast, multiple regression easily incorporates multiple independent variables. Another reason for its popularity is that it also takes into account nominal independent variables. However, multiple regression is no substitute for bivariate analysis. Indeed, managers or analysts with an interest in a specific bivariate relationship will conduct a bivariate analysis first, before examining whether the relationship is robust in the presence of numerous control variables. And before conducting bivariate analysis, analysts need to conduct univariate analysis to better understand their variables. Thus, multiple regression is usually one of the last steps of analysis. Indeed, multiple regression is often used to test the robustness of bivariate relationships when control variables are taken into account. The flexibility with which multiple regression takes control variables into account comes at a price, though. Regression, like the t-test, is based on numerous assumptions. Regression results cannot be assumed to be robust in the face of assumption violations. Testing of assumptions is always part of multiple regression analysis. Multiple regression is carried out in the following sequence: (1) model specification (that is, identification of dependent and independent variables), (2) testing of regression assumptions, (3) correction of assumption violations, if any, and (4) reporting of the results of the final regression model. This chapter examines these four steps and discusses essential concepts related to simple and multiple regression. Chapters 16 and 17 extend this discussion by examining the use of logistic regression and time series analysis.
MODEL SPECIFICATION
Multiple regression is an extension of simple regression, but an important difference exists between the two methods: multiple regression aims for full model specification. This means that analysts seek to account for all of the variables that affect the dependent variable; by contrast, simple regression examines the effect of only one independent variable. Philosophically, the phrase identifying the key difference—“all of the variables that affect the dependent variable”—is divided into two parts. The first part involves identifying the variables that are of most (theoretical and practical) relevance in explaining the dependent
”
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
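As a companion to the four-step sequence above, here is a hedged sketch of how it might look in code. The pandas DataFrame and all column names (outcome, income, population, region) are hypothetical, and statsmodels is one of several libraries that could be used:

```python
# Sketch of the four-step multiple regression sequence described above.
# The DataFrame and its column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

def fit_multiple_regression(df: pd.DataFrame):
    # (1) Model specification: the dependent variable plus the independent
    # variables of most theoretical and practical relevance.
    y = df["outcome"]
    X = df[["income", "population"]].copy()

    # Nominal independent variables enter as dummy variables.
    X = X.join(pd.get_dummies(df["region"], prefix="region",
                              drop_first=True, dtype=float))
    X = sm.add_constant(X)
    model = sm.OLS(y, X).fit()

    # (2) Testing of regression assumptions, e.g., inspecting the error term.
    print("residual skew:", model.resid.skew())

    # (3) Correction of violations (transformations, robust errors, etc.)
    # would go here; (4) report the final model.
    print(model.summary())
    return model
```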
“
Figure 3.35 shows examples of nonstandard trend lines:

FIGURE 3.35 Nonstandard Trend Lines in XLF

A is drawn between lows in a downtrend instead of between highs in a downtrend.
B is also drawn between lows in a downtrend. Furthermore, it ignores a large price spike in an effort to fit the line to later data.
C is more of a best-fit line drawn through the center of a price area. These may be drawn freehand or via a procedure like linear regression.
D is drawn between highs in an uptrend.
E raises a critical point about trend lines: They are lines drawn between successive swings in the market. If there are no swings, there should be no trend line. It would be hard to argue that the market was showing any swings at E, at least on this time frame. This trend line may be valid on a lower time frame, but it is nonstandard on this time frame.

In general, trend lines are tools to define the relationship between swings, and are a complement to simple length-of-swing analysis. As such, one of the requirements for drawing trend lines is that there must actually be swings in the market. We see many cases where markets are flat, and it is possible to draw trend lines that touch the tops or bottoms of many consecutive price bars. With one important exception later in this chapter, these types of trend lines do not tend to be very significant. They are penetrated easily by the smallest motions in the market, and there is no reliable price action after the penetration. Avoid drawing these trend lines in flat markets with no definable swings.
”
”
Adam H. Grimes (The Art and Science of Technical Analysis: Market Structure, Price Action, and Trading Strategies (Wiley Trading Book 547))
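The best-fit line described at point C can be produced mechanically rather than freehand. A minimal sketch, using synthetic closing prices rather than actual XLF data, with NumPy’s least-squares polyfit standing in for “a procedure like linear regression”:

```python
# Sketch: fitting a best-fit ("point C" style) trend line through prices
# with ordinary least squares. Prices here are synthetic, not XLF data.
import numpy as np

closes = np.array([30.1, 30.4, 30.2, 30.8, 31.0, 30.9, 31.4, 31.6])
t = np.arange(len(closes))  # bar index as the x-axis

slope, intercept = np.polyfit(t, closes, deg=1)   # degree-1 least squares
trend = slope * t + intercept                     # the fitted trend line

print(f"slope per bar: {slope:.4f}")
# The residuals show how far each bar sits from the fitted line; large,
# persistent residuals suggest the line is being forced onto the data.
print("residuals:", np.round(closes - trend, 3))
```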
“
How can this type of data be made to tell a reliable story? By subjecting it to the economist’s favorite trick: regression analysis. No, regression analysis is not some forgotten form of psychiatric treatment. It is a powerful—if limited—tool that uses statistical techniques to identify otherwise elusive correlations.
”
”
Steven D. Levitt (Freakonomics: A Rogue Economist Explores the Hidden Side of Everything)
“
here are some steps to identify and track code that should be reviewed carefully:

- Tagging user stories for security features or business workflows which handle money or sensitive data.
- Grepping source code for calls to dangerous functions, like crypto functions.
- Scanning code review comments (if you are using a collaborative code review tool like Gerrit).
- Tracking code check-ins to identify code that is changed often: code with a high rate of churn tends to have more defects.
- Reviewing bug reports and static analysis to identify problem areas in code: code with a history of bugs, or code that has high complexity and low automated test coverage.
- Looking out for code that has recently undergone large-scale “root canal” refactoring. While day-to-day, in-phase refactoring can do a lot to simplify code and make it easier to understand and safer to change, major refactoring or redesign work can accidentally change the trust model of an application and introduce regressions.
”
”
Laura Bell (Agile Application Security: Enabling Security in a Continuous Delivery Pipeline)
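Several of these steps lend themselves to lightweight automation. As one illustration of the churn-tracking step, here is a sketch that assumes it runs inside a git working copy; the cutoff of ten files is an arbitrary choice, not something from the book:

```python
# Sketch: flag high-churn files for closer security review, per the
# "code changed often tends to have more defects" heuristic above.
import subprocess
from collections import Counter

log = subprocess.run(
    ["git", "log", "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

churn = Counter(line for line in log.splitlines() if line.strip())

print("Most frequently changed files (review candidates):")
for path, commits in churn.most_common(10):
    print(f"{commits:5d}  {path}")
```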
“
Figure 1.1 Correlation of Sales Characteristics to HubSpot Sales Success (Results of the First Regression Analysis).
”
”
Mark Roberge (The Sales Acceleration Formula: Using Data, Technology, and Inbound Selling to go from $0 to $100 Million)
“
Identify some of the actual individuals who are your best customers. Evaluate those with the highest customer lifetime value (CLV) and develop hypotheses about their shared traits. Although demographics and psychographics might be the most obvious, you’ll find additional insights if you examine their behavior. What channels did they come through? What messages resonated? How did they onboard? How recently, frequently, and deeply have they engaged?

Compare best customers and worst customers—those you acquired who weren’t ultimately profitable or who weren’t satisfied with your offering. Notice people who exhaust your free trial but don’t convert to paid, or who join but cancel within the first few months. The best customers have the greatest customer lifetime value (CLV); they will spend more with you over time than anyone else.

Produce either a qualitative write-up of your best customer or use regression analysis to prioritize characteristics. Share these conclusions with your frontline team—retail workers, customer support, sales—to accrue early insights.

With a concrete conception of your best customer, you can discern if the customer segment is sufficiently large to justify addressing. Test and adjust as needed. Then make these best customers and their forever promise as “real” as possible to the team. If you have actual customers who fit the profile, talk about them, invite them in, or have their pictures on your wall. You’re going to feel their pain, share their objectives, and design experiences for them. It’s important to know them well.
”
”
Robbie Kellman Baxter (The Forever Transaction: How to Build a Subscription Model So Compelling, Your Customers Will Never Want to Leave)
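The “regression analysis to prioritize characteristics” step might look like the sketch below, where the customer table, trait names, and values are all hypothetical:

```python
# Sketch: regress CLV on customer traits to see which characteristics
# best distinguish high-value customers. All names and values invented.
import pandas as pd
import statsmodels.api as sm

customers = pd.DataFrame({
    "clv":            [1200, 150, 900, 80, 1500, 300],
    "from_referral":  [1, 0, 1, 0, 1, 0],
    "onboarded_days": [2, 30, 5, 45, 1, 20],
    "monthly_visits": [12, 1, 9, 0, 15, 3],
})

X = sm.add_constant(customers.drop(columns="clv"))
model = sm.OLS(customers["clv"], X).fit()

# Coefficients (with their p-values) give a first-pass prioritization of
# which shared traits are associated with higher lifetime value.
print(model.params.round(2))
print(model.pvalues.round(3))
```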
“
There are a hundred thousand species of love, separately invented, each more ingenious than the last, and every one of them keeps making things. OLIVIA VANDERGRIFF SNOW IS THIGH-HIGH and the going slow. She plunges through drifts like a pack animal, Olivia Vandergriff, back to the boardinghouse on the edge of campus. Her last session ever of Linear Regression and Time Series Models has finally ended. The carillon on the quad peals five, but this close to the solstice, blackness closes around Olivia like midnight. Breath crusts her upper lip. She sucks it back in, and ice crystals coat her pharynx. The cold drives a metal filament up her nose. She could die out here, for real, five blocks from home. The novelty thrills her. December of senior year. The semester so close to over. She might stumble now, fall face-first, and still roll across the finish line. What’s left? A short-answer exam on survival analysis. Final paper in Intermediate Macroeconomics. Hundred and ten slide IDs in Masterpieces of World Art, her blow-off elective. Ten
”
”
Richard Powers (The Overstory)
“
The ‘quantitative revolution’ in geography required the discipline to adopt an explicitly scientific approach, including numerical and statistical methods, and mathematical modelling, so ‘numeracy’ became another necessary skill. Its immediate impact was greatest on human geography as physical geographers were already using these methods. A new lexicon encompassing the language of statistics and its array of techniques entered geography as a whole. Terms such as random sampling, correlation, regression, tests of statistical significance, probability, multivariate analysis, and simulation became part both of research and undergraduate teaching. Correlation and regression are procedures to measure the strength and form, respectively, of the relationships between two or more sets of variables. Significance tests measure the confidence that can be placed in those relationships. Multivariate methods enable the analysis of many variables or factors simultaneously – an appropriate approach for many complex geographical data sets. Simulation is often linked to probability and is a set of techniques capable of extrapolating or projecting future trends.
”
”
John A. Matthews (Geography: A Very Short Introduction)
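Of these techniques, simulation is perhaps the least self-explanatory. Here is a minimal sketch of a probability-based simulation projecting a future trend; the population figures and growth parameters are invented for illustration:

```python
# Sketch: a tiny Monte Carlo simulation of the kind used to extrapolate
# future trends. Growth parameters are invented for illustration.
import random

def simulate_population(start: float, years: int) -> float:
    pop = start
    for _ in range(years):
        pop *= 1 + random.gauss(0.02, 0.01)  # ~2% mean annual growth
    return pop

random.seed(42)
runs = sorted(simulate_population(100_000, years=20) for _ in range(5_000))

# The spread of simulated outcomes expresses the probability component.
print(f"median projection: {runs[len(runs) // 2]:,.0f}")
print(f"90% interval: {runs[250]:,.0f} to {runs[4750]:,.0f}")
```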
“
processed, and transformed into a format that is suitable for analysis. This often involves removing duplicate data, correcting errors, and dealing with missing values. After data is prepared, exploratory data analysis is performed to better understand the data and identify patterns, trends, and outliers. Descriptive statistics, data visualization, and data clustering techniques are often used to explore data. Once the data is understood, statistical methods such as hypothesis testing and regression analysis can be applied to identify relationships and make predictions.
”
”
Brian Murray (Data Analysis for Beginners: The ABCs of Data Analysis. An Easy-to-Understand Guide for Beginners)
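In a tool such as pandas, the preparation steps described (removing duplicates, correcting errors, handling missing values) reduce to a few calls. A sketch on an invented toy table:

```python
# Sketch of the data-preparation steps described above, using pandas.
# The DataFrame and its quirks are invented for illustration.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "city":   ["Akron", "Akron", "Phoenix", "Seattle"],
    "income": [42316, 42316, -1, 55000],      # -1 is a bogus sentinel value
    "crime":  [0.78, 0.78, 0.95, np.nan],     # one missing value
})

clean = (
    raw.drop_duplicates()                     # remove duplicate rows
       .replace({"income": {-1: np.nan}})     # correct an obvious error
       .dropna(subset=["income"])             # or impute instead of dropping
)
clean["crime"] = clean["crime"].fillna(clean["crime"].median())

print(clean.describe())  # a first step of exploratory data analysis
```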
“
Think back for a moment to the mental imagery used to explain regression analysis in the last chapter. We divide our data sample into different “rooms” in which each observation is identical except for one variable, which then allows us to isolate the effect of that variable while controlling for other potential confounding factors. We may have 692 individuals in our sample who have used both cocaine and heroin. However, we may have only 3 individuals who have used cocaine but not heroin and 2 individuals who have used heroin and not cocaine. Any inference about the independent effect of just one drug or the other is going to be based on these tiny samples.
”
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
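Wheelan’s warning can be checked before any regression is run, simply by cross-tabulating the variables one hopes to separate. A sketch that mirrors the counts in the passage (692, 3, and 2):

```python
# Sketch: inspect cell counts before trusting a regression to isolate
# one drug's independent effect. Counts mirror the example above.
import pandas as pd

users = pd.DataFrame(
    [("yes", "yes")] * 692 + [("yes", "no")] * 3 + [("no", "yes")] * 2,
    columns=["cocaine", "heroin"],
)

# The off-diagonal "rooms" are nearly empty, so any independent-effect
# estimate rests on a handful of observations.
print(pd.crosstab(users["cocaine"], users["heroin"]))
```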
“
Second, like most other statistical inference, regression analysis builds only a circumstantial case. An association between two variables is like a fingerprint at the scene of the crime. It points us in the right direction, but it’s rarely enough to convict. (And sometimes a fingerprint at the scene of a crime doesn’t belong to the perpetrator.) Any regression analysis needs a theoretical underpinning: Why are the explanatory variables in the equation? What phenomena from other disciplines can explain the observed results? For instance, why do we think that wearing purple shoes would boost performance on the math portion of the SAT or that eating popcorn can help prevent prostate cancer? The results need to be replicated, or at least consistent with other findings.
”
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
Specifically, regression analysis allows us to quantify the relationship between a particular variable and an outcome that we care about while controlling for other factors.
”
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
At its core, regression analysis seeks to find the “best fit” for a linear relationship between two variables. A simple example is the relationship between height and weight. People who are taller tend to weigh more—though that is obviously not always the case.
”
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
Sheldon George’s chapter, “Jouissance and Discontent: A Meeting of Psychoanalysis, Race and American Slavery,” provides a captivating account of Freud’s contribution to these problems. Freud contended that civilization served to usurp the individual, and interrupt the impulse gratification that is more natural and instinctive for human beings. George points out that this places the blame for human misery on civilization and culture, which renounce the pursuit of pleasure. For Freud, then, ethnicity and racial identity are regressive. Civilization moves toward unity, but ethnicity and racial identification pull us back toward the aggressive and antisocial instincts. George provides a fascinating account of the consequences of Freud’s thesis, both for psychoanalysis and the broader cultural problem of racism. Utilizing Lacan, he offers a fresh analysis of the phenomenon of American slavery and demonstrates the way race mediates the way people access jouissance. The problem at hand runs deep, and George’s chapter provides a unique perspective on the development of the modern idea of “race” and its many consequences. Donna
”
”
David M. Goodman (Race, Rage, and Resistance: Philosophy, Psychology, and the Perils of Individualism (Psychology and the Other))
“
Life gets a little trickier when we are doing our regression analysis (or other forms of statistical inference) with a small sample of data.
”
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
Regression analysis enables us to go one step further and “fit a line” that best describes a linear relationship between the two variables.
”
”
Charles Wheelan (Naked Statistics: Stripping the Dread from the Data)
“
Do not believe in anything simply because you have heard it. Do not believe in anything simply because it is spoken of and rumored by many. Do not believe in anything simply because it is found written in your religious books. Do not believe in anything merely on the authority of your teachers and elders. Do not believe in traditions because they have been handed down for many generations. But after observation and analysis, when you find that anything agrees with reason and is conducive to the good and benefit of one and all, then accept it and live up to it.” — Buddha
”
”
David Rippy (The Immortal Soul; the Journey to Enlightenment: Case Studies of Hypnotically Regressed Subjects and their Afterlives)
“
Balint introduced his concept of primary love (Balint, 1937) specifically to refute Freud's concept of primary narcissism. Balint believed, like Ferenczi and Suttie, that human beings are relationally oriented from the beginning. In the stage of primary love, mother and child ideally live interdependently, with boundaries blurred, in “an harmonious interpenetrating mix-up” (Balint, 1968). He saw the origin of psychopathology in disruptions and failures of this primary love experience. He observed that analysands, often after reaching more mature forms of relating to the analyst, would regress to the level of “the basic fault” (1968), the area of the personality formed by traumatic disruptions of the state of primary love. Analysands would then seek to use their analysis for the purpose of making a “new beginning.” The new beginning helps the analysand to “free himself of complex, rigid, and oppressive forms of relationship to his objects of love and hate … and to start simpler, less oppressive forms” (Balint, 1968, p. 134). Balint spoke memorably of the analyst's stance at this stage: the analyst … must allow his patients to relate to, or exist with, him as if he were one of the primary substances. This means that he should be there, but like water carries the swimmer or the earth carries the walker … [H]e must be there, must always be there, and must be indestructible—as are water and earth.
”
”
Daniel Shaw (Traumatic Narcissism: Relational Systems of Subjugation (Relational Perspectives Book Series 58))
“
Several methods have been developed to attempt to address confounding in observational research such as adjusting for the confounder in regression equations if it is known and measured
”
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
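Adjusting for a known, measured confounder usually just means including it as a regressor. A sketch with synthetic data, where “severity” confounds treatment and outcome; the variable names and effect sizes are invented:

```python
# Sketch: confounder adjustment by inclusion in the regression equation.
# Data are synthetic; "severity" confounds treatment and outcome.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
severity = rng.normal(size=n)
treated = (severity + rng.normal(size=n) > 0).astype(float)  # sicker patients treated more
outcome = 2.0 * severity + 0.5 * treated + rng.normal(size=n)

# Unadjusted: the treatment effect absorbs the confounder's influence.
naive = sm.OLS(outcome, sm.add_constant(treated)).fit()

# Adjusted: the confounder enters the equation alongside treatment.
X = sm.add_constant(np.column_stack([treated, severity]))
adjusted = sm.OLS(outcome, X).fit()

print("naive effect:   ", round(naive.params[1], 2))
print("adjusted effect:", round(adjusted.params[1], 2))  # nearer the true 0.5
```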
“
Davies takes the view that if, on the one hand, this state is a return to an earlier “frozen experience” (Winnicott, 1954a, p. 86 in this book), on the other hand, it may be an entirely new experience for the patient who has taken the risk of letting himself be known in this narcissistically vulnerable state. Where primary process thinking dominates, the potential for change may be greater, since defences are breached and access to unconscious material may be enhanced.
Davies suggests that the regressed state is ruptured by the recognition of dependence within the analysis: this can be a life or death moment, as hate finds expression when the patient emerges from the regression. Destructiveness is definitively present as the subject recognizes the presence of a discrete other, the “otherness” of the other (p. 92)
”
”
Rosine J. Perelberg (Time and Memory (The Psychoanalytic Ideas Series))
“
I speak about technical debt quite a bit in this chapter, so I wanted to leave you with a few thoughts on how to measure it. There are, of course, the static code-analysis tools that provide a view of technical debt based on coding issues. I think that is a good starting point. I would add to this the cost to deploy the application (e.g., how many hours of people time does it require to deploy into a test or production environment), the cost of regression testing the product (how many hours of people time does it take to validate nothing has broken), and the cost of creating a new environment with the application. If you are more ambitious, you can also look for measures of complexity and dependencies with other applications, but I have not yet seen a good repeatable way for measuring those. The first four I mention are relatively easy to determine and should therefore be the basis for your measure of technical debt.
”
”
Mirco Hering (DevOps for the Modern Enterprise: Winning Practices to Transform Legacy IT Organizations)
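The four measures Hering describes can be rolled into a single number tracked release over release. A sketch; the weights and sample figures are assumptions, not recommendations from the book:

```python
# Sketch: combine the four technical-debt measures mentioned above into
# one trackable score. Weights and sample hours are invented.
from dataclasses import dataclass

@dataclass
class DebtMeasures:
    static_analysis_issues: int   # from a code-analysis tool
    deploy_hours: float           # people-hours to deploy to test/prod
    regression_test_hours: float  # people-hours to validate nothing broke
    new_environment_hours: float  # people-hours to stand up an environment

def debt_score(m: DebtMeasures) -> float:
    # Equal weighting of the effort measures, issues scaled down; the aim
    # is a consistent baseline to compare release over release.
    return (m.static_analysis_issues * 0.1
            + m.deploy_hours + m.regression_test_hours + m.new_environment_hours)

this_release = DebtMeasures(340, deploy_hours=16,
                            regression_test_hours=40, new_environment_hours=24)
print(f"technical debt score: {debt_score(this_release):.1f}")
```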
“
hedonic regression analysis to identify how markets valued individual attributes and how those attribute values changed over time.
”
”
Clayton M. Christensen (The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail (Management of Innovation and Change))
“
hedonic regression analysis expresses the total price of a product as the sum of individual so-called shadow prices (some positive, others negative) that the market places on each of the product’s characteristics
”
”
Clayton M. Christensen (The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail (Management of Innovation and Change))
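In code, the shadow prices are simply the fitted coefficients on each product attribute. A sketch of a hedonic regression on invented disk-drive-style data, using statsmodels:

```python
# Sketch: hedonic regression, expressing price as a sum of shadow prices
# on product attributes. Drive data are invented for illustration.
import pandas as pd
import statsmodels.api as sm

drives = pd.DataFrame({
    "price":       [320, 410, 280, 520, 450, 390],
    "capacity_mb": [100, 150, 80, 210, 170, 140],
    "weight_kg":   [2.5, 2.6, 2.2, 3.0, 2.8, 2.4],
})

X = sm.add_constant(drives[["capacity_mb", "weight_kg"]])
model = sm.OLS(drives["price"], X).fit()

# Each coefficient is the shadow price the market places on one unit of
# that characteristic (positive for capacity, possibly negative for weight).
print(model.params.round(2))
```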
“
In fact, regression analysis demonstrates that the number one determinant of deal multiples is the growth rate of the business.
”
”
Andrew J. Sherman (Mergers and Acquisitions from A to Z)