“
I am now convinced that Google searches are the most important dataset ever collected on the human psyche.
”
Seth Stephens-Davidowitz (Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are)
“
So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate.
”
Tim Harford (The Data Detective: Ten Easy Rules to Make Sense of Statistics)
“
Forget everything you ordinarily associate with religious study. Strip away all the reverence and the awe and the art and the philosophy of it. Treat the subject coldly. Imagine yourself to be a theologist, but a special kind of theologist, one who studies gods the way an entomologist studies insects. Take as your dataset the entirety of world mythology and treat it as a collection of field observations and statistics pertaining to a hypothetical species: the god. Proceed from there.
”
Lev Grossman (The Magician King (The Magicians, #2))
“
Some readers are bound to want to take the techniques we’ve introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not a good predictor of future returns—looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.
”
François Chollet (Deep Learning with Python)
“
I’ve laid down ten statistical commandments in this book. First, we should learn to stop and notice our emotional reaction to a claim, rather than accepting or rejecting it because of how it makes us feel. Second, we should look for ways to combine the “bird’s eye” statistical perspective with the “worm’s eye” view from personal experience. Third, we should look at the labels on the data we’re being given, and ask if we understand what’s really being described. Fourth, we should look for comparisons and context, putting any claim into perspective. Fifth, we should look behind the statistics at where they came from—and what other data might have vanished into obscurity. Sixth, we should ask who is missing from the data we’re being shown, and whether our conclusions might differ if they were included. Seventh, we should ask tough questions about algorithms and the big datasets that drive them, recognizing that without intelligent openness they cannot be trusted. Eighth, we should pay more attention to the bedrock of official statistics—and the sometimes heroic statisticians who protect it. Ninth, we should look under the surface of any beautiful graph or chart. And tenth, we should keep an open mind, asking how we might be mistaken, and whether the facts have changed.
”
Tim Harford (The Data Detective: Ten Easy Rules to Make Sense of Statistics)
“
Humans do weird things to datasets.
”
Janelle Shane (You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It's Making the World a Weirder Place)
“
As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.
”
Ian Goodfellow
“
At the risk of being redundant, everything must be checked in to code. This includes the following: DB object migrations, triggers, procedures and functions, views, configurations, sample datasets for functionality, and data cleanup scripts.
”
Laine Campbell (Database Reliability Engineering: Designing and Operating Resilient Database Systems)
“
We received a new dataset each day. Because it took time for new cases to be reported, there were fewer recent cases in each of these datasets: if someone fell ill on a Monday, they generally wouldn’t show up in the data until Wednesday or Thursday. The epidemic was still going, but these delays made it look like it was almost over.
”
Adam Kucharski (The Rules of Contagion: Why Things Spread - and Why They Stop)
“
What astounded me was that the cutting edge of human knowledge was so close. Before I educated myself, I assumed that there was a great depth of science, that every question of importance had been cataloged, studied, that all the answers were there, if only someone could query the datasets the right way. And for some things, that was true.
”
James S.A. Corey (The Vital Abyss (Expanse, #5.5))
“
Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are.
”
Seth Stephens-Davidowitz (Everybody Lies)
“
We must not confuse performance on a dataset with the acquisition of an underlying ability.
”
Geiros
“
I’m rich in interpretation and poor in datasets,
”
James S.A. Corey (Cibola Burn (Expanse, #4))
“
Hashing is more secure than encryption, at least in the sense that there exists no private key that can “reverse” a hash back into its original, readable form. Thus, if a machine doesn’t need to know the contents of a dataset, it should be given the hash of the dataset instead.
”
Chris Dannen (Introducing Ethereum and Solidity: Foundations of Cryptocurrency and Blockchain Programming for Beginners)
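Dannen's point (hand a verifier the digest rather than the data) can be sketched with Python's standard-library hashlib. The record structure, helper name, and canonical-JSON choice below are illustrative assumptions, not anything from the book.

```python
import hashlib
import json

def dataset_fingerprint(records: list) -> str:
    """Return a SHA-256 hex digest of a dataset's canonical JSON form."""
    # Canonical serialization (sorted keys, no whitespace) so identical data
    # always produces the identical digest.
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

records = [{"id": 1, "value": 42}, {"id": 2, "value": 7}]
digest = dataset_fingerprint(records)
print(digest)  # 64 hex characters; reveals nothing usable about the contents

# A machine that only needs to verify the dataset can compare digests
# without ever seeing the underlying records.
assert digest == dataset_fingerprint([{"id": 1, "value": 42}, {"id": 2, "value": 7}])
```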
“
Roy Jastram has produced a systematic study of the purchasing power of gold over the longest consistent datasets available.6 Observing English data from 1560 to 1976 to analyze the change in gold's purchasing power in terms of commodities, Jastram finds gold dropping in purchasing power during the first 140 years, but then remaining relatively stable from 1700 to 1914, when Britain went off the gold standard. For more than two centuries during which Britain primarily used gold as money, its purchasing power remained relatively constant, as did the price of wholesale commodities.
”
Saifedean Ammous (The Bitcoin Standard: The Decentralized Alternative to Central Banking)
“
Understanding reduces the complexity of data by collapsing the dimensionality of information to a lower set of known variables.
There you have it: a generalizable principle. What was once a massive, high-dimensional dataset has now collapsed to a single dimension, a simple principle that comes from using the data but is not the data itself. Understanding transcends context, since the different contexts collapse according to their previously unknown similarity, which the principle contains. That is what understanding does. And you actually feel it in your brain when it happens. Your “cognitive load” decreases, your level of stress and anxiety decrease, and your emotional state improves.
”
Beau Lotto (Deviate: The Science of Seeing Differently)
“
In the 1990s, a set of renegade researchers set aside many of the earlier era’s assumptions, shifting their focus to machine learning. While machine learning dated to the 1950s, new advances enabled practical applications. The methods that have worked best in practice extract patterns from large datasets using neural networks. In philosophical terms, AI’s pioneers had turned from the early Enlightenment’s focus on reducing the world to mechanistic rules to constructing approximations of reality. To identify an image of a cat, they realized, a machine had to “learn” a range of visual representations of cats by observing the animal in various contexts. To enable machine learning, what mattered was the overlap between various representations of a thing, not its ideal—in philosophical terms, Wittgenstein, not Plato. The modern field of machine learning—of programs that learn through experience—was born.
”
Henry Kissinger (The Age of A.I. and Our Human Future)
“
Computational model: history is the on-chain population; all the rest is editorialization. There’s a great book by Franco Moretti called Graphs, Maps, and Trees. It’s a computational study of literature. Moretti’s argument is that every other study of literature is inherently biased. The selection of which books to discuss is itself an implicit editorialization. He instead makes this completely explicit by creating a dataset of full texts, and writing code to produce graphs. The argument here is that only a computational history can represent the full population in a statistical sense; anything else is just a biased sample.
”
Balaji S. Srinivasan (The Network State: How To Start a New Country)
“
Both measurement error and sampling error are unpredictable, but they’re predictably unpredictable. You can always expect data from different samples, measures or groups to have somewhat different characteristics – in terms of the averages, the highest and lowest scores, and practically everything else. So even though they’re normally a nuisance, measurement error and sampling error can be useful as a means of spotting fraudulent data. If a dataset looks too neat, too tidily similar across different groups, something strange might be afoot. As the geneticist J. B. S. Haldane put it, ‘man is an orderly animal’ who ‘finds it very hard to imitate the disorder of nature’, and that goes for fraudsters as much as for the rest of us.
”
Stuart Ritchie (Science Fictions)
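Ritchie's "predictably unpredictable" point lends itself to a small simulation (a sketch on synthetic numpy data, not anything from the book): honest samples from one population still differ in their summary statistics, while fabricated groups that are too similar stand out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five honest samples drawn from the same population: sampling error alone
# makes their means and standard deviations differ noticeably.
honest = [rng.normal(loc=100, scale=15, size=30) for _ in range(5)]

# A "too tidy" fabrication: one set of numbers recycled with tiny jitter,
# so every group looks implausibly similar.
base = rng.normal(loc=100, scale=15, size=30)
fabricated = [base + rng.normal(scale=0.1, size=30) for _ in range(5)]

def spread_of_group_means(groups):
    """Standard deviation of the group means; theory predicts roughly scale / sqrt(n)."""
    return np.std([g.mean() for g in groups], ddof=1)

print(spread_of_group_means(honest))      # on the order of 15 / sqrt(30), about 2.7
print(spread_of_group_means(fabricated))  # far smaller than sampling error allows
```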
“
I have spent just about every day of the past four years analyzing Google data. This included a stint as a data scientist at Google, which hired me after learning about my racism research. And I continue to explore this data as an opinion writer and data journalist for the New York Times. The revelations have kept coming. Mental illness; human sexuality; child abuse; abortion; advertising; religion; health. Not exactly small topics, and this dataset, which didn’t exist a couple of decades ago, offered surprising new perspectives on all of them. Economists and other social scientists are always hunting for new sources of data, so let me be blunt: I am now convinced that Google searches are the most important dataset ever collected on the human psyche.
”
Seth Stephens-Davidowitz (Everybody Lies)
“
One approach involves looking at three different kinds of bias: physical bias, computational bias, and interpretation bias. This approach was proposed by UCLA engineering professor Achuta Kadambi in a 2021 Science paper.26 Physical bias manifests in the mechanics of the device, as when a pulse oximeter works better on light skin than darker skin. Computational bias might come from the software or the dataset used to develop a diagnostic, as when only light skin is used to train a skin cancer detection algorithm. Interpretation bias might occur when a doctor applies unequal, race-based standards to the output of a test or device, as when doctors give a different GFR threshold to Black patients. “Bias is multidimensional,” Kadambi told Scientific American. “By understanding where it originates, we can better correct it.
”
Meredith Broussard (More than a Glitch: Confronting Race, Gender, and Ability Bias in Tech)
“
Bertrand Russell famously said: “It is undesirable to believe a proposition when there is no ground whatsoever for supposing it is true.” [but] Russell’s maxim is the luxury of a technologically advanced society with science, history, journalism, and their infrastructure of truth-seeking, including archival records, digital datasets, high-tech instruments, and communities of editing, fact-checking, and peer review. We children of the Enlightenment embrace the radical creed of universal realism: we hold that all our beliefs should fall within the reality mindset. We care about whether our creation story, our founding legends, our theories of invisible nutrients and germs and forces, our conceptions of the powerful, our suspicions about our enemies, are true or false. That’s because we have the tools to get answers to these questions, or at least to assign them warranted degrees of credence. And we have a technocratic state that should, in theory, put these beliefs into practice.
But as desirable as that creed is, it is not the natural human way of believing. In granting an imperialistic mandate to the reality mindset to conquer the universe of belief and push mythology to the margins, we are the weird ones—or, as evolutionary social scientists like to say, the WEIRD ones: Western, Educated, Industrialized, Rich, Democratic. At least, the highly educated among us are, in our best moments. The human mind is adapted to understanding remote spheres of existence through a mythology mindset. It’s not because we descended from Pleistocene hunter-gatherers specifically, but because we descended from people who could not or did not sign on to the Enlightenment ideal of universal realism. Submitting all of one’s beliefs to the trials of reason and evidence is an unnatural skill, like literacy and numeracy, and must be instilled and cultivated.
”
Steven Pinker (Rationality: What It Is, Why It Seems Scarce, Why It Matters)
“
Determine and fill in the missing value manually. In general, this approach is the most accurate but it is also time-consuming and often is not feasible in a large dataset with many missing
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
“
Data integration is the process of combining data derived from various data sources (such as databases, flat files, etc.) into a consistent dataset. There are a number of issues to consider
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
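A minimal pandas sketch of that integration step, with two invented sources and an invented shared key; real integration work adds far more validation than this.

```python
import pandas as pd

# Two hypothetical sources: a database export and a flat file.
patients = pd.DataFrame(
    {"patient_id": [1, 2, 3], "age": [54, 61, 47]}
)
labs = pd.DataFrame(
    {"patient_id": [1, 1, 2],
     "test": ["Creatinine", "Lactate", "Lactate"],
     "value": [1.1, 2.3, 1.8]}
)

# Integrate on the shared key; a left join keeps every patient even if no labs exist.
combined = patients.merge(labs, on="patient_id", how="left")

# Harmonize labels so the merged dataset is internally consistent.
combined["test"] = combined["test"].str.lower()
print(combined)
```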
“
Census taking is among the oldest ways of collecting statistics. Much newer, but with similar aspirations to reach everyone, is “big data.” Professor Viktor Mayer-Schönberger of Oxford’s Internet Institute, and coauthor of the book Big Data, told me that his favored definition of a big dataset is one where “N = All”—where we no longer have to sample, because we have the entire background population.[18]
”
Tim Harford (The Data Detective: Ten Easy Rules to Make Sense of Statistics)
“
What is a good correlation? How high should it be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak. However, there is only one correct answer. The correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.705. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.
”
Jim Frost (Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models)
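Frost's 0.705 comes from his own height and weight dataset, which is not reproduced here. The sketch below only shows how such a coefficient is computed, on made-up values, using scipy.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up height (m) and weight (kg) values purely for illustration;
# Frost's 0.705 comes from his dataset, not from these numbers.
height = np.array([1.58, 1.62, 1.70, 1.75, 1.80, 1.85, 1.68, 1.73])
weight = np.array([55, 62, 66, 74, 78, 90, 63, 72])

r, p_value = pearsonr(height, weight)
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
```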
“
Heterogeneous effects might be hidden because PD plots only show the average marginal effects. Suppose that for a feature half your data points have a positive association with the prediction – the larger the feature value the larger the prediction – and the other half has a negative association – the smaller the feature value the larger the prediction. The PD curve could be a horizontal line, since the effects of both halves of the dataset could cancel each other out. You then conclude that the feature has no effect on the prediction. By plotting the individual conditional expectation curves instead of the aggregated line, we can uncover heterogeneous effects.
”
Christoph Molnar (Interpretable Machine Learning: A Guide For Making Black Box Models Explainable)
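The cancellation Molnar describes is easy to reproduce on synthetic data: give half the points a positive association with the feature and half a negative one, and the partial dependence line comes out roughly flat while the ICE curves split into two bundles. The sketch assumes a reasonably recent scikit-learn (PartialDependenceDisplay.from_estimator with kind="both"); the data and model are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-1, 1, size=n)
group = rng.integers(0, 2, size=n)                      # hidden subgroup indicator
y = np.where(group == 1, x, -x) + rng.normal(scale=0.05, size=n)

X = np.column_stack([x, group])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# kind="both" overlays the individual conditional expectation curves on the
# (roughly flat) partial dependence line, revealing the two opposing halves.
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```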
“
The computation of partial dependence plots is intuitive: The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value. In my experience, lay people usually understand the idea of PDPs quickly. If the feature for which you computed the PDP is not correlated with the other features, then the PDPs perfectly represent how the feature influences the prediction on average. In the uncorrelated case, the interpretation is clear: The partial dependence plot shows how the average prediction in your dataset changes when the j-th feature is changed. It is more complicated when features are correlated, see also disadvantages. Partial dependence plots are easy to implement. The calculation for the partial dependence plots has a causal interpretation. We intervene on a feature and measure the changes in the predictions. In doing so, we analyze the causal relationship between the feature and the prediction.3 The relationship is causal for the model – because we explicitly model the outcome as a function of the features – but not necessarily for the real world!
”
Christoph Molnar (Interpretable Machine Learning: A Guide For Making Black Box Models Explainable)
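The averaging recipe in the first sentence translates almost directly into code. Here is a minimal, model-agnostic sketch (my own illustration, not Molnar's implementation): force every row to each grid value for the chosen feature, predict, and average.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def partial_dependence(model, X: np.ndarray, feature: int, grid: np.ndarray) -> np.ndarray:
    """Average prediction when every row is forced to take each grid value in turn."""
    pd_values = []
    for value in grid:
        X_forced = X.copy()
        X_forced[:, feature] = value          # intervene on the chosen feature
        pd_values.append(model.predict(X_forced).mean())
    return np.array(pd_values)

# Tiny illustration on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)
model = LinearRegression().fit(X, y)

grid = np.linspace(-2, 2, 5)
print(partial_dependence(model, X, feature=0, grid=grid))  # roughly 2 * grid here
```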
“
I am now convinced that Google searches are the most important dataset ever collected on the human psyche.
”
Seth Stephens-Davidowitz (Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are)
“
Tesla > NYT. Elon Musk used the instrumental record of a Tesla drive to knock down an NYT story. The New York Times Company claimed the car had run out of charge, but his dataset showed they had purposefully driven it around to make this happen, lying about their driving history. His numbers overturned their letters. Timestamp > Macron, NYT. Twitter posters used a photo’s timestamp to disprove a purported photo of the Brazilian fires that was tweeted by Emmanuel Macron and printed uncritically by NYT. The photo was shown via reverse image search to be taken by a photographer who had died in 2003, so it was more than a decade old. This was a big deal because The Atlantic was literally calling for war with Brazil over these (fake) photos. Provable patent priority. A Chinese court used an on-chain timestamp to establish priority in a patent suit. One company proved that it could not have infringed the patent of the other, because it had filed “on chain” before the other company had filed. In the first and second examples, the employees of the New York Times Company simply misrepresented the facts as they are wont to do, circulating assertions that were politically useful against two of their perennial opponents: the tech founder and the foreign conservative. Whether these misrepresentations were made intentionally or out of “too good to check” carelessness, they were both attempts to exercise political power that ran into the brick wall of technological truth. In the third example, the Chinese political system delegated the job of finding out what was true to the blockchain. In all three cases, technology provided a more robust means of determining what was true than the previous gold standards — whether that be the “paper of record” or the party-state. It decentralized the determination of truth away from the centralized establishment.
”
Balaji S. Srinivasan (The Network State: How To Start a New Country)
“
What is certain is that the implants have facilitated the most comprehensive dataset on human consciousness ever conceived.
”
Andrew Gillsmith (Our Lady of the Artilects)
“
Do your methods present an authentic account of LGBTQ lives?
Rather than adopt methods that promise a tidy dataset, recognize that data about identity characteristics is leaky, pluralistic and can change over time. A queer approach involves the use of innovative collection and analysis methods, such as multiple response options and the provision of open-text boxes, to produce a more authentic reflection of lives and experiences.
”
Kevin Guyan (Queer Data: Using Gender, Sex and Sexuality Data for Action (Bloomsbury Studies in Digital Cultures))
“
For every one cradle none in the dataset who now has a religious affiliation, there are five who were brought up religiously who now identify as nones. That’s five nonverts for every convert.
”
Stephen Bullivant (Nonverts: The Making of Ex-Christian America)
“
Presently, foundational resources essential to cutting-edge AI research and development like compute power, datasets, development frameworks and pre-trained models, remain overwhelmingly centralized under the control of Amazon, Microsoft, Google and several other giants who operate the dominant cloud computing platforms. Open source efforts cannot truly flourish or compete if trapped within the confines of the Big Tech clouds and proprietary ecosystems.
”
I. Almeida (Introduction to Large Language Models for Business Leaders: Responsible AI Strategy Beyond Fear and Hype (Byte-sized Learning Book 2))
“
In the early days of computing, machines were seen as tools to be controlled by their human operators. The relationship was entirely one-sided; humans input instructions, and the computer executed them. However, as computers advanced, they took on roles previously in the human domain. They could calculate complex equations, manage large datasets, and even defeat humans at chess. This marked the beginning of a shift from viewing computers as mere tools to seeing them as collaborators.
”
Enamul Haque (AI Horizons: Shaping a Better Future Through Responsible Innovation and Human Collaboration)
“
Heffernan’s research was based in the rural area of Union Parish in Louisiana, where a booming poultry industry was expanding in the 1960s. Vertically integrated poultry production was still a radical concept back then, and Heffernan wanted to study it. So he undertook an effort that no one else seems to have duplicated. He went door to door, made phone calls, and drove hundreds of miles between farms. He surveyed farmers and documented their income and their debt. Crucially, he followed up with farmers in Union Parish every ten years until the turn of the century, building a dataset that was forty years deep. But Heffernan did something more than ask about money. He did something most agricultural economists never thought of doing: He asked the farmers how they felt. He asked them, decade after decade, how much they trusted the companies they worked for and how well they were treated. In doing this, Heffernan assembled a picture that most economists missed. He tracked the relationship between the powerful and the powerless.
”
Christopher Leonard (The Meat Racket: The Secret Takeover of America's Food Business)
“
An astute reader may note that the former kind of scientist may publish results faster if the datasets are pristine and reviewers are generous. Fools and optimists are invited to rely on these two miracles. Realists should automate.
”
Anthony Scopatz (Effective Computation in Physics: Field Guide to Research with Python)
“
Thus, multiple regression requires two important tasks: (1) specification of independent variables and (2) testing of the error term. An important difference between simple regression and multiple regression is the interpretation of the regression coefficients in multiple regression (b1, b2, b3, …) in the preceding multiple regression model. Although multiple regression produces the same basic statistics discussed in Chapter 14 (see Table 14.1), each of the regression coefficients is interpreted as its effect on the dependent variable, controlled for the effects of all of the other independent variables included in the regression. This phrase is used frequently when explaining multiple regression results. In our example, the regression coefficient b1 shows the effect of x1 on y, controlled for all other variables included in the model. Regression coefficient b2 shows the effect of x2 on y, also controlled for all other variables in the model, including x1. Multiple regression is indeed an important and relatively simple way of taking control variables into account (and much easier than the approach shown in Appendix 10.1).
Key Point: The regression coefficient is the effect on the dependent variable, controlled for all other independent variables in the model.
Note also that the model given here is very different from estimating separate simple regression models for each of the independent variables. The regression coefficients in simple regression do not control for other independent variables, because they are not in the model. The word independent also means that each independent variable should be relatively unaffected by other independent variables in the model. To ensure that independent variables are indeed independent, it is useful to think of the distinctively different types (or categories) of factors that affect a dependent variable. This was the approach taken in the preceding example. There is also a statistical reason for ensuring that independent variables are as independent as possible. When two independent variables are highly correlated with each other (r2 > .60), it sometimes becomes statistically impossible to distinguish the effect of each independent variable on the dependent variable, controlled for the other. The variables are statistically too similar to discern disparate effects. This problem is called multicollinearity and is discussed later in this chapter. This problem is avoided by choosing independent variables that are not highly correlated with each other.
A WORKING EXAMPLE
Previously (see Chapter 14), the management analyst with the Department of Defense found a statistically significant relationship between teamwork and perceived facility productivity (p <.01). The analyst now wishes to examine whether the impact of teamwork on productivity is robust when controlled for other factors that also affect productivity. This interest is heightened by the low R-square (R2 = 0.074) in Table 14.1, suggesting a weak relationship between teamwork and perceived productivity. A multiple regression model is specified to include the effects of other factors that affect perceived productivity.
Thinking about other categories of variables that could affect productivity, the analyst hypothesizes the following: (1) the extent to which employees have adequate technical knowledge to do their jobs, (2) perceptions of having adequate authority to do one’s job well (for example, decision-making flexibility), (3) perceptions that rewards and recognition are distributed fairly (always important for motivation), and (4) the number of sick days. Various items from the employee survey are used to measure these concepts (as discussed in the workbook documentation for the Productivity dataset). After including these factors as additional independent variables, the result shown in Table 15.1 is
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
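A sketch of such a model with statsmodels, on synthetic survey-style data whose variable names merely echo the example (teamwork, technical knowledge, authority, fair rewards, sick days); it is not the book's Productivity dataset. Each coefficient reported by summary() is read as an effect on the dependent variable controlled for the other predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in for the employee survey described in the text.
df = pd.DataFrame({
    "teamwork": rng.normal(5, 1, n),
    "tech_knowledge": rng.normal(5, 1, n),
    "authority": rng.normal(5, 1, n),
    "fair_rewards": rng.normal(5, 1, n),
    "sick_days": rng.poisson(4, n),
})
df["productivity"] = (
    0.4 * df["teamwork"] + 0.3 * df["tech_knowledge"] + 0.2 * df["fair_rewards"]
    - 0.1 * df["sick_days"] + rng.normal(0, 1, n)
)

X = sm.add_constant(df[["teamwork", "tech_knowledge", "authority", "fair_rewards", "sick_days"]])
model = sm.OLS(df["productivity"], X).fit()

# Each coefficient: effect on productivity, controlled for the other predictors.
print(model.summary())
```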
“
In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far outstripped the capacity of manual analysis.
”
Foster Provost (Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking)
“
The nonparametric alternative, Spearman’s rank correlation coefficient (r, or “rho”), looks at correlation among the ranks of the data rather than among the values. The ranks of data are determined as shown in Table 14.2 (adapted from Table 11.8):
Table 14.2 Ranks of Two Variables
In Greater Depth … Box 14.1 Crime and Poverty
An analyst wants to examine empirically the relationship between crime and income in cities across the United States. The CD that accompanies the workbook Exercising Essential Statistics includes a Community Indicators dataset with assorted indicators of conditions in 98 cities such as Akron, Ohio; Phoenix, Arizona; New Orleans, Louisiana; and Seattle, Washington. The measures include median household income, total population (both from the 2000 U.S. Census), and total violent crimes (FBI, Uniform Crime Reporting, 2004). In the sample, household income ranges from $26,309 (Newark, New Jersey) to $71,765 (San Jose, California), and the median household income is $42,316. Per-capita violent crime ranges from 0.15 percent (Glendale, California) to 2.04 percent (Las Vegas, Nevada), and the median violent crime rate per capita is 0.78 percent. There are four types of violent crimes: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. A measure of total violent crime per capita is calculated because larger cities are apt to have more crime. The analyst wants to examine whether income is associated with per-capita violent crime.
The scatterplot of these two continuous variables shows that a negative relationship appears to be present: The Pearson’s correlation coefficient is –.532 (p < .01), and the Spearman’s correlation coefficient is –.552 (p < .01). The simple regression model shows R2 = .283. The regression model is as follows (t-test statistic in parentheses): The regression line is shown on the scatterplot.
Interpreting these results, we see that the R-square value of .283 indicates a moderate relationship between these two variables. Clearly, some cities with modest median household incomes have a high crime rate. However, removing these cities does not greatly alter the findings. Also, an assumption of regression is that the error term is normally distributed, and further examination of the error shows that it is somewhat skewed. The techniques for examining the distribution of the error term are discussed in Chapter 15, but again, addressing this problem does not significantly alter the finding that the two variables are significantly related to each other, and that the relationship is of moderate strength. With this result in hand, further analysis shows, for example, by how much violent crime decreases for each increase in household income. For each increase of $10,000 in average household income, the violent crime rate drops 0.25 percent. For a city experiencing the median 0.78 percent crime rate, this would be a considerable improvement, indeed. Note also that the scatterplot shows considerable variation in the crime rate for cities at or below the median household income, in contrast to those well above it. Policy analysts may well wish to examine conditions that give rise to variation in crime rates among cities with lower incomes.
Because Spearman’s rank correlation coefficient examines correlation among the ranks of variables, it can also be used with ordinal-level data.9 For the data in Table 14.2, Spearman’s rank correlation coefficient is .900 (p = .035).10 Spearman’s rho-squared coefficient has a “percent variation explained” interpretation, similar
”
Evan M. Berman (Essential Statistics for Public Managers and Policy Analysts)
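The coefficients quoted above (-.532 and -.552) come from the book's Community Indicators dataset, which is not reproduced here. The sketch shows only how the two coefficients are computed side by side, on placeholder city-level numbers, with scipy.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder city-level values; the book's figures come from its own dataset,
# not from these numbers.
median_income = np.array([26309, 31000, 38500, 42316, 48000, 55000, 63000, 71765])
violent_crime_pct = np.array([1.9, 1.6, 1.1, 0.9, 0.8, 0.6, 0.4, 0.3])

r, _ = pearsonr(median_income, violent_crime_pct)
rho, _ = spearmanr(median_income, violent_crime_pct)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")  # both strongly negative here
```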
“
Note that master-follower databases are not distributed: every machine has a full copy of the dataset. Master-follower replication is great for scaling up the processing power available for handling read requests, but does nothing to accommodate arbitrarily large datasets. Master-follower replication also provides some resilience against machine failure: in particular, failure of a machine will not result in data loss, since other machines have a full copy of the same dataset.
”
Mat Brown (Learning Apache Cassandra: Manage Fault Tolerant and Scalable Real-Time Data)
“
Summary Gaining insight from massive and growing datasets, such as those generated by large organizations, requires specialized technologies for each step in the data analysis process. Once organizational data is cleaned, merged, and shaped into the form desired, the process of asking questions about data is often an iterative one. MapReduce frameworks, such as the open-source Apache Hadoop project, are flexible platforms for the economical processing of large amounts of data using a collection of commodity machines. Although it is often the best choice for large batch-processing operations, MapReduce is not always the ideal solution for quickly running iterative queries over large datasets. MapReduce can require a great deal of disk I/O, a great deal of administration, and multiple steps to return the result of a single query. Waiting for results to complete makes iterative, ad hoc analysis difficult. Analytical databases
”
Anonymous
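The MapReduce model the summary refers to can be shrunk to an in-memory Python sketch: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step combines each group. This is only an illustration of the programming model; a real Hadoop job distributes the same steps across machines and writes intermediate results to disk, which is exactly the overhead the passage is pointing at.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data big insight", "data beats opinion", "big opinions"]

# Map: emit (key, value) pairs from each input record.
mapped = chain.from_iterable(((word, 1) for word in doc.split()) for doc in documents)

# Shuffle: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 2, ...}
```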
“
Google BigQuery, a technology that is very different from, and often complementary to, many of the other technologies covered in the rest of this book. BigQuery, which is a hosted service accessed through an API, allows developers to run queries over large datasets and obtain results very quickly. We’ll
”
Anonymous
“
Climate Model Predictions vs. Reality. Sources: Hansen et al. (1988); RSS; Met Office Hadley Centre HadCRUT4 dataset; RSS Lower troposphere data.
Note in particular that since the late 1990s, there has been no increase in average temperatures. Hansen and every other believer in catastrophic global warming expected that there would be, for the simple reason that we have used record, accelerating amounts of CO2.
”
Alex Epstein (The Moral Case for Fossil Fuels)
“
into large and complex datasets is a prevalent theme in current visualization research for which different approaches are pursued. Topology-based methods are built on the idea of abstracting characteristic structures such as the topological skeleton from the data and to construct the visualization accordingly. Even
”
Helwig Hauser (Topology-based Methods in Visualization (Mathematics and Visualization))
“
Raw data has to be edited, converted to other formats, and linked with other datasets; statistical analysis has to be performed, sometimes with custom software; and plots and tables have to be created from the results. This is often done by hand, with bits of data copied and pasted into different data files and spreadsheets—a tremendously error-prone process. There
”
Alex Reinhart (Statistics Done Wrong: The Woefully Complete Guide)
“
His doubts recall Benford’s Law, a theory about the frequency with which digits will appear in data. One implication of this law is that datasets with lots of zeroes at the end often turn out to be fraudulent.
”
Simon Kuper (Soccernomics: Why England Loses, Why Germany and Brazil Win, and Why the U.S., Japan, Australia, Turkey--and Even Iraq--Are Destined to Become the Kings of the World's Most Popular Sport)
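Strictly, Benford's Law describes the distribution of leading digits (the "zeroes at the end" point is about over-rounded figures). A minimal first-digit check against the expected log10(1 + 1/d) frequencies might look like the sketch below, on made-up values; serious forensic use needs large samples and a formal goodness-of-fit test.

```python
import math
from collections import Counter

def first_digit_distribution(values):
    """Observed share of each leading digit (1-9) among nonzero numbers."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Benford's expected frequency for leading digit d is log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

values = [1200, 1875, 2130, 340, 99, 4020, 1.7, 8.9, 1300, 2600]  # made-up data
observed = first_digit_distribution(values)
for d in range(1, 10):
    print(d, f"expected {benford[d]:.3f}", f"observed {observed[d]:.3f}")
```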
“
Streamgraph,6 a comparative, flowing area graph developed as an improved way to compare large, changing datasets over time. Similarly, Edward Tufte’s Sparklines visualization7 concept (developed in the late 1990s) embedded short, intense line-graph metrics of a single changing metric within surrounding text. This type of visualization has become a common feature of online financial publications.
”
Anonymous
“
climate researchers who produce the Vostok dataset well know, there is an average 800-year lag between these two variables, with the temperature changes preceding the CO2 changes.
”
Roy W. Spencer (The Great Global Warming Blunder: How Mother Nature Fooled the World’s Top Climate Scientists)
“
Among more than 11,000 long-term couples, machine learning models found that the traits listed below, in a mate, were among the least predictive of happiness with that mate. Let’s call these traits the Irrelevant Eight, as partners appear about as likely to end up happy in their relationship when they pair off with people with any combo of these traits: Race/ethnicity Religious affiliation Height Occupation Physical attractiveness Previous marital status Sexual tastes Similarity to oneself What should we make of this list, the Irrelevant Eight? I was immediately struck by an overlap between the list of irrelevant traits and another data-driven list discussed in this chapter. Recall that I had previously discussed the qualities that make people most desirable as romantic partners, according to Big Data from online dating sites. It turns out that that list—the qualities that are most valued in the dating market, according to Big Data from online dating sites—almost perfectly overlaps with the list of traits in a partner that don’t correlate with long-term relationship happiness, according to the large dataset Joel and her coauthors analyzed. Consider, say, conventional attractiveness. Beauty, you will recall, is the single most valued trait in the dating market; Hitsch, Hortaçsu, and Ariely found in their study of tens of thousands of single people on an online dating site that who receives messages and who has their messages responded to can, to a large degree, be explained by how conventionally attractive they are. But Joel and her coauthors found, in their study of more than 11,000 long-term couples, that the conventional attractiveness of one’s partner does not predict romantic happiness. Similarly, tall men, men with sexy occupations, people of certain races, and people who remind others of themselves are valued tremendously in the dating market. (See: the evidence from earlier in this chapter.) But ask thousands of long-term couples and there is no evidence that people who succeeded in pairing off with mates with these desired traits are any happier in their relationship.
”
Seth Stephens-Davidowitz (Don't Trust Your Gut: Using Data to Get What You Really Want in Life)
“
What I am telling you here is actually nothing new. So why switch from analyzing assumption-based, transparent models to analyzing assumption-free black box models? Because making all these assumptions is problematic: They are usually wrong (unless you believe that most of the world follows a Gaussian distribution), difficult to check, very inflexible and hard to automate. In many domains, assumption-based models typically have a worse predictive performance on untouched test data than black box machine learning models. This is only true for big datasets, since interpretable models with good assumptions often perform better with small datasets than black box models. The black box machine learning approach requires a lot of data to work well. With the digitization of everything, we will have ever bigger datasets and therefore the approach of machine learning becomes more attractive. We do not make assumptions, we approximate reality as close as possible (while avoiding overfitting of the training data).
”
Christoph Molnar (Interpretable Machine Learning: A Guide For Making Black Box Models Explainable)
“
But the dawn of datasets cast doubt on this theory.11 While civil wars were increasingly being fought by ethnic factions, researchers such as Paul Collier and Anke Hoeffler at Oxford, and Fearon and Laitin at Stanford, found that ethnically diverse countries were not necessarily more prone to war than ethnically homogeneous ones. This was a puzzling finding: If diversity didn’t matter, then why did so many civil wars break down along ethnic or religious lines? This prompted the Political Instability Task Force to include more nuanced measures of ethnicity in their model. Instead of looking at the number of ethnic or religious groups in a country or the different types of groups, they looked at how ethnicity was connected to power: Did political parties in a country break down along ethnic, religious, or racial lines, and did they try to exclude one another from power? The PITF had been collecting and analyzing data for years when they discovered a striking pattern. One particular feature of countries turned out to be strongly
”
Barbara F. Walter (How Civil Wars Start: And How to Stop Them)
“
There is perhaps no more heartening proof of the role of environment in human intelligence than the Flynn effect, the worldwide phenomenon of upwardly trending IQ, named for the New Zealand psychologist who first described it. Since the early years of the twentieth century, gains have ranged between nine and twenty points per generation in the United States, Britain, and other industrialized nations for which reliable data-sets are available. With our knowledge of evolutionary processes, we can be sure of one thing: we are not seeing wholesale genetic change in the global population. No, these changes must be recognized as largely the fruits of improvement in overall standards both of education and of health and nutrition. Other factors as yet not understood doubtless play a role, but the Flynn effect serves nicely to make the point that even a trait whose variation is largely determined by genetic differences is in the end significantly malleable. We are not mere puppets upon whose strings our genes alone tug.
”
James D. Watson (DNA: The Secret of Life, Fully Revised and Updated)
“
FIGURE 1.10 “Alternative datasets” derived from web scraping: most popular at funds at present.
”
Alexander Denev (The Book of Alternative Data: A Guide for Investors, Traders and Risk Managers)
“
Another legal issue associated with alternative data is whether a particular dataset constitutes material non-public information (MNPI).
”
Alexander Denev (The Book of Alternative Data: A Guide for Investors, Traders and Risk Managers)
“
To avoid getting fooled by spurious correlations, we need to consider additional variables that would be expected to change if a particular causal explanation were true. Twenge does this by examining all the daily activities reported by individual students, in the two datasets that include such measures. Twenge finds that there are just two activities that are significantly correlated with depression and other suicide-related outcomes (such as considering suicide, making a plan, or making an actual attempt): electronic device use (such as a smartphone, tablet, or computer) and watching TV. On the other hand, there are five activities that have inverse relationships with depression (meaning that kids who spend more hours per week on these activities show lower rates of depression): sports and other forms of exercise, attending religious services, reading books and other print media, in-person social interactions, and doing homework.
”
Jonathan Haidt (The Coddling of the American Mind: How Good Intentions and Bad Ideas Are Setting up a Generation for Failure)
“
The advantage with data is that it’s not self-important or verbose. It doesn’t have a mission and it isn’t looking to deceive you. It’s simply there, and you can check it. Every good dataset can be collated with reality and that’s exactly what you must do as a journalist before you start to write. At some stage you also have to consider very carefully which part of the data you’re going to exploit.
”
Frederik Obermaier (The Panama Papers: Breaking the Story of How the Rich and Powerful Hide Their Money)
“
also to synthesize medical images along with lesion annotations (Frid-Adar et al. 2018). Learning to synthesize medical images, along with the segmentation of the lesions in the synthesized image, opens the possibility of automatically generating massive labeled datasets that can be used for supervised learning.
”
John D. Kelleher (Deep Learning)
“
But what do you do if your dataset is as inclusive as possible—say, something approximating the entirety of written English, some hundred billion words—and it’s the world itself that’s biased?
”
Brian Christian (The Alignment Problem: Machine Learning and Human Values)
“
columns and rows of a dataset so that it conforms with the following 3 rules of a “tidy” dataset [2, 3]:
1. Each variable forms a column
2. Each observation forms a row
3. Each value has its
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
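A minimal pandas sketch of those three rules, reshaping an invented wide lab table (the column names are hypothetical, not MIMIC's) into tidy form with melt.

```python
import pandas as pd

# Wide (untidy): one column per lab test, so the "test" variable hides in the headers.
wide = pd.DataFrame({
    "patient_id": [1, 2],
    "creatinine": [1.1, 0.9],
    "lactate": [2.3, 1.8],
})

# Tidy: each variable is a column, each observation (one patient, one test) is a row,
# and each value sits in its own cell.
tidy = wide.melt(id_vars="patient_id", var_name="test", value_name="value")
print(tidy)
```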
“
complete dataset it will be necessary to integrate the patient’s full set of lab values (including those not associated with the same MIMIC ICUSTAY identifier) with the record of that ICU admission without repeating or missing records. Using shared
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
“
“Tidy” datasets have the advantage of being more easily visualized and manipulated for later statistical analysis. Datasets exported from MIMIC usually are fairly “tidy” already; therefore, rule 2 is hardly ever broken. However,
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
“
more effective representation of the dataset without compromising the integrity of the original data. The objective of this step is to provide a version of the dataset on which the subsequent statistical analysis will be more effective. Data reduction
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
“
Viewing the Dataset There are several commands in R that are very useful for getting a ‘feel’ of your datasets and see what they look like before you start manipulating them. View the first and last
”
Mit Critical Data (Secondary Analysis of Electronic Health Records)
“
Let the dataset change your mindset
”
Hans Rosling
“
The likes of Google and Target are no more keen to share their datasets and algorithms than Newton was to share his alchemical experiments. Sometimes
”
Tim Harford (The Data Detective: Ten Easy Rules to Make Sense of Statistics)
“
You can find hundreds of interesting datasets in CSV format from kaggle.com.
”
Oliver Theobald (Machine Learning For Absolute Beginners: A Plain English Introduction (Second Edition) (AI, Data Science, Python & Statistics for Beginners))
“
A similar result from Bristol scientists using a huge dataset called the Avon Longitudinal Study—the gold standard of transgenerational research—showed that men who smoked before puberty sired fatter sons than those who smoked after. Again, something was apparently being acquired and passed on.
”
Adam Rutherford (A Brief History of Everyone Who Ever Lived: The Human Story Retold Through Our Genes)
“
or surprise illnesses. Yes, there is competition for this product from less complete datasets, but market share will grow as more people become aware of it. An initial PR push might help to get the ball rolling on links that will help search visibility, but eventually, there will be branded search, too, as users look for this most complete dataset. Non-branded queries will consist of all illnesses combined with a cost or price keyword. As the product grows, there could be iterations that incorporate more things beyond price, but at least from the outset, you have validated that there’s lots of demand. Zooming in on this product, there are many aspects that make it an ideal Product-Led SEO strategy. It is programmatic and scalable, creates something new, and addresses untapped search demand. Additionally, and most importantly, there is a direct path to a paying telehealth user. Users can access the data without being a current customer, but the cost differential between telehealth (when appropriate) versus in-person will lead some users down a discovery journey that ends with a conversion. A user who might never have considered telehealth might be drawn to the cost savings in reduced transportation and waiting times that they would never have known about had they not seen your content. Making a Decision Now, as the telehealth executive, you have two competing product ideas to choose from. While you can eventually do both, you can only do one at a time, as I suggested earlier. You will take both of these product ideas and spec out all the requirements. The conditions library might require buying a medical repository and licensing many stock photos, while the cost directory is built on open-source datasets.
”
Eli Schwartz (Product-Led SEO: The Why Behind Building Your Organic Growth Strategy)
“
Feature engineering happens when we transform raw data into a workable dataset. It requires both knowledge of the business or domain being worked on and creativity.
”
Ana Tavarez (Ciencia de Datos: Una Guía Práctica (Spanish Edition))
“
Another application that may be particularly vulnerable to adversarial attack is fingerprint reading. A team from New York University Tandon and Michigan State University showed that it could use adversarial attacks to design what it called a masterprint—a single fingerprint that could pass for 77 percent of the prints in a low-security fingerprint reader.14 The team was also able to fool higher-security readers, or commercial fingerprint readers trained on different datasets, a significant portion of the time. The masterprints even looked like regular fingerprints—unlike other spoofed images that contain static or other distortions—which made the spoofing harder to spot.
”
Janelle Shane (You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It's Making the World a Weirder Place)
“
As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples
”
Ian Goodfellow (Deep Learning (Adaptive Computation and Machine Learning series))
“
Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.
”
Ian Goodfellow (Deep Learning (Adaptive Computation and Machine Learning series))
“
And so, as time passed, more and more people begin to trust the algorithm to find them the best people to meet, the best jobs to apply for, even who they should vote for.” Scott felt a nudge from Cyrus, followed by a whisper: “Kill me now.” But there was no stopping Padooa. “Soon, society began to change. Subtly at first, yet over time it became clear to some that things were not right, as age-old institutions began to topple, and the fragmentation and polarization of society began to take root. Yet so powerful and useful was the algorithm in people's lives that those who gave themselves over to it completely began to prosper where others would fall behind. Now came a time where those who wished to have a good future had no option but to give themselves over to the algorithm, whether they wanted to or not. They had no choice. “This was the time when the Dataists came to power—zealots and extremists who believed in the all-pervading greatness of the algorithm. They advocated that all people should submit every aspect of their lives so that the algorithm could create a more all-encompassing dataset. Those that resisted this intrusion into their lives were seen as hindrances to the advancement of society. They were shunned and they were vilified, and ultimately, they were persecuted. “So strong did the Dataists become, with their quasi-religious mindset, that they began to gain executive power in many regions of Earth. It became a crime to withhold data from the algorithm, initially punishable by fines, but soon this became incarceration and finally, death by execution. So great was their belief that the future of humanity lay with the algorithm that they could not countenance errors.
”
Gerald M. Kilby (Evolution (The Belt #3))