Seth Stephens-Davidowitz is a data
scientist who writes for The New York
Times. His NYT bestselling book is Everybody
Lies: What the Internet Can Tell Us about Who We Really Are.
A big problem with the traditional ways
social scientists have collected information about people’s behaviors, desires,
and opinions is that people lie even when there is no rational reason to do so.
The irrational reason to do so is that we like to virtue-posture and to present
ourselves as better than we are even when polls and surveys are completely
anonymous. Somehow that makes us feel better. So we exercise less and we eat more
pie than we tell pollsters. On anonymous surveys people underreport alcohol
consumption by some 50% (source: Substance Abuse and Mental Health Services Administration)
and underreport tobacco use by a similar percentage. We are able to determine
those last two pretty accurately because we collect taxes on alcohol and
tobacco, and the tax receipts tell us how much we actually consume – to which we should add
whatever amount somehow escapes taxation. (This barely compares to the extent
to which we like to virtue-posture in public and on social media, of course.) Search
engines and ad clicks on the internet, however, provide vast amounts of unfiltered
data about our real interests. We can tell for what people search and what they
access. We can tell how billions upon billions of clicks correlate with other
online clicks and how they correlate with the characteristics (such as sex and geographical
location) of the users. Data mining and analyses that were once time- and labor-intensive
now can be done with ease, most often without any human intervention
beyond the original algorithm. The AIs of Google, Amazon, and other major
online players never stop doing them.
A decade ago neuroscientists Ogi
Ogas and Sai Gaddam analyzed 55 million sexually oriented search terms
compiled by Dogpile, broke them down into categories, and tried to see what
these searches tell us about human sexuality. They published the creepy but
intriguing answers in their 2011 book A
Billion Wicked Thoughts. Searches both by men and women were fiercely un-PC
and offered many surprises. (I reviewed this book back on November
7, 2017.) Everybody Lies
is an updated and thematically less constrained look at what can be gleaned from
online data (not just personal searches and postings) and how online behavior
relates to real world behavior.
The title of the book is misleading. To be
sure, there is some attention to giving the lie to public posturing. For example,
the data show us that the more a borrower says how trustworthy he is, the less
likely he is to pay you back. (Many of us have been warned of this anecdotally,
but an analysis of words used on loan applications and subsequent default rates
proves the warning is well-founded.) Most of the book, though, is dedicated not
to lying per se but to data
correlations that sometimes defy and sometimes confirm expectations. Stephens-Davidowitz also
tells us of the ways businesses and political groups learn to manipulate us
with those correlations. As a real but minor example, Google tested differing
responses to otherwise identical ads depending on what shade of blue was used
in them. Big Data can be exploited in ways large and small to make online sites
for any purpose ever more addictive and persuasive.
For social scientists, online Big Data are a
vast resource. Says Stephens-Davidowitz, “If a violent movie comes to a city
does crime go up or down? If more people are exposed to an ad, do more people
use the product? If a baseball team wins when a boy is twenty, will he be more
likely to root for them when he is forty? These are all clear questions with
clear yes or no answers. And in the mountains of honest data we can find them.”
He warns us that correlation is not causation. For example, moderate drinkers
are healthier than either teetotalers or heavy drinkers, but that doesn’t
necessarily mean moderate drinking is healthful. It could be the other way
around: healthy people might be more inclined to drink moderately. Or there
might be some third factor (such as socializing) at work. From the correlation
alone we cannot tell. Nonetheless, it does tell us where to look, and with
enough other Big Data, we might well be able to tease out causation. Big Data
results also can mislead depending on what questions are asked. A
non-controversial example in our hyper-political age: in the matter of the
dispute between Lilliput and Blefuscu, if one searches the data for pirates who
break eggs at the small end and philanthropists who break eggs at the big end,
one will get real and fairly reliable answers, but the questions themselves are
biased and therefore so are the answers. They really don’t tell us much about
the characters of ordinary non-pirate non-philanthropist egg-crackers; the
answers are merely fodder for propaganda. However, properly done Big Data
analyses are ideal for discovering how effective that propaganda is.
The book has flaws. In many ways it is
oversimplified, though I think that comes from the habit of writing for The New York Times, which is written at
a 10th-grade (15-year-old) reading level. (This is high for a
newspaper: most are written at an 8th-grade level or lower.) Nonetheless,
if only as a warning to make an extra effort to think for oneself at a time
when so many tools are available to those who would prefer to think for us, the
book is still very much worth a read. Once again, though, don’t rely entirely on
the title.
Bo Diddley - You Can't Judge a
Book by Its Cover