Tuesday, January 22, 2019

Unpretty Little Liars

Seth Stephens-Davidowitz is a data scientist who writes for The New York Times. His NYT bestselling book is Everybody Lies: What the Internet Can Tell Us about Who We Really Are.

A big problem with the traditional ways social scientists have collected information about people’s behaviors, desires, and opinions is that people lie even when there is no rational reason to do so. The irrational reason is that we like to virtue-posture and to present ourselves as better than we are even when polls and surveys are completely anonymous. Somehow that makes us feel better. So we exercise less and eat more pie than we tell pollsters. On anonymous surveys people underreport alcohol consumption by some 50% (source: Substance Abuse and Mental Health Services Administration) and underreport tobacco use by a similar percentage. We are able to determine those last two pretty accurately because we collect taxes on alcohol and tobacco, and the tax receipts tell how much we actually consume, to which we should add whatever amount somehow escapes taxation. (This barely compares to the extent we like to virtue-posture in public and on social media of course.) Search engines and ad clicks on the internet, however, provide vast amounts of unfiltered data about our real interests. We can tell what people search for and what they access. We can tell how billions upon billions of clicks correlate with other online clicks and how they correlate with the characteristics (such as sex and geographical location) of the users. Data mining and analyses that were once time- and labor-intensive now can be done with ease, most often without any human intervention beyond the original algorithm. The AIs of Google, Amazon, and other major online players never stop doing them.

A decade ago neuroscientists Ogi Ogas and Sai Gaddam analyzed 55 million sexually oriented search terms compiled by Dogpile, broke them down into categories, and tried to see what these searches tell us about human sexuality. They published the creepy but intriguing answers in their 2011 book A Billion Wicked Thoughts. Searches by both men and women were fiercely un-PC and offered many surprises. (I reviewed this book back on November 7, 2017.) Everybody Lies is an updated and thematically less constrained look at what can be gleaned from online data (not just personal searches and postings) and how online behavior relates to real world behavior.

The title of the book is misleading. To be sure, there is some attention to giving the lie to public posturing. For example, the data show us that the more a borrower says how trustworthy he is, the less likely he is to pay you back. (Many of us have been warned of this anecdotally, but an analysis of words used on loan applications and subsequent default rates proves the warning is well-founded.) Most of the book, though, is dedicated not to lying per se but to data correlations that sometimes defy and sometimes confirm expectations. Stephens-Davidowitz also tells us of the ways businesses and political groups learn to manipulate us with those correlations. As a real but minor example, Google tested differing responses to otherwise identical ads depending on what shade of blue was used in them. Big Data can be exploited in ways large and small to make online sites for any purpose ever more addictive and persuasive.

For social scientists, online Big Data are a vast resource. Says Stephens-Davidowitz, “If a violent movie comes to a city does crime go up or down? If more people are exposed to an ad, do more people use the product? If a baseball team wins when a boy is twenty, will he be more likely to root for them when he is forty? These are all clear questions with clear yes or no answers. And in the mountains of honest data we can find them.” He warns us that correlation is not causation. For example, moderate drinkers are healthier than either teetotalers or heavy drinkers, but that doesn’t necessarily mean moderate drinking is healthful. It could be the other way around: healthy people might be more inclined to drink moderately. Or there might be some third factor (such as socializing) at work. From the correlation alone we cannot tell. Nonetheless, the correlation does tell us where to look, and with enough other Big Data, we might well be able to tease out causation. Big Data results also can mislead depending on what questions are asked. A non-controversial example in our hyper-political age: in the matter of the dispute between Lilliput and Blefuscu, if you search the data for pirates who break eggs at the small end and philanthropists who break eggs at the big end, you will get real and fairly reliable answers, but the questions themselves are biased and therefore so are the answers. They really don’t tell us much about the characters of ordinary non-pirate non-philanthropist egg-crackers; the answers are merely fodder for propaganda. However, properly done Big Data analyses are ideal for discovering how effective that propaganda is.

The book has flaws. In many ways it is oversimplified, though I think that comes from the habit of writing for The New York Times, which is written to a 10th-grade (15-year-old) reading level. (This is high for a newspaper: most are written to an 8th-grade level or lower.) Nonetheless, if only as a warning to make an extra effort to think for oneself at a time when so many tools are available to those who would prefer to think for us, the book is still very much worth a read. Once again, though, don’t rely entirely on the title.


Bo Diddley - You Can't Judge a Book by Its Cover
