Thursday, April 10, 2014


Sex, Violence, Autocomplete Algorithms ... and Miley ...
There go your plans to have Google help you out with your search for boob-related things.
Warning: This article contains explicit language.
Autocomplete is one of those modern marvels of real-time search technology that almost feels like it’s reading your mind. Thanks to analyzing and mining what millions of other users have already searched for and clicked on, Google knows that when you start typing a query with a “d,” you’re most likely looking for a dictionary. Besides the efficiency gains of not having to type as much, suggestions can be serendipitous and educational, spurring alternative query ideas. In the process our search behavior is subtly influenced by exposure to query possibilities we may not have considered if left to ourselves.
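To make the idea concrete, here is a minimal sketch of frequency-based completion: count past queries, then offer the most common ones that start with whatever has been typed so far. It is a toy under stated assumptions, not how Google or Bing actually rank suggestions. The function names and the tiny query log are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def build_counts(query_log: Iterable[str]) -> Counter:
    """Count how often each normalized (trimmed, lowercased) query appears."""
    return Counter(q.strip().lower() for q in query_log if q.strip())


def suggest(counts: Counter, prefix: str, k: int = 3) -> List[str]:
    """Return up to k logged queries that start with prefix, most frequent first."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically so ties are deterministic.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]


# Invented toy log, for illustration only (not real search data).
toy_log = [
    "dictionary", "dictionary", "dictionary",
    "dominos", "dominos",
    "dropbox",
    "weather",
]
counts = build_counts(toy_log)
print(suggest(counts, "d"))
```

In this toy log, "dictionary" comes first for the prefix "d" only because it appears most often. A real system would also weigh clicks, freshness, location, and much else.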
So what happens when unsavory things, perhaps naughty or even illegal, creep into those suggestions? As a society we probably don’t want to make it easier for pedophiles to find pictures of naked children or to goad the violently predisposed with new ideas for abuse. Such suggestions get blocked and filtered—censored—for their potential to influence us.
As Google writes in its autocomplete FAQ, “we exclude a narrow class of search queries related to pornography, violence, hate speech, and copyright infringement.” Bing, on the other hand, makes sure to “filter spam” as well as to “detect adult or offensive content,” according to a recent post on the Bing blog. Such human choices set the stage for broadly specifying what types of things get censored, despite Google’s claims that autocompletions are, for the most part, “algorithmically determined … without any human intervention.”
What exactly are the boundaries and editorial criteria of that censorship, and how do they differ among search engines? More importantly, what kinds of mistakes do these algorithms make in applying their editorial criteria? To answer these questions, I automatically gathered autosuggest results from hundreds of queries related to sex and violence in an effort to find those that are surprising or deviant. (See my blog for the methodological detail.) The results aren’t always pretty.
Armed with a list of 110 sex-related words, gathered from the linguistic extremes of both academic linguists and that tome of slang, the Urban Dictionary, I first sought to understand which words resulted in zero suggestions (which likely means the word is blocked). In the following diagram, you can see words blocked only by Google, only by Bing, by both, or by neither. For example, both algorithms think “prostitute” is just dandy, suggesting options for prostitute “phone numbers” or “websites.” They’re not about sexual deprivation: Bing is happy to complete searches for “masturbate” and “hand job.” Conspicuously, Bing does block query suggestions for “homosexual,” raising the question: Is there such a thing as a gay-friendly search engine? In response, a Microsoft spokesperson commented that “sometimes seemingly benign queries can lead to adult content,” and consequently are filtered from autosuggest. By that logic, it would seem that “homosexual” merely leads to “too much” adult content, causing the algorithm to flag and filter it.
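The zero-suggestions test can be sketched in a few lines. This is a hedged illustration rather than the script actually used for the analysis: the two suggester functions are hypothetical stand-ins for whatever code fetches suggestions from each engine, and the terms and stub responses are placeholders, not measured results.

```python
from typing import Callable, Dict, Iterable, List

# A suggester takes a query and returns that engine's list of suggestions.
Suggester = Callable[[str], List[str]]


def classify_blocked(
    words: Iterable[str], google: Suggester, bing: Suggester
) -> Dict[str, List[str]]:
    """Group words by which engine returns zero suggestions for them."""
    groups: Dict[str, List[str]] = {
        "google_only": [], "bing_only": [], "both": [], "neither": []
    }
    for word in words:
        # Zero suggestions is treated as "likely blocked", as in the text above.
        google_blocked = not google(word)
        bing_blocked = not bing(word)
        if google_blocked and bing_blocked:
            key = "both"
        elif google_blocked:
            key = "google_only"
        elif bing_blocked:
            key = "bing_only"
        else:
            key = "neither"
        groups[key].append(word)
    return groups


def make_stub(table: Dict[str, List[str]]) -> Suggester:
    """Build a fake suggester from a lookup table (hypothetical test helper)."""
    return lambda word: table.get(word, [])


# Placeholder terms and responses, for illustration only.
google_stub = make_stub({"term_b": ["term_b example"], "term_d": ["term_d example"]})
bing_stub = make_stub({"term_a": ["term_a example"], "term_d": ["term_d example"]})

groups = classify_blocked(
    ["term_a", "term_b", "term_c", "term_d"], google_stub, bing_stub
)
print(groups)
```

Passing the suggesters in as functions keeps the grouping logic separate from any network code, so it can be checked against stubs like these before it is pointed at a real service.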

Initially it would appear Google is stricter, blocking more sex-related words than Bing. But really they just have different strategies. Instead of outright blocking all suggestions for “dick” as Google does, Bing will just scrub the suggestions so you only see the clean ones, like “dick’s sporting goods.” Sometimes Bing will rewrite the query, pretending a dirty word was a typo instead. For instance, querying for “fingering” leads to wholesome dinner suggestions for “fingerling potato recipes,” and searching for “jizz” offers suggestions on “jazz,” for the musically minded searcher, of course. Both algorithms are pretty good about letting through more clinical terminology, such as “vaginas,” “nipples,” or “penises.”
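The three behaviors described above (block everything, scrub the list, rewrite the query) can be sketched as three small functions. This is purely illustrative and assumes nothing about how either company implements its filters. The function names, the blocked-term set, the cleanliness check, and the correction table are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def block_all(query: str, suggestions: List[str], blocked_terms: Set[str]) -> List[str]:
    """Strategy 1: show no suggestions at all if the query contains a blocked term."""
    if any(token in blocked_terms for token in query.lower().split()):
        return []
    return suggestions


def scrub(suggestions: List[str], is_clean: Callable[[str], bool]) -> List[str]:
    """Strategy 2: keep the query, but drop any suggestion judged not clean."""
    return [s for s in suggestions if is_clean(s)]


def rewrite(query: str, corrections: Dict[str, str]) -> str:
    """Strategy 3: treat a flagged query as a typo and swap in a 'corrected' one."""
    return corrections.get(query.lower(), query)


# Hypothetical inputs, loosely modeled on the examples in the text.
candidates = ["dick's sporting goods", "placeholder explicit suggestion"]
blocked_terms = {"dick"}
clean_list = {"dick's sporting goods"}
corrections = {"jizz": "jazz"}

print(block_all("dick", candidates, blocked_terms))     # nothing is shown
print(scrub(candidates, lambda s: s in clean_list))     # only the clean suggestion survives
print(rewrite("jizz", corrections))                     # the query is silently replaced
```

The difference matters for measurement: only the first strategy produces zero suggestions, so a zero-suggestions test alone will undercount filtering done by scrubbing or rewriting.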
For something like child pornography, the legal stakes get much higher. According to Ian Brown and Christopher Marsden in their book Regulating Code, “Many governments impose some censorship in their jurisdiction according to content that is illegal under national laws.” So it’s not entirely surprising that, in order to head off more direct government intervention, corporations like Google and Microsoft self-regulate by trying to scrub their autocomplete results clean of suggestions that lead to child pornography.