What Does ‘Meaning’ Mean to a Search Engine?

Outside of the government and universities, no one hires more people with PhDs in linguistics than search engines. One major reason for this is that a search engine needs to deliver results that are based not on the string of characters you type into the search box, but on what that string of characters means. And, intuitively, other words that mean (nearly) the same thing may be just as relevant. There is a second reason why identifying words with similar meanings is crucial to the search engines’ success. Pay-per-click advertisers specify their keywords, but they are not always creative or clever enough with their keyword research. Serving those advertisers’ ads for other words that mean the same thing as the keywords they specified allows the search engines to produce more ad impressions, more clicks, and more revenue.

But what does it really mean to mean the same thing? Enter Latent Semantic Indexing (LSI), an algorithmic process for identifying “meaning-ish” relations between groups of words. For a given word, a search engine’s LSI algorithm identifies which other words are related to that word’s meaning, by tabulating across the internet as a whole which pairs of words most often co-occur within web pages. Linguists have spilled much ink on the issue of semantic identity and reached little agreement, but where philosophical scholarship had failed, entrepreneurial engineering would succeed, right? …Sort of.
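To make the co-occurrence idea concrete, here is a minimal Python sketch of the tabulation described above: count how many pages contain both words of a pair, then rank a word's neighbors by that count. This is only an illustration under loud assumptions. The function names and the toy corpus are invented, and a production LSI system does far more than raw pair counting (textbook LSI factors a term-by-document matrix rather than counting pairs directly).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(pages):
    """For each unordered pair of words, count how many pages contain both."""
    counts = Counter()
    for text in pages:
        # Use the set of distinct words so a pair is counted once per page.
        words = sorted(set(text.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def related_words(word, counts, top_n=3):
    """Rank other words by the number of pages they share with `word`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return scores.most_common(top_n)


# Toy corpus, invented for illustration only.
pages = [
    "hypertension obesity diet",
    "hypertension obesity",
    "hypertension hypotension",
    "obesity diabetes",
]
counts = cooccurrence_counts(pages)
print(related_words("hypertension", counts, top_n=1))
```

On this toy corpus the top neighbor of "hypertension" is "obesity": a word that appears on the same pages but does not mean the same thing. That is exactly the "Sort of" in the paragraph above.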

LSI can be helpful in grouping together words like “hypertension” and “high blood pressure,” which mean the same thing. Google provides a useful operator for seeing LSI in action. If we stick a tilde (“~”) in front of our query, we get results for all keywords related to our keyword. So a query for “~hypertension” includes results about high blood pressure. However, the co-occurrence-based formula sometimes goes overboard, associating words that co-occur frequently but don’t really mean the same thing. The “~hypertension” query also returns results pertaining to obesity and hypotension (low blood pressure). An “~obesity” query returns results about diabetes, fat, and weight loss. A “~cholesterol” query returns results about HDL, LDL, fat, triglycerides, diet, and even hypertension. Perhaps more important to marketers are the associations drawn between brand names. Pfizer may be unhappy to learn that the results for “~Lipitor” include pages about competing product Mevacor, but Colgate-Palmolive must be delighted to know that “~toothpaste” includes Colgate but not Crest or Aquafresh.

But what does this all mean? Ordinary folks obviously aren’t searching with tildes. However, search engines are matching paid and natural results to queries based partly on these related words and phrases.

For organic search, the use of LSI means that including synonyms and other related words in your page may sometimes bolster rather than hinder your ranking for a particular keyword, depending on how the LSI recognizes that relation. When selecting language for website copy, we must evaluate the quantitative relations between words, and select those synonyms which are most LSI-friendly.

For paid search, the issue is a bit more complex. The use of LSI relations in Google’s “broad” matching or Yahoo’s “advanced” matching means that your ad can show up for keywords that are not specified in your campaign, but which are “related” to your keywords. Sometimes this is a benefit, but sometimes it is a liability, or even a legal risk. For example, LSI means that your ads may be appearing for searches on your competitors’ brand names without you even knowing about it. I recently found this to be the case for a client, in the following scenario (please forgive the abstraction). Drugs A and B both treat condition X. Using LSI, the search engine treats queries for “Drug A” as queries related to “Condition X”. The paid search campaign for Drug B is buying the keyword “Condition X”. So, when users type “Drug A”, they see Drug B’s ad, even though Drug B is not buying the off-limits keyword “Drug A.” To avoid this situation, paid search advertisers should monitor not only the keywords in their campaign, but also the actual queries that users type. With increasing use of LSI relations in paid search, the two may be very different. Advertisers should also take advantage of negative keywords and restrictive match types, which make it possible to prevent impressions from inappropriate keywords.
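The monitoring step can be sketched in a few lines of Python. The example assumes you have exported a list of (typed query, matched keyword) pairs from your ad platform's search query report. The function name, the data layout, and the sample rows are all hypothetical, and nothing here calls a real advertising API.

```python
import re


def flag_off_limits_queries(search_terms, off_limits):
    """Return the (query, matched_keyword) rows whose typed query mentions
    an off-limits term as a whole word or phrase (case-insensitive)."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in off_limits
    ]
    return [
        (query, matched_keyword)
        for query, matched_keyword in search_terms
        if any(p.search(query) for p in patterns)
    ]


# Hypothetical report rows: what the user typed vs. the keyword that matched.
search_terms = [
    ("drug a side effects", "condition x"),
    ("condition x treatment", "condition x"),
    ("Drug A coupon", "condition x"),
]
flagged = flag_off_limits_queries(search_terms, ["Drug A"])
for query, keyword in flagged:
    print(f"'{query}' triggered the ad via keyword '{keyword}'")
```

Any term that turns up in the flagged rows is a candidate for the campaign's negative keyword list.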

Search engine linguists are doing their best to incorporate new natural language technologies that recognize what people mean, not just what they type. They have done this with varying degrees of success. LSI provides search engine marketers with both new risks and new opportunities. Recognizing and understanding this technology provides a great tactical advantage.

