Already our short career as blog authors has the potential to become lucrative. In the sidebar to your right, you'll notice that, as subscribers to Google's AdSense, we're provided with both a search feature and ads for products and services that are supposed to relate to the content of our site, and thus interest you, the reader.

I don't know much about how this works, and I'd love some clarification, but it's my understanding that Google "crawls" our site to determine its content - probably by searching for keywords, or determining the most common words on the site, or something like that. So Scott and I have been interested in keeping tabs on what the ads display.

Here's what Google thinks this website is about:
1. Blogs. Not a big surprise: it is a new blog, and a lot of the discussion on the site has been about the nature of blogging, so we've gotten linked up with a lot of ads for blog hosting sites and so on.
2. English for foreign speakers. In particular, these ads tend to use phrases like "perfect pronunciation," "speak like a native," and "English without an accent." It's pretty clear how our posts generated those types of phrases, too, but we were both baffled by:
3. Help patenting your invention. Until we remembered that Scott's notes to an american post (which was on the front page when these ads ran) contained a prose poem about that very topic.

The key thing that Google's crawlers don't understand is context, and the way it changes language usage. I'm also reminded of a time that I tried to use Google's image search to find a picture of that symbol that means a picture didn't display on a website. It was the one time the image search disappointed me, because no matter how I phrased it ("broken jpg image," "missing file symbol," and so on), the search would return jpgs of broken things, or pictures of missing things, and so on. There's no way to tell Google to search for a word only as it's used in a certain context.

When humans read a poem, they think "this is a poem" and they interpret it accordingly. For Google, all language has the same weight. You could write an entire website about how English should be spoken with an accent, but you'd still get those ads for perfect pronunciation because Google doesn't know what your intention is, and for the most part, it doesn't need to. When we humans read, we build up a set of intentions, and an idea of what the author is trying to say, but for Google, language is reducible to a set of statistics and a list of words.
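To make the "list of words" idea concrete, here's a rough sketch of what counting words without context looks like. This is only my guess at the general flavor, not Google's actual method; the `top_words` function, the stop-word list, and both sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, invented for this illustration.
STOP_WORDS = {"a", "an", "the", "with", "should", "be", "is", "to", "of", "and"}

def top_words(text, n=3):
    """Return the n most frequent non-stop words, ignoring order and context."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

# Two sentences with opposite intentions.
pro_accent = "English should be spoken with an accent. An accent is charming."
anti_accent = "English should be spoken without an accent. An accent is a flaw."

print(top_words(pro_accent))
print(top_words(anti_accent))
```

Both sentences produce the same top words, with "accent" in first place, even though they argue opposite things. A counter like this has no way to see the difference.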


Anonymous said...

Tracing the ad displays will be interesting. Not long ago, I was pursuing a thread of discussions at a variety of sites on the topic of plagiarism. At one weblog, created by an academic with an .edu address in her profile and a hearty editorial opposed to plagiarism, the Google ad list included a pulsating popup for a site named something along the lines of "College Research Papers for Free."

Dave P said...

You are mostly correct that Google uses the words on your site to match up ads with content. But you may be overestimating the complexity of the "similarity" algorithm.

Google doesn't do all the work in determining relevancy. Advertisers submit keywords with their ads, which Google matches to what it thinks are keywords in your content.

Perhaps a more complicated algorithm would work better in some cases, but then it would be more complicated :)
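Here's a minimal sketch of the kind of keyword matching I'm describing. To be clear, this is an assumption about the general approach, not Google's code: the `ADS` table, the `match_ads` function, and the sample page text are all hypothetical.

```python
import re

# Hypothetical ads with advertiser-chosen keywords, invented for illustration.
ADS = {
    "Blog hosting": {"blog", "blogging", "weblog"},
    "Accent coaching": {"accent", "pronunciation", "english"},
    "Patent help": {"patent", "invention"},
}

def match_ads(page_text, ads=ADS):
    """Rank ads by how many of their keywords appear on the page."""
    page_words = set(re.findall(r"[a-z]+", page_text.lower()))
    scores = {name: len(keywords & page_words) for name, keywords in ads.items()}
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    # Drop ads that share no keywords with the page.
    return [name for name, score in ranked if score > 0]

page = "A prose poem about my invention and the patent office, posted on our blog."
print(match_ads(page))
```

The matcher only sees that "invention" and "patent" appear on the page, so the patent ad ranks first. It has no idea the words came from a poem.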

As for your image search, you can force some very limited context recognition. I did an image search for "broken image" (in quotations), and the first result seems to be what you were looking for:

(What that tells Google is I want these words to appear in results 'as-is', next to each other, in this order.)
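A rough sketch of the difference, assuming a much simpler search than Google really runs: the two functions and the sample captions below are hypothetical, and treating an unquoted query as "any of these words" is my simplification.

```python
def matches_any_word(query, caption):
    """Unquoted-style match: any query word appears anywhere in the caption."""
    query_words = query.lower().split()
    caption_words = caption.lower().split()
    return any(word in caption_words for word in query_words)

def matches_phrase(query, caption):
    """Quoted-style match: the query words appear adjacent and in order."""
    query_words = query.lower().split()
    caption_words = caption.lower().split()
    last_start = len(caption_words) - len(query_words)
    return any(
        caption_words[i:i + len(query_words)] == query_words
        for i in range(last_start + 1)
    )

# Hypothetical image captions.
window = "broken window in an old image"
icon = "broken image icon"

print(matches_any_word("broken image", window), matches_phrase("broken image", window))
print(matches_any_word("broken image", icon), matches_phrase("broken image", icon))
```

The loose match accepts both captions, while the phrase match rejects the broken window and keeps only the caption where "broken image" appears as a unit.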

