Kaufer: We tried different approaches. We tried to randomly crawl the Web
and we thought, “How are we going to randomly select out of the billions of
pages?” So we tried crawling from known travel hubs. We’d start from the
Yahoo Travel directory and see where those sites led us. We tried to pick out
good, interesting information and automatically categorize it. That didn’t work
so well. What we call the signal-to-noise ratio wasn’t good enough—meaning
that, when they got our results back, people wouldn’t say, “Oh yeah, that’s what
I was looking for.”
We ended up looking at all of the published sources of information—newspapers,
magazines—and manually went through all the websites from all these
places to find the ones that had free access to the back issues of their travel articles.
Then we hired people to read every single travel article we could find on
the Net, and classify that article into our database, and write a one-line summary.
It’s a fairly significant effort, and people that we talked to said, “You’re
nuts. You’ll never finish.” But if you actually do the math, you realize that you
can work through the backlog (it took us a couple of years, but it was only a
couple of years) and then can stay current with what’s being published without
too much of an effort.
We take half an hour to read an article, on average, and we’ll tag that article
as being relevant to everything the article talks about. If the article is about
Maui and things to do in Hawaii and these two resorts, whenever you’re searching
for Maui or things to do in Hawaii or those two resorts, that article will
come up. If that article happened to mention, “The beaches in Maui are much
better than the beaches in Fort Lauderdale,” and you were to search on the
beaches in Fort Lauderdale, that article is not going to come up, because our
search isn’t keyword-based. It doesn’t matter if the article happens to mention
something; you only want to read the article if it’s actually giving you an opinion
on the topic you’re researching.
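The difference between tagging by topic and matching keywords can be made concrete with a small sketch. This is an illustration only, not a description of the actual system: the class name `TopicIndex`, its methods, and the sample article are all hypothetical, and the tags stand in for what a human reader would assign after reading the article.

```python
from collections import defaultdict


class TopicIndex:
    """Hypothetical index: articles are found by editor-assigned topics."""

    def __init__(self):
        self._by_topic = defaultdict(list)  # topic -> list of article titles
        self._text = {}                     # title -> full article text

    def add(self, title, text, topics):
        """Store an article under the topics a human reader assigned to it."""
        self._text[title] = text
        for topic in topics:
            self._by_topic[topic.lower()].append(title)

    def search_by_topic(self, topic):
        """Return only the articles tagged as being about this topic."""
        return list(self._by_topic.get(topic.lower(), []))

    def search_by_keyword(self, phrase):
        """Naive keyword search, for contrast: any mention counts as a hit."""
        needle = phrase.lower()
        return [t for t, body in self._text.items() if needle in body.lower()]


index = TopicIndex()
index.add(
    title="Maui getaway",  # made-up sample article
    text="The beaches in Maui are much better than the beaches in Fort Lauderdale.",
    topics=["Maui", "Things to do in Hawaii"],
)

print(index.search_by_topic("Maui"))               # tagged, so it is returned
print(index.search_by_topic("Fort Lauderdale"))    # only mentioned, so no result
print(index.search_by_keyword("Fort Lauderdale"))  # keyword match would return it
```

In this sketch the passing mention of Fort Lauderdale never reaches the topic index, which is the behavior described above: an article surfaces only for the subjects it was judged to be about.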
What we ended up with was a much smaller database as measured by the
number of documents that we’d indexed, but extremely, extremely relevant.
You go to a page about Maui, and every article on that page really is about
Maui, sorted to a pretty good degree based upon which article most people
would rather read first. Would you rather read an article that has a paragraph
about Maui in talking about fun beaches around the world, or an article all
about beaches in Maui? Probably the latter, so that’s why the article is sorted
first.