Much has been made of the sinking fortunes of newspapers in the face of competition from the web. I know little about the news industry, but that won’t stop me from putting in my technological two cents here.
The only way for the newspapers to survive and flourish on the internet is to become a news web. Becoming a web is much more than simply putting pages on the web, which is what newspapers have done so far.
Most papers today toss up a web site with an article archive and a superficial full-text search engine, and call it a day. (It’s considered the height of innovation to not hide the archive behind a paywall.) Google eats the newspapers’ lunch because Google can offer cross-cutting (though equally superficial) keyword search of all the newspapers.
Google enjoys a financial advantage because it catches users at the monetizable moment when they are searching. By the time the user clicks through to the article they are too deeply absorbed to click offsite and generate revenue for the paper.
This is ironic, because locked within the heads of reporters and editors is an understanding of the news that Google couldn’t possibly hope to recreate or even recapture through automated means.
What the newspapers need is technology that captures their unique understanding of the news, so that they, rather than Google, become the vertical search engine of choice for news. This is a tricky problem not only in information architecture but also in usability. How do you create a system that mere mortals can use, that allows semantically rich querying and browsing across events, people, and time? How do you coax and incent reporters and editors into capturing their implicit knowledge?
By saying that newspapers must “become a web”, I mean that they must convert their effectively flat archives into a meaningfully interlinked and semantically searchable web of news. I also mean to imply that newspapers must give up their walled-silo approach to information and support cross-cutting search and interlinking of content from all different publishers, otherwise Google and other search engines will continue to win out.
Here’s a concrete example. Suppose I want to find out what speeches John McCain made on Iraq in 2002. Here are the results from the New York Times archive for “John McCain Iraq Speech”, and here is the Google News search for “John McCain Iraq Speech”, both restricted to 2002.
Because both Google and the NYT offer only full-text search, relevance is poor, as is the quality of the summary snippets. For example, in cases like this one, the speeches on Iraq are actually by Bush, and McCain is quoted only in passing.
Imagine instead that the reporter who wrote the article had tagged it in the internal news database as a speech, and recorded that the speaker was Senator John McCain. Then the news site could offer a search that precisely answered my query. Furthermore, it could easily offer a browseable web of related links, such as other speeches by John McCain and speeches by other politicians on Iraq.
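To make the idea a little more tangible, here is a minimal sketch in Python of what such tagging might look like. Everything in it is an assumption made for illustration: the `Article` record, its `kind`, `speaker`, and `topics` fields, the two helper functions, and the sample headlines and dates are all invented, and no real newsroom system is being described.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Article:
    """Hypothetical archive record carrying the reporter's tags."""
    headline: str
    published: date
    kind: str                 # e.g. "speech", "analysis"
    speaker: Optional[str]    # who gave the speech, if the article covers one
    topics: FrozenSet[str]


def find_speeches(archive: List[Article], speaker: str,
                  topic: str, year: int) -> List[Article]:
    """Return speeches by `speaker` on `topic` published in `year`."""
    return [
        a for a in archive
        if a.kind == "speech"
        and a.speaker == speaker
        and topic in a.topics
        and a.published.year == year
    ]


def related_speeches(archive: List[Article],
                     article: Article) -> Dict[str, List[Article]]:
    """Group other speeches by how they relate to `article`."""
    others = [a for a in archive if a.kind == "speech" and a != article]
    return {
        # Other speeches by the same person, on any topic.
        "same_speaker": [a for a in others if a.speaker == article.speaker],
        # Speeches by other people that share at least one topic.
        "same_topic": [
            a for a in others
            if a.speaker != article.speaker and a.topics & article.topics
        ],
    }


# Invented sample data, for illustration only.
ARCHIVE = [
    Article("McCain addresses Senate on Iraq", date(2002, 10, 2),
            "speech", "John McCain", frozenset({"Iraq"})),
    Article("Bush makes case on Iraq; McCain reacts", date(2002, 10, 8),
            "speech", "George W. Bush", frozenset({"Iraq"})),
    Article("McCain speaks on campaign finance", date(2002, 3, 20),
            "speech", "John McCain", frozenset({"campaign finance"})),
    Article("Analysis: senators split on Iraq", date(2002, 9, 15),
            "analysis", None, frozenset({"Iraq"})),
]

if __name__ == "__main__":
    hits = find_speeches(ARCHIVE, "John McCain", "Iraq", 2002)
    for hit in hits:
        print(hit.headline)
        for relation, articles in related_speeches(ARCHIVE, hits[0]).items():
            print(" ", relation, [a.headline for a in articles])
```

In this sketch the Bush speech that merely quotes McCain no longer matches the query, because the match is on the tagged speaker rather than on words in the text; it turns up instead as a related link under the shared topic.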
By capturing the implicit knowledge of the reporter, the newspaper would not only make its information asset more valuable but also create a semantic web that is difficult for Google to compete with. On the internet as a whole, such a semantic web is perhaps a pipe dream, but within the specialized domain of professional news gathering it should be attainable.
Newspapers need to realize that their futures lie in neither news nor paper, but in capturing and organizing online the meaning of a complex web of events.