Monday, May 5, 2008

Should smart photos and smart text editors lead the semantic web?

There has been on-again, off-again chatter about the semantic web. I can appreciate the goal of making the information on the Internet more searchable, processable, and valuable.

Exactly how we get there is anyone’s guess.

The classic approach is to focus on text content. That makes sense, because that’s where most of the value on the web has been up to this point. However, with the explosive growth of digital cameras, live video feeds pounding at the door, and ever smarter cameras (as I outlined in an earlier post), all of this may be changing.

Here’s the deal: Automatic semantic interpretation of text is a tough problem, and human-based tagging of text is a pain; it’ll only get us so far. What we need are algorithm-friendly tools that will ease the growth of the semantic web.

As I pointed out in the last post, one of the tricks we need to employ is leveraging sensed data. The thing is, for the most part text is written by a human and consists of nothing but text. Photos and video streams come from devices, and as such they potentially carry augmenting sensory information: there might be local or global positioning information, there might be depth maps that go beyond the images themselves, and so on. By combining this information with a priori knowledge, as I described in the post linked above, you could make some rather good inferences about what’s in the images, or at least what their context might be.
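To make that concrete, here’s a minimal sketch of how a geotag alone could “index” into a small store of a priori knowledge. The venue list, field names, and distance threshold are all made up for illustration, not any real system:

```python
# A minimal sketch: given sensed metadata from a camera (here, just a GPS fix),
# look it up against a small store of a priori knowledge to infer the photo's
# likely context. The venues and tags below are hypothetical.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Venue:
    name: str
    lat: float
    lon: float
    tags: tuple  # things we already "know" about this place

# A priori knowledge base (hypothetical entries).
KNOWN_VENUES = [
    Venue("Moscone Center, San Francisco", 37.7842, -122.4016,
          ("conference center", "product launches")),
    Venue("Golden Gate Bridge", 37.8199, -122.4783,
          ("landmark", "outdoors")),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def infer_context(photo_lat, photo_lon, max_km=0.5):
    """Return tags for the nearest known venue if the photo was taken close to one."""
    nearest = min(KNOWN_VENUES,
                  key=lambda v: haversine_km(photo_lat, photo_lon, v.lat, v.lon))
    if haversine_km(photo_lat, photo_lon, nearest.lat, nearest.lon) <= max_km:
        return {"venue": nearest.name, "tags": nearest.tags}
    return {"venue": None, "tags": ()}

# A photo geotagged near Moscone gets "conference center" context for free.
print(infer_context(37.7845, -122.4019))
```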

I think that leveraging the collective world of a priori knowledge, plus sensory information that can “index” into it, would give the semantic web its most scalable and powerful results in the near term.

In fact, it could change the whole search game. Assume for instance that you’re searching for information on some new gadget. Text searches work well, but all text being equal, it can be a bit tricky to find the best match based on the text alone. Search engines use authority and other measures to guess at what to return as search results. But suppose the writer of an article took and posted a photo auto-tagged with the product name, taken by the writer at the conference where the product was announced, from within a private press area. Now it might be a bad judgment, but based on the image in the article, not just the text, that article may be the closest thing to a primary reporting source. As such, it probably ought to rank higher than other articles, no matter how authoritative they might be in other respects.
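Here’s a hedged sketch of what that ranking signal might look like. The field names (auto_tags, taken_by_author, venue_matches_event) and the weights are assumptions of mine, not any real engine’s scoring:

```python
# A sketch of how photo provenance could feed a ranking signal on top of
# whatever text/authority score the engine already computes.
def rank_score(article, query_terms, authority):
    score = authority  # the engine's existing text- and link-based score
    for photo in article.get("photos", []):
        if any(t in photo.get("auto_tags", ()) for t in query_terms):
            score += 1.0   # the photo is about the thing being searched for
        if photo.get("taken_by_author"):
            score += 0.5   # first-hand shot, not a stock image
        if photo.get("venue_matches_event"):
            score += 1.0   # shot where the product was actually announced
    return score

article = {"photos": [{"auto_tags": ("gizmotron",),
                       "taken_by_author": True,
                       "venue_matches_event": True}]}
print(rank_score(article, ["gizmotron"], authority=2.0))  # 4.5
```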

Now text does give us useful information. Analyzing the words, sentences, quotes (essentially social links), text format (short declarative, essay, Q&A, bulleted, etc.), temporal context, and the like can give us clues about the meaning or context. But I also see potential information beyond what can be analytically extracted from the static text itself. Editors, for instance, could pay more attention to how we’re writing. When you’re typing, all text is equal. But if you’re going back time and again to edit a particular paragraph or sentence, that’s signaling something to the program. It may be useful; it may not. Or what about collaborative edits? From your coworker? From your boss? From an anonymous online editor within a wiki? Looking at these deltas, the editor may be able to infer what’s important. After all, you’re probably putting more time into the key points than the minor ones. This may be a bad guess, but it points out that how we type may contain quite useful information. Think about it: a movie about the US Declaration of Independence doesn’t focus primarily on the words of the document itself, but rather on the struggles over key words and phrases as it was written. The edits.
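A rough sketch of the “watch the edits” idea, assuming a hypothetical editor that keeps a revision count per paragraph; the data structure and threshold are invented for illustration:

```python
# An editor that records every revision of a paragraph, not just its final
# text, and flags the paragraphs the writer keeps coming back to.
from collections import Counter

class EditAwareDocument:
    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)
        self.edit_counts = Counter()

    def edit(self, index, new_text):
        """Apply a revision and remember that this paragraph was reworked."""
        self.paragraphs[index] = new_text
        self.edit_counts[index] += 1

    def likely_key_points(self, min_edits=3):
        """Paragraphs reworked repeatedly are probably the ones that matter."""
        return [self.paragraphs[i] for i, n in self.edit_counts.items()
                if n >= min_edits]

doc = EditAwareDocument(["Intro.", "The core argument.", "Closing."])
for revision in ["The core argument, v2.", "The core argument, v3.",
                 "The core argument, final."]:
    doc.edit(1, revision)
print(doc.likely_key_points())  # ['The core argument, final.']
```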

There’s one other area where semantic processing may be relatively easy, and that’s processing computer-generated content, for the most part. (Think databases at this point.) Column and table names in databases often mean something. An app searching the web and scanning computer-generated data ought to be able to leverage those names and the databases themselves. With developers coalescing around a common language, or subsets of languages, interpreting the results later gets easier. In some cases, the database-hosted information will be most important to interpretation; in some cases, the human-focused web pages will; and in some cases, it’ll be the intersection, union, or non-overlapping nature of the two.
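As a small sketch of leaning on that computer-generated structure, a crawler could map raw column names onto a shared vocabulary before attempting anything harder. The synonym table here is entirely made up:

```python
# Map raw column names from published tables onto a shared vocabulary.
# The concepts and synonyms below are hypothetical examples.
COLUMN_VOCAB = {
    "price": ("cost", "price", "msrp", "amount_usd"),
    "location": ("city", "location", "venue", "place"),
    "timestamp": ("date", "created_at", "published", "timestamp"),
}

def classify_columns(column_names):
    """Guess a shared semantic label for each raw column name, if any."""
    labels = {}
    for name in column_names:
        key = name.lower()
        labels[name] = next((concept for concept, synonyms in COLUMN_VOCAB.items()
                             if key in synonyms), None)
    return labels

print(classify_columns(["MSRP", "City", "created_at", "sku"]))
# {'MSRP': 'price', 'City': 'location', 'created_at': 'timestamp', 'sku': None}
```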

No matter what techniques actually make up the semantic web, my guess is that they will be incremental and will probably gain popularity and value because of some additional changes in how we do things. Might this be with smarter, sensory-based cameras? Dunno, but that’s where my guess is now.