Monday, May 5, 2008

Image recognition problem solved? The solution is easy to see.

There’s another round of bloggers talking this morning about image recognition–this time because tagging startup tagcow.com has entered the mix. Tagcow wants to help you tag images using some as-yet-undisclosed process. However it’s done, photographer Thomas Hawk is impressed with the service. Michael Arrington suspects that humans are behind the magical process. Could be. Image recognition is tough–no matter how much startup passion you apply to it.

My stomach churns every time I hear about another image reco startup. Why? Because I think they’re starting at the wrong end of the problem. For most image recognition, you don’t want to start with the image; you want to start before you’ve taken the image. Using whatever hardware or software combination you can, you want to sense or infer the relevant information directly at the time the image is captured and tag the photo based on that data. If you’re taking a photo of people, let the camera tag the general area of the image where the people are. The camera at least has the potential of detecting the people (via motion or IR sensing). This is actually quite doable. Not perfect, but doable for many standup shots.
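To make that concrete, here’s a minimal sketch (in Python, with numpy) of what “tag the general area where the people are” could look like, assuming the camera exposes a coarse motion/IR intensity map at capture time. The map, the threshold, and the function name are all hypothetical, not any real camera’s API.

```python
# Hypothetical sketch: tag the region of a frame where motion/IR sensing
# indicates people, at capture time. The sensor map and threshold are
# assumptions, not a real camera API.
import numpy as np

def people_region_tag(sensor_map: np.ndarray, threshold: float = 0.6):
    """Return a normalized bounding box (x0, y0, x1, y1) covering the
    cells of a motion/IR intensity map that exceed the threshold,
    or None if nothing registers."""
    hot = np.argwhere(sensor_map >= threshold)
    if hot.size == 0:
        return None
    (r0, c0), (r1, c1) = hot.min(axis=0), hot.max(axis=0)
    rows, cols = sensor_map.shape
    return (c0 / cols, r0 / rows, (c1 + 1) / cols, (r1 + 1) / rows)

# Example: a coarse 8x8 map with a warm blob toward the right-hand side
demo = np.zeros((8, 8))
demo[2:5, 5:7] = 0.9
print(people_region_tag(demo))  # -> roughly (0.625, 0.25, 0.875, 0.625)
```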

How might this work? The cameras need more sensing built in and open access to this information.

Yes, cameras already include fairly sophisticated sensing. They can adjust image capture based on distance measurements, light measurements, guesses about horizons, objects moving in the frame, and so on. This is a good start. But it puts the pressure on the camera companies to do all the work. As people want to do more and more electronically with their images, however–tagging being the obvious example–the camera companies can’t keep up. One result is that people start dreaming up businesses to address the problems that the cameras aren’t solving. Unfortunately, they’re trying to solve the problem late in the pipeline, which makes their work far more challenging and, quite often, a waste of money.

The better solution? Build cameras that are open platforms–both in terms of software and hardware. You need to be able to add sensors focused on your tasks at hand. You need to be able to tweak the camera’s software not only to improve photo quality, but to target the tagging you need for the way you take photos. Many of the best techniques–whatever they turn out to be–will eventually make their way into the cameras themselves, but for the early adopters and trendsetters there usually won’t be enough there.

So what kind of hardware and software am I suggesting? I’d like to see hardware and software that directly sense or infer tags for the photos I’m taking, at the moment I take them–or at least derive the tags from the information sensed at that moment.

If there is any image processing to do, processing image sequences yields better data than you can get from analyzing a single frame. You can see motion. You can average out noise. You can build confidence measures over time. You can try to build context from frame to frame. Working with one frame is tough–at times even for a human.
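A small sketch of what sequence-level processing buys you, using numpy: averaging frames to knock down noise, differencing consecutive frames to see motion, and accumulating a per-pixel confidence over time. The frame shapes and the motion threshold are illustrative assumptions.

```python
# A minimal sketch of why frame sequences beat single frames: average out
# noise and detect motion by differencing. Shapes and thresholds are
# illustrative, not tuned values.
import numpy as np

def denoise_by_averaging(frames: list[np.ndarray]) -> np.ndarray:
    """Averaging N frames of the same scene reduces random sensor noise."""
    return np.mean(np.stack(frames), axis=0)

def motion_mask(prev: np.ndarray, curr: np.ndarray, thresh: float = 0.15) -> np.ndarray:
    """Pixels that changed between consecutive frames - something a single
    frame can never tell you."""
    return np.abs(curr.astype(float) - prev.astype(float)) > thresh

def motion_confidence(masks: list[np.ndarray]) -> np.ndarray:
    """Fraction of frames in which each pixel registered motion: a simple
    per-pixel confidence measure that builds up over time."""
    return np.mean(np.stack(masks).astype(float), axis=0)
```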

OK, so you’re shaking your head, insisting that there’s no way all this hardware and software can be supported in a camera. Even if it were available today, it would weigh down the camera or eat up all the power. Quite possibly. But there’s nothing forcing everything to be within the camera itself. The key is to build cameras with open communication and enable a market of companion devices and services.

What kind of communication am I suggesting? You want all the data being collected by the camera sent to the companion device–in real time. You want access to all the controls within the camera from the companion unit. In essence, you want to be able to process the images using whatever it takes, then turn around and tell the camera to adjust the image this way or that before and after it is taken, and to tag this or that part of the image based on the sensed information. So at its most basic level you want a real-time video stream out from the camera and a control path back (possibly including processed image(s) and additional sensed EXIF data). Alternatively, you want open extensibility within the cameras themselves. If you want to add a gyro sensor, you should be able to do so.
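Here’s one hedged sketch of what that loop might look like as data structures: a per-frame packet streaming out of the camera and a control/tag message coming back from the companion. Every message and field name here is an assumption, not a proposal for any specific standard.

```python
# A sketch of the open camera-to-companion loop described above: every frame
# goes out with its sensed data, and the companion answers with control or
# tag messages. All names and fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class FramePacket:            # camera -> companion, in real time
    frame_id: int
    timestamp: float          # seconds since epoch
    pixels: bytes             # raw or preview-resolution image data
    sensed: dict = field(default_factory=dict)   # GPS, orientation, light, ...

@dataclass
class ControlMessage:         # companion -> camera
    frame_id: int
    adjustments: dict = field(default_factory=dict)  # e.g. {"exposure_ev": 0.3}
    tags: list = field(default_factory=list)         # e.g. ["half dome", "people:upper-left"]
```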

So what kind of sensors and data am I envisioning that cameras collect? Some simple ones: GPS for global positioning. Some new ones: camera orientation (inclination, compass heading, elevation, etc.), light conditions, distance measurements using time-of-flight or whatever technique, and so on. The trick is that if there’s anything you want to know about the image, try to sense it directly rather than guess at it later in software. Likewise, whatever you can sense directly, build processes that leverage that information the most, because it’s probably the most reliable and consistent data you’ll have.
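As a rough illustration, the sensed payload riding along with each frame (the sensed field in the packet sketch above) might look something like this; the field names and units are assumptions:

```python
# One possible shape for the directly sensed data named above - field names
# and units are assumptions, meant to travel alongside each image much like
# extended EXIF.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SensedMetadata:
    latitude: float             # GPS, decimal degrees
    longitude: float
    elevation_m: float          # metres above sea level
    compass_heading_deg: float  # 0 = north, clockwise
    inclination_deg: float      # camera tilt above/below the horizon
    lux: Optional[float] = None                  # measured light level
    subject_distance_m: Optional[float] = None   # e.g. a time-of-flight reading
```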

But sensors will get you only so far. And here comes the next big step. Cameras (or the processing of their images) need to leverage as much a priori knowledge about their surroundings as possible. If you’re taking a picture that intersects the GPS point latitude 37.74611, longitude -119.53194, then you’re probably taking a picture of Yosemite’s Half Dome. If you’re at this location and your elevation is 8,836 feet, then you’re at the top of Half Dome. Now place the elevation at 30,000 feet, and you might assume the Half Dome photo was taken from a plane. Three different interpretive tags, all useful. Essentially you’re leveraging “pre-tagging,” or “a priori tags,” of information.
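Here’s a toy version of that Half Dome reasoning: match the sensed position against a small a priori table, then refine the tag by elevation. For simplicity this uses the camera’s own position rather than the full ray intersection, and the table, radius, and thresholds are made-up assumptions; a real service would use proper geodesic math and far richer data.

```python
# Toy a priori tagging: look up the sensed position in a small landmark
# table and refine by elevation. Table, radius, and thresholds are
# assumptions for illustration only.
import math

LANDMARKS = [
    {"name": "Half Dome", "lat": 37.74611, "lon": -119.53194,
     "summit_ft": 8836, "radius_km": 2.0},
]

def a_priori_tags(lat, lon, elevation_ft):
    tags = []
    for lm in LANDMARKS:
        # crude flat-earth distance in km; fine at this scale
        dx = (lon - lm["lon"]) * 111.32 * math.cos(math.radians(lat))
        dy = (lat - lm["lat"]) * 111.32
        if math.hypot(dx, dy) <= lm["radius_km"]:
            tags.append(lm["name"])
            if abs(elevation_ft - lm["summit_ft"]) < 100:
                tags.append(f"top of {lm['name']}")
            elif elevation_ft > 20000:
                tags.append("aerial / taken from a plane")
    return tags

print(a_priori_tags(37.74611, -119.53194, 8836))   # ['Half Dome', 'top of Half Dome']
print(a_priori_tags(37.75, -119.53, 30000))        # ['Half Dome', 'aerial / taken from a plane']
```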

This pre-tagging notion can go even further. Think about it. There can be a priori tag services for sporting events, for graduations, for conferences and even showroom floors, for the national parks, and on and on. Imagine a service that the camera–or post-processing of the camera’s location and orientation data–can query to automatically tag the photo. Some of the tags could be entered by a Mahalo-like service, some by community efforts, some by the organizers of events. The point is: why are all 10,000 people attending a basketball game expected to tag their own photos of the game, when we already know they were there and we already know the main context of the location?

Why are we not leveraging a priori knowledge that such-and-such a location is Robert Scoble’s house (notice the implication of time)? Or the beach? Or, going further, my kitchen–or my backyard–or a booth at a conference–or a particular display area of a booth at a conference–or, with the right local positioning information, a particular gadget within the display area of a booth at a conference? It all depends on the sensed data collected from the camera. Some of these tags are easier to come by than others, but there’s lots of low-hanging a priori fruit.

Maybe such a service is provided by flickr, maybe by Live Search, maybe by the camera companies themselves, maybe by a Photoshop plugin, maybe all of the above. No doubt this would be a massive service on par with Google Earth or Virtual Earth, but can you imagine??? Now this is where the VCs should be putting their tagging money.

Can a tagging service help me find all pictures of my dog? Probably not. It may not even be able to tell a dog from a cat or a person (although maybe someone will figure that out too), but with the right information you may be able to leverage a priori tags to help in the search. You might have to think differently about searches–much as we’ve all adjusted to searching the “Google” way, if you will. For instance, to find all pictures of my dog I might think in terms of where he was and when. Was he in the backyard when I took a picture of him? Was he inside my house? That yields a much smaller set of images that someone could quickly scan through.
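A sketch of what that where-and-when style of search could look like over a priori-tagged photos; the photo records and tag names are invented for illustration:

```python
# Searching the a priori way: filter by where and when, then scan the
# (much smaller) result set by eye. Records below are made up.
from datetime import datetime

photos = [
    {"file": "img_0412.jpg", "tags": ["backyard"], "taken": datetime(2008, 4, 20, 16, 5)},
    {"file": "img_0413.jpg", "tags": ["kitchen"],  "taken": datetime(2008, 4, 20, 18, 30)},
    {"file": "img_0500.jpg", "tags": ["backyard"], "taken": datetime(2008, 5, 3, 10, 0)},
]

def search(place: str, start: datetime, end: datetime):
    return [p["file"] for p in photos
            if place in p["tags"] and start <= p["taken"] <= end]

# "Was he in the backyard that weekend?"
print(search("backyard", datetime(2008, 4, 19), datetime(2008, 4, 21)))  # ['img_0412.jpg']
```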

This doesn’t help with tagging the names of people in the photos, either. True. Maybe the human is best for this. But there are some possibilities. Tags could be shared and cross-referenced: if two people took essentially the same photo, with intersecting view rays, at nearly the same time, and both photos include people, and one is tagged, then the photo from the other person could be auto-tagged–maybe not at the level of individual faces, but of the image itself. Again, this depends on additional sensory information collected at the time a photo is taken.
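For what it’s worth, here’s a rough sketch of that cross-referencing idea: if two photos were taken within a short time window and their view rays (sensed position plus compass heading) cross in front of both cameras, propagate tags from one to the other. The local metre-based coordinates and thresholds are assumptions.

```python
# Cross-referencing two photos by time and intersecting view rays.
# Coordinates are local x/y in metres; thresholds are illustrative.
import math

def ray(heading_deg):
    """Unit direction for a compass heading (0 = north/+y, 90 = east/+x)."""
    h = math.radians(heading_deg)
    return (math.sin(h), math.cos(h))

def rays_intersect(p1, h1, p2, h2, max_range_m=200.0):
    d1, d2 = ray(h1), ray(h2)
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:                 # parallel rays never meet
        return False
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom   # distance along ray 1
    t2 = (dx * d1[1] - dy * d1[0]) / denom   # distance along ray 2
    return 0 <= t1 <= max_range_m and 0 <= t2 <= max_range_m

def maybe_share_tags(photo_a, photo_b, max_seconds=30):
    if abs(photo_a["t"] - photo_b["t"]) > max_seconds:
        return
    if rays_intersect(photo_a["pos"], photo_a["heading"],
                      photo_b["pos"], photo_b["heading"]):
        photo_b.setdefault("tags", []).extend(photo_a.get("tags", []))

a = {"t": 0,  "pos": (0.0, 0.0),  "heading": 45.0, "tags": ["wedding toast"]}
b = {"t": 12, "pos": (50.0, 0.0), "heading": 315.0}
maybe_share_tags(a, b)
print(b.get("tags"))   # ['wedding toast'] - the rays cross in front of both cameras
```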

Anyway, lots of possibilities here. Lots of market potential. My guess is that Google has the right mindset to do it, but I wouldn’t count out Microsoft or Yahoo. Who knows.