Introduction

The next ``mundane'' task that we will look at is vision. People can easily make sense of what they see around them, easily recognising complex objects - it's something we learn when we are very young. However, like natural language understanding, this is extremely hard to automate. It requires knowledge of objects in the world (e.g., cats are furry and have a tail), knowledge of certain basic properties of the physical world (e.g., objects generally have continuous smooth surfaces), and knowledge of basic optics (e.g., image intensity depends on the reflectance of the object). Recognition of objects is further complicated by the fact that a single object may be viewed in many different ways, light and shadows may vary, other objects may be in front of it, and so on.

There are two main approaches to computational vision. The first is a practical approach, where we give up on the ultimate goal of recognising objects in general, in ``natural'' conditions, and aim for a more restricted goal, such as recognising objects composed of simple planar surfaces (i.e., block-like objects) under carefully set up lighting conditions. This approach may lead to worthwhile results for industrial applications - maybe it'll allow us to automate widget sorting - but is rather limited, and (it turns out) not easily extendible to more complex tasks.

The second approach is more directly concerned with modelling human vision. Much work in this area was done by David Marr, who looked at evidence from neurophysiology, psychophysics, and optics and developed a real theory of human visual processes. He clearly distinguished between the theory and any algorithms that were developed based on the theory. The theory would be concerned with uncovering the basic constraints underlying visual processes, and a variety of algorithms could be used to solve some sub-problem in vision given these constraints. (Marr's book (Vision, 1982) is always worth a read when one becomes cynical about the absence of solid theory in AI research.) Anyway, these two lectures will be primarily based on Marr's approach.

Computational vision involves, in a sense, the reverse process to certain tasks in graphics. In graphics we may want to start off with a 3-D model of an object (say, a widget) and end up with a picture of the widget in a particular orientation. We get from an object representation to an array of picture elements, with different grey levels (Oh, OK, colours, but for now we'll just worry about black and white images). In vision we start off with an array of picture elements (a grey level image) and we want to get to a 3-D model, and from that work out what sort of object it is.
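
To make the starting point concrete, here is a tiny sketch (in Python with NumPy, though any array library would do - the choice is just an assumption for this handout) of a grey level image as nothing more than a 2-D array of intensity values:

import numpy as np

# A toy 6x6 grey level image: intensities in 0..255,
# dark (10) on the left half, bright (200) on the right half.
# A real image would come from a camera or an image file.
image = np.array([
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
], dtype=np.uint8)

print(image.shape)   # (6, 6): one intensity value per picture element
print(image[2, 3])   # 200: the intensity at row 2, column 3

Vision starts from an array like this and has to work back towards the objects that produced it.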

The process of getting from grey level data to a 3-D object representation (and hence, we hope, a recognised object) can be divided into stages. These stages are analogous in many ways to the stages of processing for natural language understanding. However, there is more evidence in vision that the stages of processing really are separate - you do one, then use the results from that for the next, and so on.

So, in vision we start off with grey level data (i.e., an array of picture elements giving the image intensity at each point) and, step by step, build up representations that get us closer and closer to representations of real physical objects. The main representations used (in Marr's approach) are:

Grey Levels:
Image intensity values for each point in the image.
Primal Sketch:
Things like edges, blobs and groups of edges and blobs.
2½-D sketch:
Partitions the image into regions which correspond to real physical objects, and gives properties of those regions, such as surface orientation and depth from the viewer. It's really a viewer-centred 3-D(ish) representation.
3-D model representation:
A (usually hierarchical) representation of the object in terms of 3-D shape primitives (such as cones, cylinders, blocks, etc.).
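
One way to picture this chain of representations is as a sequence of functions, each consuming the previous representation and producing the next. The sketch below is purely schematic: the function names and the placeholder data structures are invented for this handout and simply stand in for whatever a real system would use.

import numpy as np

# Hypothetical stage functions for Marr's pipeline; each would be a
# substantial piece of machinery in a real vision system.

def primal_sketch(grey_levels: np.ndarray) -> dict:
    """Edges, blobs and groupings found in the intensity image."""
    return {"edges": [], "blobs": []}            # placeholder result

def two_and_a_half_d_sketch(sketch: dict) -> dict:
    """Viewer-centred surface orientation and depth for image regions."""
    return {"regions": [], "depths": []}         # placeholder result

def three_d_model(surfaces: dict) -> dict:
    """Object-centred model built from 3-D shape primitives."""
    return {"primitives": []}                    # placeholder result

# The stages run in sequence, each using only the output of the last.
grey_levels = np.zeros((256, 256), dtype=np.uint8)   # a blank test image
model = three_d_model(two_and_a_half_d_sketch(primal_sketch(grey_levels)))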

The first level of processing, from grey levels to primal sketch, involves edge detection - finding where there are sudden changes in intensity values, which are likely to correspond to edges of objects in the scene. This stage is probably the best-understood stage in vision (at least by me), so we will go into it in some detail. The next stage involves getting depth information. There are lots of ways we can work out depth information, but we will concentrate on one: stereopsis. In stereopsis, we use the fact that the images from each of our eyes are slightly different, and from this difference obtain depth information. (In a machine vision system we might have two cameras in different positions.) Stereopsis is an important source of depth information, and people with only one (working) eye have relatively poor depth perception. The last stages - building the 3-D model, and perhaps using it to recognise the object - are very hard, so I won't attempt to say much about them.
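
To give a flavour of the edge detection stage before we look at it properly, here is a minimal sketch: it simply marks picture elements where the intensity differs sharply from a neighbour. The finite-difference operator and the threshold value are chosen purely for illustration - real edge detectors, covered in the next section, smooth the image first and use rather better operators.

import numpy as np

def simple_edge_map(image: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Mark picture elements where intensity changes sharply.

    A crude illustration only; the threshold of 30 grey levels is an
    arbitrary choice for this handout.
    """
    img = image.astype(float)
    # Differences between horizontally and vertically adjacent pixels.
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, :-1] = img[:, 1:] - img[:, :-1]
    dy[:-1, :] = img[1:, :] - img[:-1, :]
    gradient_magnitude = np.sqrt(dx ** 2 + dy ** 2)
    return gradient_magnitude > threshold    # True where an edge is likely

# On the toy image earlier, the True entries fall exactly on the
# boundary between the dark and bright halves.

For stereopsis, the corresponding computational fact is that depth is inversely related to disparity: for two parallel cameras with focal length f and separation (baseline) b, a scene point whose positions in the two images differ by a disparity d lies at a depth of roughly f*b/d, so nearby points give large disparities and distant points give small ones.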




