Using stereopsis, motion, texture and (possibly) line labelling allows us to partition a scene into objects, and to obtain depth and orientation information for the surfaces of those objects. We now have to obtain a 3-D model from that, and then try to work out which known object(s) we are looking at. I won't attempt to describe this process in detail, just mention the sorts of 3-D object representations used, and some of the problems with matching these to models of actual known objects.
OK, so our language for representing 3-D objects will be in terms of shape primitives such as cones, cylinders, blocks etc. A typical shape (a banana?) might be:
shape55:
    shape:     cylinder
    end1:      shape23
    end2:      shape22
    length:    20cm
    width:     4cm
    curvature: 0.1
    colour:    yellow
    texture:   smooth

shape23:
    shape: cone
    etc
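A frame-style structure like this is easy to sketch in code. Here is a minimal Python version using nested dictionaries; the attribute names and values are just the illustrative ones above, and the helper function is my own assumption about how sub-shapes might be traversed:

```python
# A sketch of the frame-style shape representation above. Shapes live
# in a dictionary keyed by name; sub-shapes (the banana's conical
# ends) are referenced by name, which gives the hierarchical structure.

shapes = {
    "shape55": {
        "shape": "cylinder",
        "end1": "shape23",   # reference to a sub-shape
        "end2": "shape22",
        "length_cm": 20,
        "width_cm": 4,
        "curvature": 0.1,
        "colour": "yellow",
        "texture": "smooth",
    },
    "shape23": {"shape": "cone"},
    "shape22": {"shape": "cone"},
}

def sub_shapes(name):
    """Return the names of the shapes this shape refers to."""
    return [v for k, v in shapes[name].items()
            if k.startswith("end") and v in shapes]

print(sub_shapes("shape55"))  # → ['shape23', 'shape22']
```

Because sub-shapes are plain references, the same description language works for a single primitive or a whole assembly of them.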
To recognise the object as a banana we need a library of 3-D models. Bananas obviously come in a whole range of shapes and sizes, within certain limits, so our model should specify the range of possible values (e.g., 10cm-30cm), not a precise value. To recognise an object would then involve checking through the library of models to find one that fits. If there are only a few objects in your world (widgets and wodgets) this may be fairly straightforward. If there are lots, then we will need to worry about clever indexing schemes, and take advantage of the hierarchical structure of the models (e.g., checking the banana's basic (bent) cylindrical shape before worrying about its conical ends).
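The range-based matching just described can be sketched as follows. The model format (ranges for numeric attributes, exact values otherwise) follows the 10cm-30cm example above; the specific attribute ranges and the second "cucumber" model are invented purely for illustration:

```python
# A sketch of matching an observed shape against a library of models.
# Numeric attributes are given as (min, max) ranges; other attributes
# must match exactly. Attributes the model doesn't mention are ignored.

models = {
    "banana": {
        "shape": "cylinder",
        "length_cm": (10, 30),    # range of permissible lengths
        "width_cm": (2, 6),
        "curvature": (0.05, 0.2),
        "colour": "yellow",
    },
    "cucumber": {                  # hypothetical second model
        "shape": "cylinder",
        "length_cm": (15, 40),
        "width_cm": (2, 5),
        "curvature": (0.0, 0.05),
        "colour": "green",
    },
}

def matches(observed, model):
    """True if every attribute the model specifies is satisfied."""
    for attr, expected in model.items():
        value = observed.get(attr)
        if isinstance(expected, tuple):            # numeric range
            if value is None or not expected[0] <= value <= expected[1]:
                return False
        elif value != expected:                    # exact match
            return False
    return True

def recognise(observed):
    """Check through the library for models that fit."""
    return [name for name, m in models.items() if matches(observed, m)]

observed = {"shape": "cylinder", "length_cm": 20, "width_cm": 4,
            "curvature": 0.1, "colour": "yellow", "texture": "smooth"}
print(recognise(observed))  # → ['banana']
```

With a large library, one would test cheap, coarse attributes first (the basic cylindrical shape) and only descend to sub-shapes like the conical ends for the models that survive that filter.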