SENTIENCE
Stereo Vision System
Sentience is a stereoscopic vision and mapping system for mobile robots. It was developed initially as part of the Rodney humanoid robot project, and has been refined over several years. The system uses cheap, low resolution webcam technology to acquire images and calculate a depth map from them.

The main problem in stereo vision is finding matching features in the left and right camera images. This might seem simple to the uninitiated. As a human I can merely glance at two images and see which parts of them correspond, with virtually no mental effort. However, this is because our brains are superbly adapted for dealing with such everyday vision problems. For a computer the problem is much tougher and can only be solved with a lot of processing power. This is why, although various researchers, most notably Hans Moravec, have worked on the problem for the last quarter of a century, only now with the advent of multi-gigahertz processors are such systems beginning to become practical.
The initial version of the software used an algorithm based upon a system developed by Stan Birchfield. Images received from both cameras were pre-processed using a mean shift algorithm and the resulting horizontal segments were then matched. This worked up to a point, provided that the images were of good quality, but the depth accuracy was very poor and, even after optimisation using a genetic algorithm, false matches were frequent. The high rate of false matches meant that the system wasn't very useful except for telling whether there was a large object, such as a person, sitting in front of the robot.

The second version of Sentience, developed in 2004/5, used a more traditional style of algorithm. For each point in one of the images the classical stereo algorithm searches for a corresponding point within the other image, usually by matching small sub-regions called patches. However, this patch matching process is often highly unreliable since it involves matching pixel intensity values. Pixel intensities may appear to be the same in many areas of the same image, and may vary considerably between two different images due to slightly different camera optics and automatic brightness control. With traditional algorithms you are usually advised to switch off automatic brightness adjustment and perform an elaborate calibration, but even this does not help very much.
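To make the patch matching idea concrete, here is a minimal sketch of a classical scanline search using the sum of absolute differences (SAD) between patches. It is an illustration only, not the Sentience code: the function names, patch size and toy scanline are all assumptions.

```python
def sad(left_row, right_row, x_left, x_right, half):
    """Sum of absolute intensity differences between two 1-D patches."""
    return sum(
        abs(left_row[x_left + k] - right_row[x_right + k])
        for k in range(-half, half + 1)
    )


def best_disparity(left_row, right_row, x, half=2, max_disparity=8):
    """Disparity d that minimises SAD between left[x] and right[x - d]."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        if x - d - half < 0:
            break  # patch would fall off the left edge of the right image
        cost = sad(left_row, right_row, x, x - d, half)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Toy scanline (assumed data): one bright feature on a flat background.
left = [10] * 8 + [80, 200, 90] + [10] * 5
# The right camera sees the same scene shifted 3 pixels to the left.
right = left[3:] + [10, 10, 10]

print(best_disparity(left, right, 9))   # on the feature: 3
print(best_disparity(left, right, 13))  # on the flat background: 0
```

On the bright feature the search recovers the true shift of 3 pixels. On the flat background, disparities 0 to 3 all give a cost of zero, so the answer is arbitrary. This is the "same intensity in many areas" ambiguity described above.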
To summarise, one of the golden rules of computer vision is never to rely upon absolute pixel values. Anything which breaks this rule is bound to fail, or at best to behave in a very flaky manner under varying conditions. For the second version of Sentience I abandoned trying to match pixels or horizontal segment values altogether. Instead I had the system look for corresponding local orientations. At any point in the image you can examine the neighbouring region and look for the direction with the greatest or least intensity. Because this calculation is a relative one, it does not depend upon ambient lighting conditions or apparent colours. David Lowe has also used similar orientation features for successful stereo matching and object recognition.
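One way to picture a local orientation feature is sketched below. It sums the intensity along short rays leaving a point and reports the brightest direction. The ray length, the number of directions and the function name are assumptions for illustration; the actual Sentience detector may work differently.

```python
import math


def local_orientation(image, x, y, radius=2, directions=8):
    """Index of the direction whose short ray from (x, y) is brightest.

    Direction 0 points along +x; indices increase towards +y.
    `image` is a list of rows of intensity values.
    """
    best_dir, best_sum = 0, float("-inf")
    for i in range(directions):
        angle = 2 * math.pi * i / directions
        total = 0
        for r in range(1, radius + 1):
            px = x + int(round(r * math.cos(angle)))
            py = y + int(round(r * math.sin(angle)))
            total += image[py][px]
        if total > best_sum:
            best_dir, best_sum = i, total
    return best_dir


# Toy 5x5 image (assumed data) which brightens mainly towards the right.
image = [[10 * x + 3 * y for x in range(5)] for y in range(5)]
# The same scene seen with a different gain and brightness offset.
brighter = [[2 * v + 50 for v in row] for row in image]

print(local_orientation(image, 2, 2), local_orientation(brighter, 2, 2))
```

Both calls return the same direction index. A positive gain and a constant offset change every ray sum in the same way, so the ranking of directions is unaffected, provided that no pixel values are clipped.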
It's also interesting to note that much of the initial stage of visual processing in the brain is dedicated to recognising different local orientations. It's likely that such features have been selected through millions of years of evolution as particularly useful for recognising things in an efficient and robust way. For a good introduction to biological vision see lectures 3 and 4 of Christof Koch's online video course.

One popular saying is that the eyes are the windows to the soul. I like to say that if this is true then the soul must indeed be a machine, since a large part of the way our eyes move and perceive things is completely automatic and unconscious. Another addition to the current version of the software is the idea of eye motion. When we look at something, such as a beer bottle or a computer screen, we may think that our eyes aren't doing much other than looking forwards, but in fact they are darting about at least twice every second. Quite why our eyes do this isn't altogether clear, but one of the main purposes of this movement is to repeatedly sample the visual scene from very slightly different points of view, and thus enable a mental image to be built up which has a resolution considerably higher than that of the optical sensors on the retina. According to Koch, this allows the human visual system to have a resolution right down to the level of an individual photon. My system isn't quite that accurate, but it does use multiple sampling to help resolve some of the ambiguities within both camera images.

No matter how good any system is, errors always crop up. In life there is always a difference between how things should be and how they actually are. I've worked in industrial automation for a long time and most of my job is really all about reducing the margin between fantasy and reality (servo control). In the Sentience system there is inevitably some margin of error when local orientations are calculated. In order to reduce this as much as possible the system re-samples each image with a small variable offset of a few pixels. Detected orientations are then averaged over a few frames, which gives a more reliable result and makes it less likely that small variations in pixel intensities will affect performance.

The resulting system, whilst still far from perfect, is an improvement upon the previous version (especially with variations in lighting). In the animation below you can see what happens when the robot rotates its head from left to right. I have entered an arbitrary distance threshold such that anything within a metre or so is displayed. This is useful for separating the foreground from the background. Things observed within the foreground are typically items of immediate interest, such as toys or people with whom the robot may interact.
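Averaging orientations needs a little care because angles wrap around. The sketch below uses a circular mean, which is one reasonable choice but is an assumption about how the averaging might be done. The `measure` callback is a hypothetical function that returns the orientation detected at a given pixel.

```python
import math


def mean_orientation(angles):
    """Circular mean of angles in radians (safe across the 0/360 wrap)."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)


def jittered_orientation(measure, x, y, offsets):
    """Average measure(x + dx, y + dy) over one small offset per frame."""
    return mean_orientation([measure(x + dx, y + dy) for dx, dy in offsets])


# Four noisy readings (assumed data) scattered around 0 degrees.
readings = [math.radians(a) for a in (350, 10, 5, 355)]

print(math.degrees(mean_orientation(readings)))       # circular mean
print(math.degrees(sum(readings) / len(readings)))    # naive mean
```

The circular mean comes out at essentially zero degrees, whereas the naive arithmetic mean of the same readings is 180 degrees, pointing in exactly the wrong direction.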
The system uses a pair of Pyro 1394 webcams mounted on a moveable humanoid head. Each camera has independent lateral movement and both can be rotated together vertically. In addition the neck of the robot can be rotated horizontally and vertically. In practice, for testing the stereo vision I keep the cameras fixed in parallel and only use the neck pan/tilt movements. The two cameras are mounted with a distance of 7cm between their centres, approximately the same as the spacing between an adult human's eyes, although their field of view is narrower at about 40 degrees.
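With parallel cameras, disparity can be converted to distance using the standard pinhole relation Z = f * B / d. The sketch below plugs in the 7cm baseline and 40 degree field of view quoted above. The 320 pixel image width is an assumption, as is treating the 40 degrees as the horizontal field of view.

```python
import math


def focal_length_px(image_width_px, fov_degrees):
    """Focal length in pixels for an ideal pinhole camera."""
    return (image_width_px / 2) / math.tan(math.radians(fov_degrees) / 2)


def depth_m(disparity_px, image_width_px=320, fov_degrees=40.0, baseline_m=0.07):
    """Distance in metres for a given disparity, assuming parallel cameras."""
    if disparity_px <= 0:
        return math.inf  # zero disparity corresponds to a point at infinity
    f = focal_length_px(image_width_px, fov_degrees)
    return f * baseline_m / disparity_px


for d in (8, 16, 31, 62):
    print(d, round(depth_m(d), 2))
```

Under these assumptions a disparity of about 31 pixels corresponds to roughly one metre. A foreground threshold like the one used for the animation could therefore be expressed as a minimum disparity.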
On a PC with an AMD3000 processor and 1GB RAM the stereo algorithm runs at speeds ranging from 20Hz to 0.5Hz, depending upon the sub-sampling resolution and maximum disparity settings. If the images are sub-sampled down to a very small 120x60 pixel resolution then the system can run at >20Hz, but at this image size the depth information returned is little better than using low resolution active sensors such as infrared or sonar.
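The loss of depth information at low resolution can be estimated with a little arithmetic. The sketch below lists the distances which whole-pixel disparities can represent for a 120 pixel wide image. It assumes a pinhole model, a 40 degree horizontal field of view, a 7cm baseline and no sub-pixel interpolation, none of which is confirmed for the real system.

```python
import math


def depth_steps(image_width_px, max_disparity, fov_degrees=40.0, baseline_m=0.07):
    """Distances (metres) representable by disparities 1..max_disparity."""
    f = (image_width_px / 2) / math.tan(math.radians(fov_degrees) / 2)
    return [f * baseline_m / d for d in range(1, max_disparity + 1)]


steps = depth_steps(120, 16)
for d, z in enumerate(steps, start=1):
    print(d, round(z, 2))
```

With these numbers, neighbouring disparities of 5 and 6 pixels are nearly 40cm apart at around two metres. This is consistent with the observation that very small images give little more than a coarse range reading.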
Blurry looking depth maps are all very well, but before you start feeling too seasick this data needs to be re-arranged into something which is more useful for identification or navigation purposes. To achieve this the two dimensional maps are projected into a three dimensional space known as an occupancy grid. Each point in the grid, known as a cell, has a number of properties, including an estimate of the probability that the location is occupied and, if so, what its colour is. By successive samplings, as the robot moves its head around or moves through its environment, a three dimensional model can be built up.
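A minimal sketch of such a grid is shown below. It stores cells sparsely in a dictionary and accumulates evidence as log-odds, the textbook approach described in Probabilistic Robotics, which may not be what Sentience itself does. The cell size, the evidence increment and all of the names are assumptions, and the free-space update along each ray is left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    log_odds: float = 0.0            # 0.0 means probability 0.5 (unknown)
    colour: tuple = (0.0, 0.0, 0.0)  # running mean of observed RGB values
    hits: int = 0

    @property
    def probability(self):
        """Occupancy probability recovered from the log-odds value."""
        return 1.0 - 1.0 / (1.0 + math.exp(self.log_odds))


class OccupancyGrid:
    def __init__(self, cell_size_m=0.05, hit_log_odds=0.85):
        self.cell_size_m = cell_size_m
        self.hit_log_odds = hit_log_odds
        self.cells = {}              # (ix, iy, iz) -> Cell

    def index(self, x, y, z):
        """Integer cell coordinates containing the point (x, y, z) in metres."""
        s = self.cell_size_m
        return (math.floor(x / s), math.floor(y / s), math.floor(z / s))

    def add_point(self, x, y, z, colour):
        """Register one observed 3-D point as evidence of occupancy."""
        cell = self.cells.setdefault(self.index(x, y, z), Cell())
        cell.log_odds += self.hit_log_odds
        cell.hits += 1
        cell.colour = tuple(
            old + (new - old) / cell.hits for old, new in zip(cell.colour, colour)
        )
        return cell


grid = OccupancyGrid()
grid.add_point(0.52, 0.00, 1.01, (200, 0, 0))
cell = grid.add_point(0.53, 0.01, 1.02, (100, 0, 0))  # falls in the same cell
print(round(cell.probability, 3), cell.colour)
```

Each repeated observation pushes the occupancy probability of a cell further from 0.5, and the colour settles towards the average of what was seen there.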
Below you can see some preliminary results of the formation of an occupancy grid surrounding the robot as it sits in a static position on the desk. As it pans its head from right to left, successive stereo depth maps are projected into three dimensional space and these three dimensional coordinates are then registered together in real time (ish). The quality of the grid formed is quite poor and at present certainly couldn't be used for object recognition, but it is useful for general familiarisation with the local spatial environment.
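The projection step can be sketched as follows. A depth-map pixel is turned into a point in the camera frame and then rotated by the current neck pan angle, so that points from different head positions share one coordinate frame. The image size, the axis convention (x right, y down, z forward, positive pan turning towards +x) and the function name are assumptions. Neck tilt and the offset between the cameras and the pan axis are ignored.

```python
import math


def pixel_to_world(u, v, depth_m, pan_degrees,
                   width=320, height=240, fov_degrees=40.0):
    """3-D position in the robot frame of pixel (u, v) seen at depth_m."""
    f = (width / 2) / math.tan(math.radians(fov_degrees) / 2)
    # Back-project through an ideal pinhole camera.
    x = (u - width / 2) * depth_m / f
    y = (v - height / 2) * depth_m / f
    z = depth_m
    # Rotate about the vertical axis by the pan angle.
    t = math.radians(pan_degrees)
    return (x * math.cos(t) + z * math.sin(t),
            y,
            -x * math.sin(t) + z * math.cos(t))


print(pixel_to_world(160, 120, 1.0, 0))    # straight ahead of the robot
print(pixel_to_world(160, 120, 1.0, 90))   # head turned fully to the right
```

A point at the image centre one metre away lies on the forward axis when the head faces forwards, and on the sideways axis once the head has panned through 90 degrees. Points produced this way can be accumulated directly into a shared grid.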
One advantage of using three dimensional occupancy grids is that it is possible for the robot to imagine (and here I use the word imagine in a very narrow sense) the scene from viewpoints which it has not directly experienced.
The next stage is to try to improve the quality of the occupancy grid data, such that it could be suitable for certain types of object recognition. Another avenue is to give the robot the ability to recognise certain locations, something known in the literature as the kidnapped robot problem. Presently Rodney is a desk-bound robot with no legs or wheels. However, this system would be well suited to a mobile robot moving through space and building up a map. At a later stage I might try using one of the WhiteBox robots as a test platform for this kind of development. It would also be interesting to install a couple of cameras into the dashboard of a car and try to build up a three dimensional model as you drive along. Such a system could be used for automatic collision avoidance, or be combined with GPS data to form three dimensional street maps.
For source code and demo programs see the Sentience Google Code site.
Towards General Purpose Vision, Bob Mottram
Robot Spatial Perception by Stereoscopic Vision and 3D Evidence Grids, Hans Moravec
Stereo Vision and Occupancy Grids, Don Murray
Probabilistic Robotics, Sebastian Thrun, Wolfram Burgard and Dieter Fox