SENTIENCE

Stereo Vision System

 


 


A wall or desktop mounted version of the Sentience stereo vision system, used to detect people in 3D.

Introduction

Sentience is a stereoscopic vision and mapping system for mobile robots.  It was developed initially as part of the Rodney humanoid robot project, and has been refined over several years.  The system uses cheap low resolution webcam technology to acquire images and calculate a depth map from them.

The main problem in stereo vision is finding matching features in both the left and right camera images.  This might seem simple to the uninitiated.  As a human I can merely glance at two images and see which parts of them correspond with virtually no mental effort.  However, this is because our brains are superbly adapted for dealing with such everyday vision problems.  For a computer the problem is much tougher and can only be solved with a lot of processing power.  This is why, although various researchers, most notably Hans Moravec, have worked on the problem for the last quarter of a century, only now with the advent of multi-gigahertz processors are such systems beginning to become practical.


An early version of Sentience

The initial version of the software used an algorithm based upon a system developed by Stan Birchfield.  Images received from both cameras were pre-processed using a mean shift algorithm and the resulting horizontal segments were then matched.  This worked up to a point, provided that the images were of good quality, but the depth accuracy was very poor and, even after optimisation using a genetic algorithm, false matches were frequent.  The high rate of false matches meant that the system wasn't very useful except for telling whether there was a large object, such as a person, sitting in front of the robot.

The second version of Sentience, developed in 2004/5, used a more traditional style of algorithm.  For each point in one of the images the classical stereo algorithm searches for a corresponding point within the other image, usually by matching small sub-regions called patches.  However, this patch matching process is often highly unreliable since it involves matching pixel intensity values.  Pixel intensities may appear to be the same in many areas of the same image, and may vary considerably between two different images due to slightly different camera optics and automatic brightness control.  With traditional algorithms you are usually advised to switch off automatic brightness adjustment and perform an elaborate calibration, but even this does not help very much.
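To make the weakness of intensity matching concrete, here is a minimal sketch of classical patch matching along a single image row, using the sum of absolute differences (SAD) as the cost.  It is an illustration only and is not taken from the Sentience source; the function name, the one dimensional patch and the toy pixel values are all assumptions.

```python
def match_patch(left_row, right_row, x, half_width, max_disparity):
    """Return (disparity, cost) minimising the sum of absolute differences.

    Assumes rectified images, so a feature at column x in the left row
    appears at column x - disparity in the right row.
    """
    patch = left_row[x - half_width : x + half_width + 1]
    best_disparity, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        start = x - d - half_width
        if start < 0:
            break  # candidate patch would fall off the left edge
        candidate = right_row[start : start + len(patch)]
        cost = sum(abs(a - b) for a, b in zip(patch, candidate))
        if cost < best_cost:
            best_disparity, best_cost = d, cost
    return best_disparity, best_cost


# Toy data: a bright feature on a dark background, shifted by 3 pixels.
left_row = [10, 10, 10, 10, 10, 10, 80, 200, 90, 10, 10, 10]
right_row = left_row[3:] + [10, 10, 10]
print(match_patch(left_row, right_row, 7, 1, 6))        # (3, 0)

# The same scene with the right camera 40 grey levels brighter.
right_brighter = [min(255, v + 40) for v in right_row]
print(match_patch(left_row, right_brighter, 7, 1, 6))   # (3, 120)
```

In this toy case the correct disparity still wins, but the cost of the true match has risen from 0 to 120 purely because of a brightness difference.  In a real image with many similar looking regions, that margin is easily small enough for a wrong candidate to score better.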


The dynamic duo


A wooden mannequin standing on an old Moravec book is used as an impromptu camera stand.

To summarise, one of the golden rules of computer vision is never rely upon absolute pixel values.  Anything which breaks this rule is bound to fail, or at best behave in a very flaky manner under varying conditions.

For the second version of Sentience I abandoned trying to match pixels or horizontal segment values altogether.  Instead I had the system look for corresponding local orientations.  At any point in the image you can examine the neighbouring region and look for the direction with the greatest or least intensity.  Because this calculation is a relative one it does not depend upon ambient lighting conditions or apparent colours.  David Lowe has also used similar orientation features for successful stereo matching and object recognition.
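The idea of a relative measurement can be sketched as follows.  This is not the Sentience implementation: the eight direction sampling scheme, the function name and the radius are assumptions chosen to keep the example short.  It simply finds the direction of greatest intensity around a point and shows that the answer does not change when the whole image is made brighter.

```python
# Eight directions as (dx, dy) steps in image coordinates (y grows downwards):
# east, south-east, south, south-west, west, north-west, north, north-east.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def dominant_direction(image, x, y, radius=2):
    """Index into DIRECTIONS of the ray from (x, y) with the greatest summed intensity.

    The caller must keep (x, y) at least `radius` pixels away from the border.
    """
    sums = []
    for dx, dy in DIRECTIONS:
        total = 0
        for step in range(1, radius + 1):
            total += image[y + step * dy][x + step * dx]
        sums.append(total)
    return max(range(len(DIRECTIONS)), key=lambda i: sums[i])


# A 5x5 test image that gets brighter towards the bottom right.
image = [[10 * x + 3 * y for x in range(5)] for y in range(5)]

# The same image seen through a camera with a different gain and offset.
brighter = [[2 * v + 50 for v in row] for row in image]

print(dominant_direction(image, 2, 2))     # 1 (south-east)
print(dominant_direction(brighter, 2, 2))  # 1 (unchanged)
```

Any change of the form gain * value + offset, with a positive gain, preserves the ordering of the ray sums, so the detected direction is the same in both images.  An absolute intensity comparison between the two would fail completely.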


Orientation columns in the V1 primary visual cortex of a monkey

It's also interesting to note that many of the initial stages of visual processing in the brain are dedicated to recognising different local orientations.  It's likely that such features have been selected through millions of years of evolution as particularly useful for recognising things in an efficient and robust way.  For a good introduction to biological vision see lectures 3 and 4 of Christof Koch's online video course.

One popular saying is that the eyes are the windows to the soul.  I like to say that if this is true then the soul must indeed be a machine, since a large part of the way our eyes move and perceive things is completely automatic and unconscious.

Another addition to the current version of the software is the idea of eye motion.  When we look at something, such as a beer bottle or a computer screen, we may think that our eyes aren't doing much other than looking forwards, but in fact they are darting about at least twice every second.  Quite why our eyes do this isn't altogether clear, but one of the main purposes of this movement is to repeatedly sample the visual scene from very slightly different points of view, and thus enable a mental image to be built up which has a resolution considerably higher than that of the optical sensors on the retina.  According to Koch, this allows the human visual system to have a resolution right down to the level of an individual photon.  My system isn't quite that accurate, but it does use multiple sampling to help to resolve some of the ambiguities within both camera images.

No matter how good any system is, errors always crop up.  In life there is always a difference between how things should be and how they actually are.  I've worked in industrial automation for a long time and most of my job is really all about reducing the margin between fantasy and reality (servo control).  In the Sentience system, when local orientations are calculated there is inevitably some margin of error.  In order to reduce this as much as possible, the system re-samples each image with a small variable offset of a few pixels.  Detected orientations are then averaged over a few frames, which gives a more reliable result and makes it less likely that small variations in pixel intensities will affect performance.
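One detail worth illustrating is how angles can be averaged.  The sketch below is an assumption about one reasonable way to do it, not a description of what Sentience actually does: it averages direction estimates, such as those gathered from several slightly offset re-samplings, as unit vectors, so that values either side of 0 degrees do not average to something near 180.

```python
import math


def mean_direction(angles_degrees):
    """Circular mean of direction estimates, returned in the range [0, 360]."""
    sum_sin = sum(math.sin(math.radians(a)) for a in angles_degrees)
    sum_cos = sum(math.cos(math.radians(a)) for a in angles_degrees)
    return math.degrees(math.atan2(sum_sin, sum_cos)) % 360.0


# Hypothetical estimates of the same feature from four jittered samples.
samples = [350.0, 10.0, 5.0, 355.0]

naive = sum(samples) / len(samples)   # 180.0, pointing the wrong way entirely
robust = mean_direction(samples)      # very close to 0 (equivalently 360) degrees
print(naive, robust)
```

If the features are undirected orientations (where 0 and 180 degrees mean the same thing) the usual trick is to double each angle before averaging and halve the result afterwards.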

The resulting system, whilst still far from perfect, is an improvement upon the previous version (especially with variations in lighting).  In the animation below you can see what happens when the robot rotates its head from left to right.  I have entered an arbitrary distance threshold such that anything within a metre or so is displayed.  This is useful for separating the foreground from the background.  Things observed within the foreground are typically items of immediate interest, such as toys, or people with whom the robot may interact.


Extraction of the foreground as the robot looks left and right

 

 

 

Tests

The system uses a pair of Pyro 1394 webcams mounted on a moveable humanoid head.  Each camera has independent lateral movement and both can be rotated together vertically.  In addition, the neck of the robot can be rotated horizontally and vertically.  In practice, for testing the stereo vision I keep the cameras fixed in parallel and only use the neck pan/tilt movements.  The two cameras are mounted with a distance of 7cm between their centres - approximately the same as an adult human - although their field of view is narrower at about 40 degrees.
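The 7cm baseline and 40 degree field of view are enough to relate disparity to distance.  The sketch below assumes an ideal pinhole model with parallel cameras and a hypothetical image width of 320 pixels (the width is not stated here); the function names are illustrative only.  It also shows what disparity corresponds to the one metre foreground threshold mentioned earlier.

```python
import math


def focal_length_pixels(image_width, fov_degrees):
    """Focal length in pixels for a pinhole camera with the given horizontal field of view."""
    return (image_width / 2.0) / math.tan(math.radians(fov_degrees) / 2.0)


def depth_from_disparity(disparity, focal_px, baseline_m):
    """Distance in metres to a point seen with the given disparity in pixels."""
    if disparity <= 0:
        return float("inf")  # zero disparity means the point is effectively at infinity
    return focal_px * baseline_m / disparity


def disparity_for_depth(depth_m, focal_px, baseline_m):
    """Disparity in pixels produced by a point at the given distance."""
    return focal_px * baseline_m / depth_m


BASELINE_M = 0.07       # camera separation from the text
FOV_DEGREES = 40.0      # field of view from the text
IMAGE_WIDTH = 320       # assumed for illustration

focal = focal_length_pixels(IMAGE_WIDTH, FOV_DEGREES)   # roughly 440 pixels
threshold = disparity_for_depth(1.0, focal, BASELINE_M)  # roughly 31 pixels
print(focal, threshold, depth_from_disparity(31, focal, BASELINE_M))
```

Under these assumptions anything with a disparity of about 31 pixels or more lies within a metre.  Because depth is inversely proportional to disparity, a one pixel matching error matters little for nearby objects but a great deal for distant ones.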

 


An early version of the Rodney robot with a stereo pan/tilt head using Logitech Quickcams


Stereo camera system

On a PC with an AMD3000 processor and 1GB RAM the stereo algorithm runs at speeds ranging from 20Hz to 0.5Hz, depending upon the sub-sampling resolution and maximum disparity settings.  If the images are sub-sampled down to a very small 120x60 pixel resolution then the system can run at >20Hz, but at this image size the depth information returned is little better than using low resolution active sensors such as infrared or sonar.


Depth maps calculated at low resolution (left) running at 16Hz and high resolution (right) running at 4Hz.  Lighter intensities indicate areas which are closer to the cameras.

 


Picking out objects at different distances

 


The thin leg of my desk is easily detected by stereo.  This would normally only be possible to do reliably using an expensive scanning laser rangefinder.

Blurry looking depth maps are all very well, but before you start feeling too seasick this data needs to be re-arranged into something which is more useful for identification or navigation purposes.  To achieve this the two dimensional maps are projected into a three dimensional space known as an occupancy grid.  Each point in the grid, known as a cell, has a number of properties including an estimate of the probability that the location is occupied and, if so, what its colour is.  By successive samplings, as the robot moves its head around or moves through its environment, a three dimensional model can be built up.
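A minimal sketch of such a grid is shown below.  It uses the log-odds update which is common in the occupancy grid literature; whether Sentience uses this exact update is an assumption, as are the class name, the 5cm cell size and the 0.7 hit probability.  Each cell stores an occupancy estimate and a running average colour, as described above.

```python
import math


class OccupancyGrid:
    """Sparse 3D occupancy grid storing log-odds and mean colour per cell."""

    def __init__(self, cell_size=0.05, hit_probability=0.7):
        self.cell_size = cell_size
        # Evidence added to a cell each time a stereo point falls inside it.
        self.hit_log_odds = math.log(hit_probability / (1.0 - hit_probability))
        self.cells = {}  # (ix, iy, iz) -> (log_odds, hits, (r, g, b))

    def cell_index(self, x, y, z):
        s = self.cell_size
        return (math.floor(x / s), math.floor(y / s), math.floor(z / s))

    def add_observation(self, point, colour):
        """Register one 3D point (metres) with its observed RGB colour."""
        key = self.cell_index(*point)
        log_odds, hits, mean = self.cells.get(key, (0.0, 0, (0.0, 0.0, 0.0)))
        hits += 1
        mean = tuple(m + (c - m) / hits for m, c in zip(mean, colour))
        self.cells[key] = (log_odds + self.hit_log_odds, hits, mean)

    def probability(self, point):
        """Occupancy probability at a point; unobserved cells return 0.5."""
        entry = self.cells.get(self.cell_index(*point))
        if entry is None:
            return 0.5
        return 1.0 - 1.0 / (1.0 + math.exp(entry[0]))


grid = OccupancyGrid()
spot = (0.52, 0.10, 0.98)
grid.add_observation(spot, (200, 30, 30))
grid.add_observation(spot, (180, 50, 30))
print(grid.probability(spot))  # rises from 0.5 towards 1 with each observation
```

A fuller version would also lower the probability of cells along the ray between the camera and the observed point, since the camera has evidently seen straight through them.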


Two dimensional observations made from Rodney's cameras are mapped into a three dimensional coordinate frame.

 

Below you can see some preliminary results of the formation of an occupancy grid surrounding the robot as it sits in a static position on the desk.  As it pans its head right to left, successive stereo depth maps are projected into three dimensional space and these three dimensional coordinates are then registered together in real time (ish).  The quality of the grid formed is quite poor and at present certainly couldn't be used for object recognition, but it is useful for general familiarisation with the local spatial environment.
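The projection step can be sketched like this, assuming a pinhole camera and a head which only pans about a vertical axis passing through the camera.  Tilt, and the offset between the neck axis and the cameras, are ignored; the function name, axis conventions and sign of the pan angle are all assumptions.

```python
import math


def pixel_to_world(u, v, depth, width, height, focal_px, pan_degrees):
    """Map an image pixel with known depth (metres) into a fixed world frame.

    Camera frame: x to the right, y downwards, z forwards along the optical axis.
    A positive pan angle is taken to turn the head towards +x.
    """
    # Back-project the pixel through the pinhole model.
    x = (u - width / 2.0) * depth / focal_px
    y = (v - height / 2.0) * depth / focal_px
    z = depth
    # Rotate about the vertical axis by the neck pan angle.
    a = math.radians(pan_degrees)
    world_x = x * math.cos(a) + z * math.sin(a)
    world_z = -x * math.sin(a) + z * math.cos(a)
    return (world_x, y, world_z)


# The centre pixel at one metre, with the head facing forwards and then panned.
print(pixel_to_world(160, 120, 1.0, 320, 240, 440.0, 0.0))
print(pixel_to_world(160, 120, 1.0, 320, 240, 440.0, 90.0))
```

Points from successive depth maps, each transformed with the pan angle at which it was captured, then land in a common frame and can be accumulated into the grid.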

 


The view from the robot's left camera showing my desk with a number of objects on it.  You can just about see Flint's head peeping up from behind the desk.

 


The occupancy grid.  The dark blob to the left is me sitting close to the computer keyboard.


An overhead view of the occupancy grid formed as the robot's head pans around.  The green circle represents the position of the robot, and you can just about make out the red, yellow and black objects on the desk.

One advantage of using three dimensional occupancy grids is that it is possible for the robot to imagine (and here I use the word imagine in a very narrow sense) the scene from viewpoints which it has not directly experienced.  
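The simplest example of such an unseen viewpoint is the overhead view.  As a sketch, assuming occupied cells are stored as integer (ix, iy, iz) indices with y as the vertical axis (an assumption, along with the function name), the grid can be collapsed along the vertical axis to give a plan view which the cameras never observed directly.

```python
def overhead_view(occupied_cells, width, depth):
    """Collapse occupied (ix, iy, iz) cells along the vertical axis.

    Returns a depth-by-width list of lists in which each entry counts the
    occupied cells stacked above that floor position.
    """
    view = [[0] * width for _ in range(depth)]
    for ix, iy, iz in occupied_cells:
        if 0 <= ix < width and 0 <= iz < depth:
            view[iz][ix] += 1
    return view


# Two cells stacked vertically at one position, and a single cell elsewhere.
cells = {(1, 0, 2), (1, 1, 2), (3, 0, 0)}
for row in overhead_view(cells, 4, 3):
    print(row)
```

Rendering from an arbitrary viewpoint works on the same principle, with a perspective projection of each occupied cell in place of the simple vertical collapse.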


Rodney views objects on the desktop.

 

 


Working out the tilt angle of the desktop surface relative to the robot's head.

 


Dense 3D grids are produced at 10Hz in a non-egocentric coordinate frame

 

 

 

 

Future Developments

The next stage is to try to improve the quality of the occupancy grid data, such that it could be suitable for certain types of object recognition.  Another avenue is to give the robot the ability to recognise certain locations - something known in the literature as the kidnapped robot problem.

Presently Rodney is a desk bound robot with no legs or wheels.  However, this system would be well suited to a mobile robot moving through space and building up a map.  At a later stage I might try using one of the WhiteBox robots as a test platform for this kind of development.  It would also be interesting to install a couple of cameras into the dashboard of a car and try to build up a three dimensional model as you drive along.  Such a system could be used for automatic collision avoidance, or be combined with GPS data to form three dimensional street maps.

 

 

 

Source

For source code and demo programs see the Sentience Google Code site.


 

 

Related Articles

Towards General Purpose Vision, Bob Mottram

Robot Spatial Perception by Stereoscopic Vision and 3D Evidence Grids, Hans Moravec

Stereo Vision and Occupancy Grids, Don Murray

Probabilistic Robotics by Sebastian Thrun, Wolfram Burgard and Dieter Fox