Bottom-Up Visual Attention for Virtual Human Animation
In Computer Animation for Social Agents (CASA) 2003
Submitted December 2002, published May 2003.

Download paper: casa2003.pdf
The bottom-up model of attention is based on work done at the iLab at the University of Southern California.


Overview

Animating virtual humans in a plausible way is a complicated problem. Conventional techniques fall short of providing a sense of presence when dealing with agents, who do not appear to respond in a human-like manner to environmental stimuli and internal goals. This project seeks to address the problem by using research from cognitive psychology and engineering to model aspects of human vision, memory and attention for the purpose of providing internally driven, autonomous attention behaviours. Behaviours driven in such a way serve both aesthetic and functional needs, forming the basis of a more general framework envisaged for use with a cognitive modelling paradigm.

 


Synthetic Vision

The synthetic vision component takes three renderings of the scene: a 256x256 full-scene rendering, a 128x128 false-colour foveal rendering and a 128x128 false-colour peripheral rendering.
The images below show samples of these renderings for two scenes. The rightmost image in each case consists of the foveal rendering superimposed on top of the peripheral rendering.





Bottom-up Visual Attention

This component helps us to answer the question "Where is the agent looking?" It uses a biologically plausible algorithm to calculate salient parts of the scene in a bottom-up manner, that is, parts of the scene that may 'pop out' at the viewer.





The top row of pictures depicts the original view on the retina of the agent, the intensity channel used for the intensity and orientation calculations, and the Red-Green and Blue-Yellow opponency channels used for the colour contrast calculation. The second row of images shows the final intensity, orientation and colour conspicuity maps respectively. The rightmost images are the overall saliency map and a view of the saliency map superimposed over the original image.
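As a rough illustration of how the intensity and colour opponency channels can be derived from the retinal image, the sketch below uses the channel definitions commonly cited for the iLab saliency model. It is a minimal sketch, not the project's implementation: the function name `opponency_channels` is hypothetical, the input is assumed to be an RGB array with values in [0, 1], and the normalisation of colour by intensity used in some formulations is omitted.

```python
import numpy as np

def opponency_channels(rgb):
    """Split an RGB image (H x W x 3, floats in [0, 1]) into intensity,
    Red-Green and Blue-Yellow opponency channels (illustrative only)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    # Broadly tuned colour channels; negative responses are clipped to zero.
    red = np.clip(r - (g + b) / 2.0, 0.0, None)
    green = np.clip(g - (r + b) / 2.0, 0.0, None)
    blue = np.clip(b - (r + g) / 2.0, 0.0, None)
    yellow = np.clip((r + g) / 2.0 - np.abs(r - g) / 2.0 - b, 0.0, None)
    return intensity, red - green, blue - yellow

# Assumed test input: a 128x128 retina that is black except for a pure red square.
retina = np.zeros((128, 128, 3))
retina[32:64, 32:64] = [1.0, 0.0, 0.0]
intensity, rg, by = opponency_channels(retina)
```

In this toy input the red square produces a strong Red-Green response and no Blue-Yellow response, which is the kind of colour contrast the conspicuity maps are built from.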

Memory

Every object and group observation in the scene contains an uncertainty level representing the agent's knowledge about that object or group. These uncertainty levels modulate the output of the attention component, directing attention towards parts of the scene where both the saliency and the uncertainty are high.

The example above illustrates 14 separate attention 'snapshots' of the pub scene. On the left of each snapshot is the false-colour view from the fovea as it moves around the scene, while the corresponding object-based memory uncertainty levels are shown on the right. Lighter values indicate higher uncertainty levels. Initially, most of the scene is white, since the agent is not familiar with the surroundings. As the fovea moves around the scene, objects are entered into short-term memory and the agent thus becomes more familiar with them. In this example, the bottom-up attention algorithm evaluates the scene only once at the start to obtain salient locations in the visual field. Also, the memory information is not used to modulate the attention output.



The memory uncertainty levels for the visual field are calculated at the resolution of the retinal image (128x128 pixels) and must be resized to the resolution of the saliency map (16x16 pixels). The final memory value for each area of the resized map is calculated as the sum of the uncertainty levels of all objects contained in that area, weighted by the number of pixels in the area that represent each object. The saliency map is modulated by the memory map using point-by-point multiplication. An example of this process is shown above. From left to right, the figures depict: peripheral and foveal view of the scene, unmodulated saliency map, memory uncertainty map, and final modulated saliency map. In that example, the agent has been previously familiarised with the sky and, to a lesser extent, the footpath. Because of this, salient locations in the sky and path are inhibited and elicit less attention.
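The resizing and modulation steps described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the project's code: the names `memory_map` and `modulate` are hypothetical, the false-colour rendering is assumed to have already been decoded into one integer object ID per pixel, uncertainties are assumed to lie in [0, 1], and the pixel-count weighting is normalised by the area size so that the result stays in the same range.

```python
import numpy as np

def memory_map(object_ids, uncertainty, cells=16):
    """Reduce per-pixel object IDs to a cells x cells uncertainty map.

    object_ids  : (H, W) int array, one object ID per retinal pixel
    uncertainty : 1-D float array, uncertainty[i] for object ID i
    """
    h, w = object_ids.shape
    bh, bw = h // cells, w // cells
    # Look up the uncertainty of the object visible at each pixel.
    per_pixel = uncertainty[object_ids]
    # Averaging each block equals summing the object uncertainties weighted
    # by the fraction of the block's pixels that each object covers.
    return per_pixel.reshape(cells, bh, cells, bw).mean(axis=(1, 3))

def modulate(saliency, memory):
    """Point-by-point multiplication of the saliency and memory maps."""
    return saliency * memory

# Assumed toy scene: object 1 (already familiar, uncertainty 0) fills the
# top 60 rows; object 0 (unknown, uncertainty 1) fills the rest.
ids = np.zeros((128, 128), dtype=int)
ids[:60, :] = 1
levels = np.array([1.0, 0.0])
memory = memory_map(ids, levels)
modulated = modulate(np.ones((16, 16)), memory)
```

With a uniform saliency map, the modulated output is zero over the familiar region, halved in the row of cells that straddles the boundary, and unchanged elsewhere, mirroring how familiar areas such as the sky are inhibited.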

Gaze Generation

After modulation with the memory component, the salient locations in the scene are passed to the gaze component. This component provides high-level commands for generating gaze motions towards interesting locations in the visual field.
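One way to picture the hand-off from the modulated saliency map to gaze commands is sketched below. Everything here is an assumption made for illustration: the function name `gaze_targets`, the choice of the k most salient cells as targets, the 90-degree field of view, and the linear mapping from image position to angle (a simplification of a perspective projection) are not taken from the system itself.

```python
import numpy as np

def gaze_targets(saliency, k=3, fov_deg=90.0):
    """Return (yaw, pitch) offsets in degrees for the k most salient cells,
    most salient first (illustrative sketch only)."""
    rows, cols = saliency.shape
    # Flat indices of the k largest saliency values, in descending order.
    order = np.argsort(saliency, axis=None)[::-1][:k]
    targets = []
    for idx in order:
        row, col = divmod(int(idx), cols)
        # Cell centre relative to the view centre, in the range -0.5 .. 0.5.
        x = (col + 0.5) / cols - 0.5
        y = 0.5 - (row + 0.5) / rows
        targets.append((x * fov_deg, y * fov_deg))
    return targets

# Assumed toy map: a strong peak in the top-right corner, a weaker one near the centre.
saliency = np.zeros((16, 16))
saliency[0, 15] = 1.0
saliency[8, 8] = 0.5
targets = gaze_targets(saliency, k=2)
```

Each returned pair could then be handed to an animation layer as a look-at offset; the most salient location is visited first.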



These videos show the system in action. Our agent has entered a pub and surveys the environment before ordering a drink. In the first video, the environment is new to the agent (no items are in his short-term memory beforehand). The foveal view is shown in the second video. Note that the agent pays as much attention to the pictures on either side of the bar as to the bar itself. This is because the attention model is driven by external stimuli in a purely bottom-up manner; coupled with a top-down approach, the saliency of objects relating to the task at hand may also be increased - in this case, the bar and bartender objects.



These videos illustrate the same process in a street setting.