Walking through a crowd or driving on a busy street requires monitoring your own movement and that of others. Segmenting these other, independently moving objects is one of the most challenging tasks in vision, as it requires fast and accurate computations to disentangle independent motion from egomotion, often in cluttered scenes. The brain accomplishes this in the dorsal visual stream, which relies on heavily parallel, hierarchical processing across many areas. This study is the first to exploit the potential of such a design in an artificial vision system. We emulate large parts of the dorsal stream in an abstract way and implement an architecture with six interdependent feature extraction stages (e.g., edges, stereo, optical flow). The computationally demanding combination of these features is used to reliably extract moving objects in real time. In this way, by exploiting the advantages of a parallel-hierarchical design, we arrive at a novel and powerful artificial vision system that approaches the richness, speed, and accuracy of visual processing in biological systems.

(Caption fragment: *f* = 1/16 cyc/pixel, for the left image of the image pair in Figure 1A.)

| Resolution | Time (in ms) | fps | Platform |
|---|---|---|---|
| 320 × 256 | 33 | 30 | GPU |
| 320 × 256 | 2929 | 0.34 | CPU |
| 640 × 512 | 47 | 21 | GPU |
| 640 × 512 | 10,007 | 0.10 | CPU |

(Caption fragment: … 10^{−4} radians/frame.)

| Speed | Camera | Slow car | Fast car |
|---|---|---|---|
| x | 2.45 | 2.46 | 4.73 |
| y | 0.59 | 0.66 | 1.37 |
| z | 0.22 | 0.20 | 0.73 |
| Mag. | 2.53 | 2.55 | 4.98 |
| Real | 2.54 | 2.54 | 5.08 |

We added a constant camera rotation, **ω** = (−0.002, −0.004, −0.006)^{T} radians/frame, to the sequence (see 1 for a description of the egomotion parameters) while keeping the fifth frame unaltered. Since this rotation is added on top of the already present camera translation, the resulting sequence is highly complex (see the optical flow in Figures 3C and 3D). Nevertheless, the camera rotation is extracted with high precision, as shown in Figure 3B. It is not possible to extract disparity and independent motion in this way, since the right camera would also need to be translated (and this cannot be simulated) to maintain the rectification required by the disparity algorithm.

For each filter orientation *θ*, the spatial phase at pixel location **x** = (*x*, *y*)^{T} is extracted using 2D complex Gabor filters (Equation A1) with peak frequency *ω*_{0} and spatial extension *σ*. We use a total of eight evenly distributed orientations in our implementation. The peak frequency is doubled from one scale to the next. To accommodate this, the filters span an octave bandwidth: *B* = *ω*_{0}/3. With a cutoff frequency equal to half the amplitude spectrum, the spatial extension is then equal to *σ* = √(2 ln 2)/*B*. The highest peak frequency equals *ω*_{0} = *π*/2 rad/pixel, which results in a spatial extension *σ* = 2.25 pixels. The lower frequency responses are obtained by applying the same filters to an image pyramid that is constructed by repeatedly blurring the images with a Gaussian kernel and subsampling (Burt & Adelson, 1983). We use a total of six scales. This is a technically very feasible way to approximate the multitude of responses from neurons with different receptive field sizes. The spatial filter kernels are 11 × 11 pixels in size and separable. Since some of the responses can be reused, all eight even and odd filter responses can be obtained on the basis of only 24 1D convolutions. The spatial filter kernels are shown in Figure A1A and their frequency domain coverage when applied to the image pyramid is shown in Figure A1B.
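As a concrete illustration, here is a minimal NumPy sketch of such a filter bank and pyramid. The function names and the binomial blur kernel are our choices; the paper's separable 24-convolution optimization and any amplitude normalization are omitted.

```python
import numpy as np

OMEGA0 = np.pi / 2          # highest peak frequency (rad/pixel)
SIGMA = 2.25                # spatial extension (pixels)
SIZE = 11                   # kernel size used in the paper

def gabor_kernel(theta, omega0=OMEGA0, sigma=SIGMA, size=SIZE):
    """Complex 2D Gabor kernel: Gaussian envelope times a complex
    carrier oriented along theta (real = even, imag = odd filter)."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.exp(1j * omega0 * u)

# eight evenly distributed orientations
thetas = np.arange(8) * np.pi / 8
bank = np.stack([gabor_kernel(t) for t in thetas])

def pyramid(image, levels=6):
    """Gaussian pyramid: blur with a small binomial kernel, subsample by 2.
    Applying the same filters at each level doubles the effective scale."""
    g = np.array([1., 4., 6., 4., 1.]) / 16.0
    out = [image.astype(float)]
    for _ in range(levels - 1):
        im = out[-1]
        im = np.apply_along_axis(lambda v: np.convolve(v, g, mode='same'), 0, im)
        im = np.apply_along_axis(lambda v: np.convolve(v, g, mode='same'), 1, im)
        out.append(im[::2, ::2])
    return out
```

The even (real) part of each kernel is symmetric and the odd (imaginary) part antisymmetric, as required of a quadrature pair.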

The convolution of the image, *I*(**x**), with the oriented filter from Equation A1 can be written as

(*I* ∗ *f*_{ θ })(**x**) = *C*(**x**) + *iS*(**x**) = *ρ*(**x**)e^{ i*ϕ*(**x**)},

where *ρ*(**x**) = √(*C*(**x**)^{2} + *S*(**x**)^{2}) and *ϕ*(**x**) = atan2(*S*(**x**), *C*(**x**)) are the amplitude and phase components, and *C*(**x**) and *S*(**x**) are the responses of the quadrature filter pair (Pollen & Ronner, 1981). The ∗ operator depicts convolution.
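The quadrature decomposition can be sketched as follows. This is a NumPy illustration, not the paper's GPU implementation; the FFT-based convolution helper and function names are ours.

```python
import numpy as np

def conv2_same(img, k):
    """Linear 2D convolution via zero-padded FFT, cropped to 'same' size."""
    H, W = img.shape
    kh, kw = k.shape
    F = np.fft.fft2(img, (H + kh - 1, W + kw - 1))
    K = np.fft.fft2(k, (H + kh - 1, W + kw - 1))
    full = np.fft.ifft2(F * K)
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def amplitude_phase(image, theta, omega0=np.pi / 2, sigma=2.25, size=11):
    """Amplitude rho(x) and phase phi(x) = atan2(S, C) from the quadrature
    pair: C is the even (real) and S the odd (imaginary) filter response."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.exp(1j * omega0 * u)
    resp = conv2_same(image, g)
    C, S = resp.real, resp.imag
    return np.sqrt(C**2 + S**2), np.arctan2(S, C)
```

For a pure sinusoid at the peak frequency, the recovered phase advances by roughly *ω*_{0} per pixel along the filter orientation, which is the property the disparity and flow stages build on.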

Disparity is estimated from the phase difference between the left and right filter responses (at each orientation *θ*) by projecting the phase difference on the epipolar line (the horizontal). In this way, multiple disparity estimates are obtained at each location. We robustly combine these estimates using the median:

*δ*(**x**) = median_{ θ }[ [*ϕ*^{L}_{ θ }(**x**) − *ϕ*^{R}_{ θ }(**x**)]_{2π} / (*ω*_{0}cos *θ*) ],

where the [·]_{2π} operator depicts reduction to the ]−*π*; *π*] interval. As discussed in more detail in the next section, a coarse-to-fine control scheme is used to integrate the estimates over the different pyramid levels. Unreliable estimates are removed by running the algorithm from left to right and from right to left and looking for mutual consistency.
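A sketch of this median combination, assuming the per-orientation left/right phase maps are already available; the function names and the sign convention for disparity are our assumptions.

```python
import numpy as np

def wrap(a):
    """Reduce angles to the ]-pi, pi] interval."""
    return np.angle(np.exp(1j * a))

def phase_disparity(phi_left, phi_right, thetas, omega0):
    """Disparity from left/right phase differences: one estimate per
    orientation, projected on the horizontal epipolar line, combined
    with a median. phi_* are (n_orient, H, W) phase maps."""
    est = []
    for k, th in enumerate(thetas):
        fx = omega0 * np.cos(th)      # horizontal frequency component
        if abs(fx) < 1e-6:            # a vertical filter carries no
            continue                  # epipolar information
        est.append(wrap(phi_left[k] - phi_right[k]) / fx)
    return np.median(np.stack(est), axis=0)
```

Note that a single-scale estimate like this only covers disparities up to half a wavelength; the coarse-to-fine scheme over the pyramid removes that limitation.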

Optical flow is computed from the assumption of phase constancy over time: *ϕ*(**x**, *t*) = *c*. Differentiation with respect to *t* yields

∇*ϕ*(**x**, *t*)^{T}**u** + *ψ*(**x**, *t*) = 0,

where ∇*ϕ* is the spatial and *ψ* is the temporal phase gradient, and **u** is the optical flow. Under a linear phase model, the spatial phase gradient can be substituted by the radial frequency vector, (*ω*_{0}cos *θ*, *ω*_{0}sin *θ*)^{T} (Fleet, Jepson, & Jenkin, 1991). In this way, the component velocity, **c**_{ θ }(**x**), at pixel **x** and for filter orientation *θ*, can be estimated directly from the temporal phase gradient, *ψ*_{ θ }(**x**):

**c**_{ θ }(**x**) = −(*ψ*_{ θ }(**x**)/*ω*_{0}) (cos *θ*, sin *θ*)^{T}.

If multiple component velocities at pixel **x** are reliable, they are integrated into a full velocity by means of an intersection-of-constraints procedure:

**v**(**x**) = argmin_{**u**} Σ_{ θ ∈ O(**x**)} (*ω*_{0}(cos *θ*, sin *θ*)**u** + *ψ*_{ θ }(**x**))^{2},

where *O*(**x**) is the set of orientations that correspond to reliable component velocities.

The coarse-to-fine scheme starts at the coarsest pyramid level *k*. Using the optical flow estimate obtained at that resolution, **v**^{ k }, the phase estimate at the next higher resolution, *ϕ*^{ k−1}, is warped in such a way that the estimated motion is removed (Bergen, Anandan, Hanna, & Hingorani, 1992). The factor (3 − *t*) ensures that each pixel in the five-frame sequence (*t* = 1, 2, …, 5) is warped to its corresponding location in the center frame (*t* = 3). Bilinear interpolation is used to perform subpixel warps. The warped phase is then used to compute the residual motion. This process is repeated until the pyramid level corresponding to the original image resolution is reached.
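A minimal warping sketch under these conventions; the sign of the warp and the omission of the flow rescaling between pyramid levels are simplifications of ours.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Sample img at float coordinates with bilinear interpolation."""
    x0 = np.clip(np.floor(xs).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, img.shape[0] - 2)
    dx = np.clip(xs, 0, img.shape[1] - 1) - x0
    dy = np.clip(ys, 0, img.shape[0] - 1) - y0
    return ((1 - dy) * ((1 - dx) * img[y0, x0] + dx * img[y0, x0 + 1])
            + dy * ((1 - dx) * img[y0 + 1, x0] + dx * img[y0 + 1, x0 + 1]))

def warp_to_center(frame, t, v, center=3):
    """Warp frame t of a five-frame sequence toward the center frame
    (t = 3) so that content moving with per-frame velocity v lands at
    its center-frame position. Sign conventions are an assumption."""
    ys, xs = np.mgrid[0:frame.shape[0], 0:frame.shape[1]].astype(float)
    return bilinear(frame, xs + (t - center) * v[0], ys + (t - center) * v[1])
```

After warping, any phase change that remains between frames is the residual motion estimated at the next finer level.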

Edges are detected using phase congruency, computed from the filter responses over all scales *k*:

*E*(**x**) = Σ_{ k } *ρ*_{ k }(**x**)e^{ i*ϕ*_{ k }(**x**)},    *PC*_{1}(**x**) = |*E*(**x**)| / Σ_{ k } *ρ*_{ k }(**x**).

*PC*_{1}(**x**) will reach a maximum value of unity if all components have the same phase.

The refined measure is

*PC*_{2}(**x**) = Σ_{ k } *W*(**x**)⌊*ρ*_{ k }(**x**)(cos(*ϕ*_{ k }(**x**) − *ϕ̄*(**x**)) − |sin(*ϕ*_{ k }(**x**) − *ϕ̄*(**x**))|) − *T*⌋ / (Σ_{ k } *ρ*_{ k }(**x**) + *ɛ*),

where *W*(**x**) is a factor that weights for frequency spread (derived from the distribution of response amplitudes), *ϕ̄*(**x**) is the mean phase angle (see Figure A3), *T* is a threshold to remove noisy components with low energy values, and *ɛ* is a small factor to avoid division by zero. The operator ⌊ ⌋ converts negative values to zero and leaves positive values unchanged. Subtracting the magnitude of the sine of the phase deviation from the cosine improves localization.
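A sketch of the congruency computation for one orientation. The frequency-spread weight *W*(**x**) is set to 1 here, since its exact form is not given above; the default values of *T* and *ɛ* are our assumptions, and all other terms follow the description.

```python
import numpy as np

def phase_congruency(rho, phi, T=0.1, eps=1e-8):
    """Phase congruency over scales for one orientation.
    rho, phi: (n_scales, H, W) amplitudes and phases.
    Uses cos(deviation) - |sin(deviation)| for sharper localization,
    soft-thresholds low-energy components at T (the floor operator),
    and normalizes by the total amplitude. W(x) is taken as 1."""
    # mean phase angle from the summed response vectors
    E = np.sum(rho * np.exp(1j * phi), axis=0)
    phi_bar = np.angle(E)
    dev = phi - phi_bar[None]
    num = rho * (np.cos(dev) - np.abs(np.sin(dev))) - T
    num = np.maximum(num, 0.0)          # floor: negatives -> 0
    return np.sum(num, axis=0) / (np.sum(rho, axis=0) + eps)
```

When all scales agree in phase the measure approaches unity; phases spread over the circle drive it toward zero.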

where *PC*(*θ*) is the phase congruency value determined at orientation *θ*. The maximum moment of the orientation-dependent phase congruency values, *M*, is then used as a measure for the presence of an edge.

where **t** = (*t*_{ x }, *t*_{ y }, *t*_{ z })^{T} is the translational velocity, **ω** = (*ω*_{ x }, *ω*_{ y }, *ω*_{ z })^{T} is the rotational velocity of the moving observer, and *d*(**x**) is the inverse depth. Due to the bilinear nature of the problem (**t** and *d*(**x**) appear as a product in Equation A19) and the robust estimation procedure, the estimation process is sensitive to local minima. We reduce this sensitivity by evaluating a number of initializations (*n* = 32) in parallel. The median of the residuals is used to evaluate the quality of the estimates and to determine the outlier rejection threshold. We use an approximation to the median that is more suitable for parallel implementation.

The motion-based inverse depth, *d*_{ M }(**x**), can be computed from the estimated egomotion and the optical flow (Zhang & Tomasi, 2002).

*d*_{ M }(**x**) is relative due to the scale ambiguity that occurs in monocular egomotion estimation (*d*(**x**) and **t** appear as a product in Equation A16). For parallel cameras with baseline *b* and unity focal length (again, without loss of generality), binocular disparity, *δ*(**x**), is also related to inverse depth: *δ*(**x**) = −*b*/*z*. Consequently, the mapping between the two can be easily obtained in a robust fashion.
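The text does not spell out the robust estimator for this mapping; a median of per-pixel ratios is one simple choice consistent with the use of medians elsewhere in the pipeline. The function name and the reliability mask are our assumptions.

```python
import numpy as np

def disparity_scale(d_M, delta, reliable):
    """Scale S relating disparity to the (relative) motion-based inverse
    depth, taken as the median ratio over mutually reliable pixels.
    Gross outliers in either estimate are suppressed by the median."""
    return np.median(d_M[reliable] / delta[reliable])
```

The disparity-based inverse depth then follows as *d*_{ D }(**x**) = *S* *δ*(**x**).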

The scaled, disparity-based inverse depth then becomes *d*_{ D }(**x**) = *S* *δ*(**x**), where *S* is the robustly estimated scale factor. We then use this measure to compute the ego-flow, the flow that would have been observed in an entirely static environment.

Assuming the independent motion is translational with velocity **t**^{ I }, the following model should be able to explain the differences between the optical flow and the ego-flow.

Processing times (in ms, except fps):

| Stage | GPU, 320 × 256 | GPU, 640 × 512 | CPU, 320 × 256 | CPU, 640 × 512 |
|---|---|---|---|---|
| Gabor filtering | 3 | 5 | 150 | 608 |
| Edges/flow/disparity | 7 | 14 | 1406 | 6926 |
| Egomotion | 19 | 20 | 1003 | 1003 |
| Indep. flow seg. | 4 | 8 | 370 | 1470 |
| Total | 33 | 47 | 2929 | 10,007 |
| fps | 30 | 21 | 0.34 | 0.10 |