Abstract
The ability to quickly recognize the number of objects in our environment is a fundamental cognitive function. However, it is far from clear which computations and which actual neural processing mechanisms are used to provide us with such a skill. Here we try to provide a detailed and comprehensive analysis of this issue, which comprises both the basic mathematical foundations and the peculiarities imposed by the structure of the visual system and by the neural computations provided by the visual cortex. We suggest that numerosity should be considered as a mathematical invariant. Making use of concepts from mathematical topology—like connectedness, Betti numbers, and the Gauss–Bonnet theorem—we derive the basic computations suited for the computation of this invariant. We show that the computation of numerosity is possible in a neurophysiologically plausible fashion using only computational elements which are known to exist in the visual cortex. We further show that a fundamental feature of numerosity perception, its Weber property, arises naturally, assuming noise in the basic neural operations. The model is tested on an extended data set (made publicly available). It is hoped that our results can provide a general framework for future research on the invariance properties of the numerosity system.
Introduction
Information about the number of objects in the environment can be extracted in a fast and effortless fashion by the visual systems of humans and other species. Access to this information is a crucial factor in evolution: How many predators am I confronted with? From which tree can I get the larger amount of food? Questions like these show that survival can essentially depend on having access to this type of knowledge. It is therefore not surprising that the estimation of number—or of its approximation, numerosity (Brannon,
2006)—is also considered to be a fundamental property of cognition.
Humans recognize numbers rapidly and precisely up to four (“subitizing”; Kaufman, Lord, Reese, & Volkmann,
1949), but we also estimate larger numbers in a rapid fashion, although there with increasing errors. A typical characteristic of the numerosity system is that the errors depend on the set size, in accordance with the Weber–Fechner law (Gallistel & Gelman,
1992). Whether these characteristics support the idea of two clearly distinct subsystems or reflect different operation modes of a general number system is still under debate (Feigenson, Dehaene, & Spelke,
2004; Piazza, Mechelli, Butterworth, & Price,
2002; Ross & Burr,
2010). Not only has estimation of numerosity been reported for human adults (Whalen, Gallistel, & Gelman,
1999), but it has also been shown that even 6monthold infants were able to perform a numberdistinction task (Xu & Spelke,
2000). Investigation of the development of mathematical competence and the ability for numerosity estimation in children suggests that mathematical ability is correlated with acuity in numerosity estimation (Halberda, Mazzocco, & Feigenson,
2008). But number estimation is not restricted to humans with mature cognitive abilities—it has also been found in infants and animals (Brannon,
2006; Nieder, Freedman, & Miller,
2002), and recently even in invertebrates (Gross et al.,
2009).
In general, numerosity is one of the most abstract properties in the environment, and its perception is almost independent of spatial attributes like orientation (Allik, Tuulmets, & Vos,
1991) and of the shape of the objects (Strauss & Curtis,
1981). It is also not confined to a specific sensory modality (Starkey, Spelke, & Gelman,
1990). And finally, numerosity estimation extends beyond the estimation of the cardinality of objects. The quantity of physical properties like sound volume, space, and time shows similar characteristics (Bonn & Cantlon,
2012). This suggests that there might exist a generalized system for magnitude estimation (Gallistel & Gelman,
2000; Walsh,
2003).
The standard view of cortical organization as a localtoglobal processing hierarchy (Hegde & Felleman,
2007) which goes from basic sensory properties toward the representation of the most abstract properties on top of the hierarchy would suggest that numerosity has to be considered a very highlevel feature. On the other hand, singlecell recordings show that neural reactions to numerosity are quite fast, approximately 100 ms in macaques (Roitman, Brannon, & Platt,
2007). Likewise, human evoked potentials show numberspecific responses as early as 75 ms (Park, DeWind, Woldorff, & Brannon,
2015). This indicates that number processing starts at a relatively early level.
The reaction times in numerosityestimation tasks are independent of the number of elements, suggesting that numerosity is processed in parallel (Mandler & Shebo,
1982). Physiological results also argue for a parallel extraction of numerosity (Nieder et al.,
2002). In addition, there is evidence for a “direct visual sense for number,” since number seems to be a primary visual property like color, orientation, or motion, to which the visual system can be adapted by prolonged viewing (Ross & Burr,
2010). And finally, there is an ongoing debate about whether we have a true sense of number (Anobile, Cicchini, & Burr,
2014; Ross & Burr,
2010) or whether our apparent number sense is in fact just a variant of texture perception, namely the perception of texture density (Dakin, Tibber, Greenwood, Kingdom, & Morgan,
2011; Durgin,
2008; Raphael, Dillenburger, & Morgan,
2013).
All this shows that the understanding of numerosity should be regarded as a constitutive element for our understanding of perception and cognition. And for this it is of obvious importance to understand how numerosity is computed by the visual system. However, there is as yet no agreement upon a canonical structure for models which address the problem of numerosity perception. Rather, there exists a variety of different model approaches (e.g., Dakin et al.,
2011; Dehaene & Changeux,
1993; Meck & Church,
1983; Stoianov & Zorzi,
2012; Verguts & Fias,
2004; Zetzsche & Barth,
1990b), and it is unclear how they are exactly related to each other and whether they all share some common basis. It is also important to note that some of these models cannot be considered as full computational models that can be realized in a neurobiologically realistic fashion. Such models have to account for the complete processing chain, from the retinal image over neurobiologically plausible transformation stages to the final number assignment.
To our knowledge, the first model that matched these criteria was suggested by
Zetzsche and Barth (1990b), and their earlier results constitute an important basis for our present analysis. A very influential model that also matches the criteria (up to the point that it is only a 1D model) has been suggested by Dehaene and Changeux (
1993). This model is based on different processing stages, with the essential components being differenceofGaussian filters of different sizes and an accumulation system. These filters restrict the model to represent all stimuli as bloblike features, which limits its invariance properties with respect to the shape of the elements in a stimulus. Whether human observers show related deviations from invariance which can be attributed to a bloblike representation remains to be determined.
A more recent model by Dakin et al. (
2011) uses texture density computed by a ratio of pooled high and lowspatialfrequency filter outputs. Assuming that the high spatial frequencies are largely determined by the number of objects, an estimate for numerosity is determined by the product of area and density. By definition, the pooled highfrequency output depends on the length of the object contours presented in the stimulus. The model tests so far have used stimuli consisting of similar objects, such that numerosity is approximately proportional to accumulated contour length. The degree of invariance of the model with respect to comparisons involving elements of very different shape, and therefore substantially different contour length, has not yet been systematically tested.
Another model which can be seen to match the criteria, by Stoianov and Zorzi (
2012), is based on unsupervised learning in a deep recurrent network. Neuralnetwork models are very valuable insofar as they demonstrate that a capability like numerosity perception can indeed be learned by a biological system (Hornik, Stinchcombe, & White,
1989). The training images were binary and the elements presented were rectangles, triangles, or ovals, so that only moderate shape variations were investigated. Whether this model and its abstract mathematical counterpart (Cappelletti, Didino, Stoianov, & Zorzi,
2014; Stoianov & Zorzi,
2012) can provide the desired invariance properties for arbitrarily shaped elements thus remains to be determined.
In our view, this situation suggests that a systematic account of the logical, mathematical, and computational requirements for a sense of number is highly desirable. A formal analysis through a hierarchy of abstraction similar to Marr's (
1982) approach and a discussion of the relation to the philosophy of mathematics can be found elsewhere (Kluth & Zetzsche,
2015). Simplified early variants of the model which represent objects as “rightangled” polygons and make use of a generalized eigenvalue operation have also already been described (Zetzsche & Barth,
1990b; Zetzsche, Gadzicki, & Kluth,
2013). First results on the present version of the model have been presented in a conference paper (Kluth & Zetzsche,
2014). In this article the focus is on the invariance properties of the model, on its neural realization, and on its quantitative predictions regarding human behavior.
The article is organized as follows. We start with a specification of the preconditions and the general problem statement, and present then the specific questions that have to be answered. In the following section we describe our approach to the problem: Numerosity should be considered as a mathematical invariant, and the zeroth Betti number, a specific topological invariant, should be considered as the ideal solution. We then derive a realistic approximation of the ideal solution by considering the image luminance function as a curved surface and by making use of the Gauss–Bonnet theorem to compute the number of simply connected components in the image. We next consider the neurobiological plausibility of the suggested computations and show that required hardware can be assumed to be available in the visual cortex. The model is then tested for a variety of differentshaped elements arranged in varying configurations. For these tests, we provide an extended data set, which will also be made publicly available, containing diverse shape variations. In the next section we investigate the relation of the model to human behavior. Reasonable assumptions about neural noise are shown to lead to valid predictions about behavior in different tasks and about the corresponding Weber fractions. The article closes with a discussion in which we compare our model to other suggested models, identify the different invariance properties of the models, and consider what testable predictions can be deduced from this comparison.
Mathematical principles and models
Definition of the problem
The common mathematical basis for numbers is given in set theory. The cardinality of a set is closely related to our understanding of what is meant by numerosity, as it describes how many elements the set contains. But representations of numbers seem to be somewhat arbitrarily chosen, because numbers are just a formal construction to provide a label for the equivalence classes of the relation “having the same cardinality.” The definitions of the natural numbers and their Arabic numerals are both concepts which can have an effect on numerical cognition. However, they do not provide a basis for the mental representation of numbers, as species without access to these concepts are also able to estimate the number of objects in a configuration. Another problem is that the concept of cardinality in set theory is based on a clear distinction between the elements within a set, and this cannot be assumed as a general property of the perceptual process. As a consequence, set theory does not provide the tools and concepts to properly describe how the number of objects in the real world is derived by perceptual processes from basic properties of these objects. The biological representation of number derived from perceptual processes is also referred to as numerosity.
In the context of the
perception of numerosity we thus have to consider a perspective which is different from the abstract mathematical realm. Here we are confronted with the problem of inferring numerosity from the physical world. Constitutional aspects of this problem are the notions of space and object. We envisage space here as the continuous realvalued space ℜ
^{3}. We further assume that this 3D space is not completely empty but contains a configuration of matter—i.e., at each position (
x,
y,
z) we have either some matter of type
m_{j} (with multiple assignments not allowed) or empty space (vacuum). Is it possible to assign in a meaningful way a numerosity to this spatial configuration of matter? And what are the requirements for such an assignment? The critical concept here is the notion of an object. We think here of ordinary objects like dogs, trees, tables, and so on. But what makes up an object formally? At least since Descartes, a very common conception of a physical object (body) is that of a contiguous bounded region of matter in threedimensional space. This region is distinguished from its surround. The approach of establishing objects by such a contiguous region of matter is also supported from a perceptual point of view by evidence from gestalt theory. The connectedness of pieces strongly affects grouping into one object (Palmer & Rock,
1994). Moreover, it has been shown that this kind of connectedness has an effect on numerosity estimation (Franconeri, Bemis, & Alvarez,
2009; He, Zhang, Zhou, & Chen,
2009). One critical issue here is how the distinction between object and surround is established. For the context of this article we will not further pursue this question; we will just assume that this is achieved in some reasonable fashion. An object is thus restricted to a connected subset of the real world, which means that it cannot be represented by a union of elements, i.e., disjoint subsets.
The world in which we live can be assumed to be fourdimensional if we consider time as an additional dimension. Within the context of this article we restrict the world and the objects to be static such that they can be represented by and within a threedimensional vector space, the realvalued space ℜ
^{3}. We then have our configuration of one or more contiguous regions of 3D space, and we can assign a label—the
objecthood label—to each point (
x,
y,
z) of the interior of the region belonging to an object. We further assume that there exists something like a background in the form of the complement of the union of all object regions (in physical terms this could be air, water, vacuum, etc.), and we can assign the
background label to all points of this background. For simplicity, let us also assume that any two objects can come arbitrarily close together but can never have full contact, so that there is always at least a tiny bit of background in between them. A configuration
c(
x,
y,
z) can then be defined by the associated objecthood function. In mathematical terms this means that if multiple objects
O_{i} are given as subsets of ℜ
^{3}, building a configuration set
C = ∪
_{i}O_{i}, the objecthood function defining a configuration becomes the characteristic function with respect to
C—i.e.,
c(
x,
y,
z) =
χ_{C}(
x,
y,
z).
Figure 1 shows an example for a possible configuration.
Having specified the formal conditions so far, we now address the following questions:

How can we assign a numerosity to this configuration in a mathematically reasonable way?

How can we compute this numerosity from visual measurements?

How can this computation be realized with the available neural hardware of the human visual cortex?

How does all this relate to the human perception of numerosity?
Numerosity is an invariant
As a first step we consider the mathematical problem behind the numberassignment problem addressed by the first question. Let us consider the set 𝒞, which has as elements all possible configurations c(x, y, z)—i.e., all the configurations of objects in the sense specified earlier. The relation ∼ defines equinumerosity: Two configurations are equinumerous if each has the same numerosity. The set 𝒞 is then structured into subsets according to the equinumerosity relation ∼, which means that each subset is an equivalence class with respect to ∼. Let us define the property N(c(x, y, z)) as the numerosity of objects in the configuration c. Since it is true that whenever c_{i} ∼ c_{j}, then N(c_{i}) = N(c_{j}) for two configurations c_{i} and c_{j}, the property N—the numerosity—is an invariant under the relation ∼. Note that the fundamental difference from the classical settheoretical approach is as follows. Given a configuration c with respect to a union of objects O_{i}, we only have access to C = ∪_{i}O_{i}. But determination of cardinality in a settheoretical manner would require access to {O_{1}, …, O_{n}}—i.e., an explicit distinction between the objects.
As next step we have to take a closer look at the specific nature of the invariant
N. Invariant properties are usually specified with respect to transformations
T which map an object configuration to another one. Which transformations
T have to be considered? The simplest class are obviously changes of position (
Figure 2a), and of course the numerosity
N(
c(
T[
x,
y,
z])) should not be influenced by such transformations
T. The same is true for other geometric properties, like size and orientation changes (
Figure 2b,
c), and in general every affine geometric transformation
T should be admissible while leaving the number value invariant. However, geometric transformations are not sufficient, since in addition to changes of the geometric properties of the objects it should also be possible to change their
shapes. Is there a class of transformations which enables an arbitrary change of the shapes of the objects? And what is the appropriate mathematical formalism that can provide suitable invariants with respect to this class of transformations?
We suggest that the appropriate formalism is provided by topology. Loosely stated, topology is the mathematical discipline that deals with those problems that do not depend on the exact shape of the objects and configurations involved. The topological structure of the support of a configuration (or more generally, a topological space) is described by the series of Betti numbers. The kth Betti number is the rank of the kth simplicial cohomology group. A more intuitive interpretation of these numbers is that they measure the number of kdimensional holes of the space. The zeroth Betti number b_{0} is the most interesting one with respect to numerosity estimation, as it measures the number of connected components. Each configuration consists of multiple contiguous objects so that the zeroth Betti number of the support C is equivalent to the number of objects—i.e., N(c(x, y, z)) = b_{0}(C).
As a side remark, it is of interest whether the proposed invariant is a topological invariant. In this case, the transformations would be restricted to homeomorphisms—i.e., continuous and bijective mappings whose inverse is also continuous. This class of operations not only includes the geometric transformations illustrated in
Figure 2a through
c but also allows the complete change of object shapes as illustrated in
Figure 2d. Homeomorphisms are structure conserving in the sense that the kind of connectedness does not change. Cutting, tearing, and gluing of objects, for example, are no homeomorphisms in general and could cause a change of a topological invariant. The proposed invariant, the zeroth Betti number, is a topological invariant insofar as it does not change under homeomorphisms. However, in a sense the zeroth Betti number can be seen as a stronger invariant, as it does not change its value for a broader class of transformations, for example the mapping from a sphere to a torus. This distinguishes it from another weaker topological invariant, the Euler characteristic, which we will later consider in more detail. The most general requirement for the invariance of the zeroth Betti number is only that each object remain somehow connected.
Now that we have specified the formal solution to the problem with the concept of the zeroth Betti number, we next have to consider the computational aspects: What kind of algorithms exist for the computation of b_{0}? How can we apply them to real sensory data? In answering these questions we will first proceed through some more abstract levels before finally deriving a concrete model for estimating numerosity from a discrete twodimensional intensity image.
We start with the consideration of algorithms. In digital topology—i.e., the research area dealing with the computation of topological properties in image processing—there exists a class of closely related algorithms referred to as
connected component labeling. As the name states, the algorithms are based on a labeling strategy, and the number of connected components results as a byproduct of the number of different labels that have been assigned on termination of the algorithm. The main problem in making use of these algorithms in the context of numerosity is that our current knowledge of numerosity perception assumes a parallel, almost instantaneous process for its computation. To our knowledge, there exist no hints toward a specific temporal dependence on reaction times for numerosity estimation, with the only exception of sequential or mental counting, which is assumed to rely on different mechanisms (Mandler & Shebo,
1982; Trick & Pylyshyn,
1994). However, all known algorithms from digital topology for computing the number of connected components have a substantial serial component. Many of them show a direct runtime dependence on the number of objects (see, e.g., He, Chao, Suzuki, & Wu,
2009), whereas it is a hallmark of numerosity perception that the reaction times are independent of the number of objects in the stimulus. At present it is thus not possible to determine the number by a connectedcomponentlabeling algorithm which is compatible with our current knowledge on numerosity perception and the neural architecture of the visual cortex. (Note that this does not rule out the possibility that new insights on the former or the latter, or both, will turn up in the future. The issue should thus not be removed from the research agenda.)
So how can we derive a neurobiologically realistic—that is, parallel—algorithm for the computation of numerosity? For this, we have to accept a minor restriction with respect to the most general, ideal solution. An important special case results from the assumption that all objects are approximately
simply connected—i.e., objects have no holes. In this case the zeroth Betti number equals another important topological invariant, the Euler characteristic. Although the Euler characteristic is a somewhat weaker invariant, due to being a topological invariant it will remain constant if any homeomorphism is applied to a simply connected object. This allows us to assign a constant value to a wide variety of different shapes (cf.
Figure 2). This invariant is of special importance in the current context, since there exists a famous theorem which allows its
parallel computation from local measurements.
The Gauss–Bonnet theorem provides a connection between topology and differential geometry and relates the Gaussian curvature of twodimensional surfaces to their Euler characteristic. It thus becomes possible to represent the number in terms of the Euler characteristic. We first provide the theorem in its basic form for the case of simply connected objects, which we assume to have closed smooth surfaces:
Theorem 1 (Gauss–Bonnet, basic version) Let
S ⊂ ℜ
^{3} be a closed surface (of class
C^{3}) without boundary; then
where
K is the Gaussian curvature and
χ is the Euler characteristic.
How can we understand the invariance properties provided by the Gauss–Bonnet theorem? The Gaussian curvature K of a regular surface is defined as the product of its principal curvatures, the minimal and maximal normal curvatures of two orthogonal planes being both orthogonal to the tangent plane of the surface. The local shape of a surface can be distinguished by its Gaussian curvature in elliptic (K > 0), hyperbolic (K < 0), and parabolic (K = 0) parts. For general shapes, the invariance is provided by (a) a tradeoff between area and curvature, (b) a tradeoff between elliptic and hyperbolic regions, and (c) the fact that parabolic regions never produce any contribution, independent of their spatial extension.
Let us consider, as a special and particularly simple case, the surface
S of a sphere, as shown in
Figure 3. It is completely elliptic (light blue); the Gaussian curvature is constant everywhere and depends only on the radius of the sphere. The interplay between the surface area and the Gaussian curvature is just a multiplication. If the radius is decreased (top of
Figure 3b,
c), the surface area decreases while at the same time the Gaussian curvature increases. The resulting surface integral—i.e., the product of Gaussian curvature and surface area—is always 4
π, independent of the size of the sphere (the Euler characteristic
χ(
S) of any sphere is 2). Note that the deformations of the sphere are exactly the topological transformations previously considered. As decreasing the radius of a sphere is a homeomorphism, Theorem 1 states that the integral over the Gaussian curvature is constant. The surface integral also does not change if the sphere is dented, as can be seen in the middle row of
Figure 3b and
c. Here a hyperbolic (red) curvature emerges at the transition region between sphere and dent. Since the curvature does not change except in the region of the dent, curvature with a negative sign always implies the emergence of another elliptic part with a higher absolute value. We can find this elliptic part in the middle of the dent, which can be seen in
Figure 3c. The Gaussian curvature of a regular surface is a continuous function and thus there exists at least a curve between hyperbolic and elliptic parts on the surface where the curvature is parabolic. The influence of a homeomorphism producing a bulge on the surface of a simply connected object and its Gaussian curvature is also shown in
Figure 3.
The invariant provides the desired property that each object, irrespective of its shape, will generate the same contribution. But this is not sufficient, since for measuring the number of objects of a configuration we also want the contributions of all objects to be summed up. The linearity of the integral operator directly yields this important property, the
additivity of the invariant. Assume
to be the union of surfaces of pairwise disjoint simply connected objects
O_{i}. With
χ(∂
O_{i}) = 2,
i = 1, …,
n, the integral becomes
The invariant of a configuration of n objects thus amounts to just n times the basic invariant. We can therefore sum up all local curvature measurements of a configuration of simply connected objects without any further a priori knowledge to obtain the number of objects of the configuration. In principle, the problem of estimating the number of objects in a scene is thereby solved.
Sensor implementation
The challenging part of the informationgathering process is thus how the local Gaussian curvature can be estimated by a sensory modality. Tactile scanning of the object's surface and visual sensing are both possible sensor strategies. Multisensory approaches also seem possible, and with technical depthsensing devices, as used in robots, the number of possibilities becomes quite high. However, in this article we restrict our investigation to the human visual system. Here we can distinguish two basic possibilities. First, we can assume the existence of some sort of 3D representation at an intermediate stage of the visual system. Then it would in principle be possible to derive the local Gaussian curvature of the surface of an object from this 3D representation and make direct use of Theorem 1 for computing numerosity. In our view, this is a possibility that should indeed not be neglected. However, the representation of 3D information in the visual system is not yet completely understood, and the additional processing time seems to be in conflict with the very fast reaction times at which numerosity can be estimated. We thus consider here a second, more plausible possibility: direct estimation of numerosity from the 2D luminance function l = l(x, y).
Given that the solution by the Gauss–Bonnet theorem is specified in terms of surfaces in 3D space, making use of it for 2D luminance images requires looking at two aspects. First, we have to consider the relation between the 3D spatial configuration and the resulting 2D image. And second, we have to investigate how the formalism can be adapted to the processing of 2D luminance functions. An example for the luminance function of a weakly lightened 3D cube is illustrated in
Figure 4a. The interplay between lighting and the reflectance properties of the object's surface results in a mapping from the real world to the sensed 2D luminance function. The details of this mapping have been investigated by, e.g., Phong (
1975) and Schlick (
1993), and its relation to human perception has been analyzed by Koenderink and van Doorn (
2003). However, for our investigations we consider only basic versions of this mapping. The sensor operator
G maps this physical properties of a configuration
c to the luminance function
such that the sensory process becomes
G(
c) =
l. With the simplest assumption that
G is an orthogonal projection not incorporating lighting, the resulting luminance level is always constant for the objects. If we assume planar surfaces to be projected this way, the resulting setup matches the visual stimuli commonly used in psychological studies (e.g., He, Zhang, et al.,
2009; Piazza, Izard, Pinel, Le Bihan, & Dehaene,
2004). In particular, the case for objects resulting in rightangled polygons as projections was considered in previous works (Zetzsche et al.,
2013; Zetzsche & Barth,
1990b).
However, we also want to consider the case in which
G takes lighting of the objects into account (such that it does not result in a constant luminance level of the objects' projections). For this, we additionally assume that the operator
G preserves the differentiability on the objects' surface. The configuration
c is thus mapped to an almosteverywhere sufficiently smooth function
l. The background is assumed to be uniform and clearly separated from the objects. The discontinuity between objects and the environment is thus projected to the luminance function (compare
Figure 4a).
With these assumptions it becomes easy to apply the Gauss–Bonnet theorem to the luminance function in order to estimate the numerosity of perceived objects—i.e., the Euler characteristic as related to local features of the luminance function. The only conceptual step needed is to consider the luminance surface Ω ⊂ ℜ
^{3}, which is defined by the perceived luminance function
l—i.e., Ω:= {(
x,
y,
z) ∈ ℜ
^{3}
x,
y ∈ ℜ,
z =
l(
x,
y)}. For example, in
Figure 4b the luminance surface of the luminance function in
Figure 4a is shown.
However, because the luminance surface of an object is no longer a closed surface, but has a boundary, we have to consider the complete version of the Gauss–Bonnet theorem:
Theorem 2 (Gauss–Bonnet, complete version) Let
S ⊂ ℜ
^{3} be a regular oriented surface (of class
C^{3}), and let
R be a compact region of
S with the boundary
∂R. Suppose that
∂R is a simple, closed, piecewise regular, positively oriented curve. Assume that
∂R consists of
k regular arcs
∂R_{i} (of class
C^{2}), and let
θ_{i} be the external angles of the vertices of
∂R. Then
where
K is the Gaussian curvature, κ
_{g} is the geodesic curvature, and
χ is the Euler characteristic.
Theorem 2 provides a solution which incorporates the luminance surface properties directly. Assuming a smooth space curve Γ which is the boundary of a closed region
S on the luminance surface Ω—i.e., Γ =
∂S—we have
where κ
_{g} denotes the geodesic curvature and the Gaussian curvature
K is given by
where the subscript letters denote the derivative in the respective direction. The consideration of the space curve Γ is necessary because the integral of the Gaussian curvature over the whole luminance surface of arbitrary luminance functions would result in the same quantity. This means that in this case all images have the same Euler characteristic (Barth, Ferraro, & Zetzsche,
2001). Using the parametrization
ϕ(
x,
y) := (
x,
y,
l(
x,
y))
^{T} of the surface
S ⊂ Ω and the Gaussian curvature given by
Equation 5, the first integral can be computed by
where
χ_{S} is the characteristic function with respect to the set
S. In order to calculate the second integral, the geodesic curvature of the boundary curve Γ must be estimated. Using the parametrization
θ :
I → ℜ
^{3} with
θ(
t) := (
x(
t),
y(
t),
l(
x(
t),
y(
t)))
^{T} of the boundary curve and the assumption
l(
x(
t),
y(
t)) =
τ is a constant ∀
t ∈
I, we can determine the geodesic curvature by
The parameter
τ is considered as a threshold in the remainder of this article. The additional assumption of constant height allows us to eliminate the second derivatives of the parametrization. The second integral in
Equation 4 thus becomes
where
χ_{Γ} is the characteristic function with respect to the set Γ. Assuming constant height, the first derivative of the parametrization yields
x′ = −(
l_{y}/
l_{x})
y′. Consequently, the geodesic curvature depends only on differential operators applied to the luminance function and one first derivative
x′ or
y′ of the parametrization of the boundary curve. Finally, the numerosity of objects
N from a union
S of pairwise disjoint regions of luminance surfaces can be determined by
where
and
. The computation now depends on three quantities
K̃,
χ̃_{S}, and κ̃
_{g} depending on surface properties and the quantity
χ̃_{C}, which has to extract curve properties. All quantities have in common the fact that they represent local properties of the luminance surface.
Regarding the implementation, images are assumed to be positive realvalued matrices. In order to replace the continuous operators by their discretized approximations, we have to guarantee sufficient smoothness on the luminance surface. If this is not satisfied, high numerical errors can emerge. In order to obtain differentiability in the discrete representation, each stimulus is additionally lowpass filtered using a Gaussian kernel g : [−1, 1] × [−1, 1] → ℜ with fixed standard deviation σ = 0.04 for all computations considered within this article (if not otherwise mentioned). The images are assumed to be defined by functions l : [−1, 1] × [−1, 1] → [0, 1]. The discretization assumed to be done with a step size of Δx = Δy = 1/50 results in image matrices with 100 × 100 pixels. The threshold height τ is determined by half the maximum luminance value within an image.
Neural implementation
A crucial question which immediately arises for any model being based on heavily theoretical arguments is whether such a model is still neurobiologically plausible. We show that the computation relies on four important properties which have been reported to be already available in the early visual cortex (up to V2):
orientation selectivity—a basic property of linear filter functions in V1 (Daugman,
1985);
2D selectivity—selectivity to curved features and 2D features has been reported for V1 and V2 (HashemiNezhad & Lyon,
2012; Ito & Komatsu,
2004);
gain control—an essential normalization property of networks (Carandini & Heeger,
2012); and
integration—spatial summation as a special case of a linear filter operation. We derived a possible architecture for the implementation of
Equation 9, which is illustrated in
Figure 5. The processing direction is from left to right. First the input is filtered with a 2D Gaussian kernel. The filtered input is then used to compute the first and secondorder derivatives of the luminance function as well as the threshold value used to control the boundary curve. The computation of the derivatives builds the second path (blue block “Linear filter”) with its origin in the filtered input image. The threshold value in the model depends on the maximum of the whole Gaussianfiltered luminance function. Given the derivatives of the luminance function, a bunch of multiplications must be realized; compare the red block “GCProduct” in
Figure 5. The resulting products then must be summed (blue block “GCSum”). The output of this stage is fed into a ratiocomputation stage (block “GCNorm”). After this stage, we finally have two local nonlinear features available,
K̃(
x,
y) and κ̃
_{g}(
x,
y). These are assumed to be computed across the whole visual field but are selectively gated by the threshold mechanism such that κ̃
_{g} values are only passed at the boundary locations, and
K̃ values are only passed within the interior regions to the final global summation stage (blue block “Integration”). This global summation provides the estimate of the numerosity of the input pattern.
The architecture requires the computation of several linear and nonlinear features across the visual field. Nominally, this results in a much higher computational complexity than is required for the models of Dakin et al. (
2011) or Stoianov and Zorzi (
2012). However, a comparison must also take into account the existing structure of the visual cortex and the generality of the model. Regarding the first aspect, we describe later in detail how basically all required computations are known to exist in visual cortex (V1 and V2). In particular, there is an overcomplete representation of the input (Barlow,
2013), which includes not only a complete orientationselective decomposition but also a provision of divisive normalization and selectivity for curvature. In this sense, the proposed model does not imply a significantly higher computational complexity than is assumed to exist in visual cortex.
Nevertheless, according to Occam's razor a simpler model always has to be preferred if it has the same explanatory power. This brings us to the second aspect that has to be considered in the comparison of models with respect to their computational complexity. We argue in this article that the invariance desirable for the computation of numerosity should include an invariance to shape properties. The invariance provided by the models of Dakin et al. (
2011) and Stoianov and Zorzi (
2012) is more restricted than the invariance of our model (see
Discussion). It is clear that with restricted demands on invariance it is possible to find a computationally simpler algorithm. What is unclear is which degree of invariance is provided by the human system. The proper application of Occam's razor thus becomes an empirical question about the invariance properties of human numerosity perception.
Now let us consider the neurophysiological plausibility of the computations in our model in a stepbystep fashion. The Gaussian filtering operation is commonly assumed to be available from the earliest stage of the visual system; in particular, it could be realized in the ganglion cells of the retina (Kuffler,
1953; Marr & Hildreth,
1980). The key operations in the model are directional derivatives of first and second order (combined with the Gaussian filter for regularization). Translated into receptivefield properties, these are orientationselective mechanisms with odd and even symmetry. It is well known that Gaussian derivatives are well suited for the description of neurons in the primary visual cortex (e.g., Koenderink & van Doorn,
1990; Lindeberg,
2013; Marr & Ullman,
1981; Martens,
1990; Young,
1987; Young & Lesperance,
2001). In particular, it has been argued that they can be approximated in a plausible fashion as differenceofoffset Gaussian filters (Young,
1987; Young & Lesperance,
2001).
The second important type of computation in the model is the nonlinear extraction of curvature. Selectivity to curvature is a wellknown basic selectivity of cortical neurons (Dobbins, Zucker, & Cynader,
1987; Zetzsche & Barth,
1990a,
1990b). In addition to the aforementioned linear filtering operations, this requires nonlinear ANDlike multiplication of two signals. One possibility is that this could be directly realized in the dendritic tree of a neuron (Koch & Segev,
2000; Mel,
1993). Alternatively, combinations of subunits could be used to realize the Babylonian trick and compute the product by the sum of squared spatialfilter outputs (Adelson & Bergen,
1985; Resnikoff & Wells,
2015; Zetzsche & Barth,
1990a)—i.e.,
ab = (1/4)[(
a +
b)
^{2} − (
a −
b)
^{2}]. Finally, it can be shown that the nonlinear mechanism of cortical gain control can be combined with subsequent nonlinear transducer functions to realize an ANDlike interaction (Zetzsche & Nuding,
2005). It has thus been argued that curvatureselective operators can be realized without problems by the available cortical hardware (Zetzsche & Barth,
1990a,
1990b). Furthermore, it has been shown that certain forms of end stopping, of the hypercomplex property, and of extraclassical receptivefield properties show a close relation to the computation of curvature (Zetzsche & Barth,
1990a,
1990b; Zetzsche & Nuding,
2005; Zetzsche & Roehrbein,
2001). Even evidence for the existence of an isotropic variant is provided by socalled dotresponsive cells (Saito, Tanaka, Fukada, & Oyamada,
1988). For this, the involved products must be summed (blue block “GCSum”), but this is an undisputed ability of neurons.
The output of the basic curvatureselective computation (the numerator) is fed into a divisive operation (GCNorm). That neurons can implement such an operation is shown by the established role of the divisive normalization mechanism in models of the visual cortex (Abbott, Varela, Sen, & Nelson,
1997; Carandini & Heeger,
2012; Geisler & Albrecht,
1992; Heeger,
1992; Schwartz & Simoncelli,
2001). It has even been argued that such an operation should be regarded as a “canonical” neural computation (Carandini & Heeger,
2012). The gaincontrol layer, which is illustrated by the rightmost red box in
Figure 5, results directly in an estimate for the required curvature quantities
K̃(
x,
y) and κ̃
_{g}(
x,
y).
Although it can be assumed that these features are computed throughout the visual field, the implementation of
Equation 9 requires a selective gating before feeding them into the global summation stage. The first part of this process is the determination of a level for the boundary curve. For simplicity, we have here assumed that this is achieved by selecting the 50% level of the maximum luminance value. Neural computation of the maximum is a wellknown principle discussed in the context of winnertakeall networks (Kaski & Kohonen,
1994; Mead & Ismail,
2012) electing the boundary which determines the region where the surface curvature operations setting the value for the Maximum operations are also routinely used in models of the visual cortex (Lampl, Ferster, Poggio, & Riesenhuber,
2004; Serre, Wolf, & Poggio,
2005). However, we do not put special weight on the use of this principle since, for the successful application of
Equation 9, the specific method used to determine a boundary curve plays no crucial role. It thus might as well be based on some other mechanism, e.g., on the determination of some equilibrium level (Dayan & Abbott,
2001; Grossberg,
1988). Once the boundary is selected, what remains to be achieved is the gating operation—i.e., which values are passed to the global integration mechanism must be controlled. This type of operation is a special case of the general principle of neural gain modulation, which is an essential neural mechanism in sensorimotor processing (Salinas & Thier,
2000) and is also used in the visual cortex (Reynolds & Chelazzi,
2004).

The curvature operators K̃ and κ̃_{g} depend only on derivatives of the luminance function which can be realized by neurophysiologically realistic orientationselective filters.

The nonlinear multiplicative ANDlike combination of these features can be computed by a variety of plausible neural mechanisms.

The ratio operations are closely related to the wellestablished principles of cortical divisive normalization and gain control.

The outputs of the curvature operators are globally integrated in a gated fashion, which is also an established neural principle, known as gain modulation.
The computation of
Equation 9 can thus be achieved by the available neural hardware in the visual cortex.
Basic properties
The model's behavior for stimuli commonly used in vision research is illustrated in
Figure 6. More complex stimuli are shown in
Figure 7. Variations of elements and configurations for which the model has been tested include cumulative area, morphing, convex–concave properties, luminance, mixtures of element sizes, shape variations, and texture properties. The full test set is much larger than shown here and is described in the
Appendix.
The model shows a strong invariance with respect to the cumulative area of objects (
Figure 6a). Furthermore, it is also invariant to changes in object shape from convex to concave (
Figure 6c).
Figure 6b shows a circle morphed into two circles. The model only shows a strong change in response when the two circles are not connected anymore (bottom stimulus in
Figure 6b; the slight change from the top stimulus to the middle one is due to numerical instability at the connecting points of the contours). This selectivity with respect to numerosity is one of the most important properties of the model, and it is also present for larger numbers of objects (
Figure 6e,
f).
By definition, the proposed model is invariant to contrast to some degree. The threshold, which is defined with respect to maximum contrast, causes the invariance illustrated in
Figure 6d. The slight changes are caused by numerical differences in the curvature computation, which depends on the absolute level of luminance. Contrast invariance is also present for multiple objects, as can be seen in the top and middle stimuli of
Figure 7f. The bottom stimulus shows a configuration of objects with different contrast. In this case the threshold to determine the integration domain was chosen to be smaller, i.e., 30% of maximum luminance (denoted by
N_{t}). The invariance with respect to contrast is preserved, as the model output changes only slightly. Note that the model can only be contrast invariant to some degree. The lower the threshold, the higher the influence of the standard deviation
σ of the Gaussian filter. This means that the minimum spatial distance between two objects that is necessary to distinguish them in terms of numerosity is increased. One should find a systematic underestimation in the numerosity estimation when using stimuli which include a mixture of low and highcontrast objects spatially placed with the critical distance for constant contrast. For further illustrations of illuminated 3D objects and thresholddependent responses, we refer the reader to Kluth and Zetzsche (
2014).
We also tested the model on more complex stimuli, including various numerosities of convex and concave objects with different luminance patterns (constant, sinusoidal gratings, and texture; compare
Figure 7a through
e). In all cases with piecewise constant luminance patterns, the model (
N =
N_{2}), which uses the parameters that are also used for further investigations within this article, works well and yields the desired selectivity and invariance properties. In a few cases the model does not work optimally. In particular, the introduction of sinusoidal gratings (compare
Figure 7a [middle], b [bottom], and d) or texture (compare
Figure 7e) on the objects results in larger deviations from the expected model output. Such more sophisticated luminance patterns have a larger spatial area, for which a certain degree of regularity is required for differentiation. If the regularity is not given in this area, the size of the area determines the amount of error in the computation of the differentials. We tested the influence of this error by increasing the standard deviation of the Gaussian filter to increase the regularity of the differentiated signal. By using twice the standard deviation (
N_{4};
σ = 4Δ
x) as before (
N =
N_{2};
σ = 2Δ
x), we make the model (
N_{4}) a perfect enumerator (deviations smaller than 0.02) for constant as well as sinusoidal luminance patterns, as can be seen in all cases of
Figure 7a through
d. The deviations in the texture case are slightly higher, which is due to the higher degree of regularity required on the objects. Further increasing the standard deviation solves this issue, but could theoretically cause a spatial pooling of close objects or an exclusion of small objects. For
σ = 6Δ
x and a threshold of 30% maximum luminance, the outputs for the middle and bottom stimuli of
Figure 7e become 4.0145 and 7.0336. There is no indication for a spatial pooling effect in this case, and the model output is improved by using these parameters.
That connectedness matters has also been shown by Franconeri et al. (
2009) and He, Zhang, et al. (
2009). In
Figure 8a, the kinds of stimuli used in this study and the corresponding model responses are illustrated. A line connecting the circles strongly affects the response of the model such that the response is reduced with every inserted line from the top to the bottom stimulus. This implies that the perceived numerosity should be underestimated by the introduction of connecting lines. This is in line with the findings reported by He, Zhang, et al. (
2009). The stimuli are compared to stimuli with a larger image size (compare
Figure 8a through
c). The relative size of the circles is constant for all stimuli, but the relative width of the connecting lines varies from
Figure 8a to
c. In each column we can observe an approximately constant difference in response from top to bottom, which illustrates the desired selectivity of numerosity estimation with respect to the number of objects present in the stimulus. The model responses are constant in each line, which is due to the increase of
σ with respect to the image size. As a result, numerical errors in the gradient computation are sufficiently small. But the increase of
σ also implies that the sensitivity to the connecting lines emerges up to a certain line width only. The thinner the connecting lines, the less influence they have.
Simulation experiments on Weber fraction
In the following we extend the proposed model such that noise is taken into account. We modeled additive normally distributed noise
η_{in} ∼
N(0,
σ_{in}) at the input and neural noise
η_{lin} ∼
N(0,
σ_{lin}) at the linear filter outputs. Both curvature operators
K̃ and κ̃
_{g} are influenced by both types of noise. The structure of the curvature operators as products of noiseaffected variables is a prototypical configuration for yielding a lognormally distributed quantity (Buzsáki & Mizuseki,
2014). This noise behavior has been reported in several studies regarding approximatenumber estimation.
Figure 5 illustrates the way noise influences the system which computes the Gaussian and the geodesic curvature. The digital quantity described by the perfect model thus becomes an analog quantity from which the desired information must be extracted. In the following, we focus on two different standard tasks to analyze the proposed model and to compare its performance with behavioral results from experiments with humans. The first task is discrimination whether it is a specific number or not (same–different), and the second is deciding whether a stimulus is larger than a reference stimulus (larger–smaller; Piazza et al.,
2004).
All investigations are based on four synthetic data sets of binary images generated in a similar fashion as that described by Stoianov and Zorzi (
2012). The 100 × 100–pixel binary images contained up to 32 randomly placed nonoverlapping objects with variable surface area and shape. The cumulative area of the objects was varied from 288 to 2,304 pixels with a step of 288, resulting in eight levels of cumulative area. For each tuple of number and cumulative area, 200 images were computed. The data sets thus consist of 51,200 100 × 100 binary images, with 32 different numbers and eight levels of cumulative area. The data sets are distinguished by the kind of shape of the elements and whether they are for training or testing. One data set (R) has rectangular objects, as used for the analysis by Stoianov and Zorzi (
2012), and the other ones consist of rectangular objects with smoothed corners (RC) and of circular objects (C), which are commonly used in behavioral experiments. All are used as test sets. We generated a fourth dataset (TR) consisting of rectangles to obtain the parameters for the optimal estimators. In addition, data sets with concave objects (C1, C2, C3) and randomly shaped objects (RA, RA200, RA400) were evaluated. More details and sample images of the data sets can be found in the
Appendix.
We trained optimal estimators tuned to a specific numerosity
N_{tuned} and a fixed task. Detailed information about the optimal estimators can be found in the
Appendix. See
Table A4 in the
Appendix for an overview of the trained parameters given the noise levels which are motivated later. The joint relative frequencies of the true numerosities and the response of the estimator were obtained from three distinct data sets (rectangles [R], smoothed corners [RC], circles [C]). We then fitted a continuous function to the conditional relative frequencies of the estimator's response given the true numerosity to obtain the internal Weber fraction. Further technical details can be found in the
Appendix.
Before we start considering noise in the system, we investigate the output of the noisefree model on each data set. The mean model output and the corresponding standard deviation for each true number of objects within the stimulus are illustrated in
Figure 9. All data sets have in common that the model predictions are close to identity (black curve) and have very small standard deviations for small numbers. In this case, the mathematical model is approximated well by the discretization. Only for large numbers and for data set C does the standard deviation increase. (The possible reasons are discussed in detail in the
Appendix.)
For the noiseaffected model, the remaining free parameters to control the system behavior are the standard deviations
σ_{in} and
σ_{lin} of the additive noise in the system. In order to make reasonable assumptions on the noise parameters, we analyzed the data sets with respect to their signaltonoise ratios (SNR) at different stages of the architecture (see
Table A1). The technical details can be found in the
Appendix. The SNR values at the input stage plotted against the noise parameter
σ_{in} can be found in
Figure 10a. Analogously, the SNR curves for the derivatives of the luminance function with respect to
σ_{lin} are illustrated in
Figure 10b. For further consideration with respect to the information encoded by a single unit, the mean value of the curves was computed. Note that the mean curve and the curve for luminance
l cover each other in
Figure 10a. The same holds true for the pairs
l_{x}/
l_{y} and
l_{xx}/
l_{yy} in
Figure 10b.
Figure 10c shows the rate required to encode the signals at the different stages with the corresponding SNR without loss of signal quality. Details about the relation between SNR and the rate can be found in the
Appendix. This rate can now be related to findings in the literature regarding single neurons and their encoding rate of transmitted signals. The transmitted information is reported with a range from 5 to 30 bits per second for a single neuron (Reich, Mechler, Purpura, & Victor,
2000). Assuming a mean firingrate code and a time window of approximately 200 ms we get a rate interval of 1–6 bits, which can be compared with the rates in
Figure 10c (gray window). At the linear filter we thus chose the standard deviation which corresponds to half the maximum of the reported rate interval—i.e.,
σ_{lin} = 2. At the input stage we chose the worst case,
σ_{in} = 0.4. These two cases are highlighted with red circles in
Figure 10. These parameters were used to simulate the model behavior in both numerosity judgment tasks. Note that the mean peak SNR of the first layer is approximately 40 dB higher than the respective SNR (compare
Table A1). A higher rate of approximately 6–7 bits is required to guarantee the same signal quality for the whole signal amplitude. Multiple units encoding the same quantity could be used to overcome this problem.
The results of the rectangle data set (R) for the specified noiseparameter configuration are shown in
Figure 11, and the mean model outputs for the noise case can be found in
Figure A9. We find that in both tasks, the performance of the artificial estimators shows very similar behavior to a human estimator (Piazza et al.,
2004). On the linear axis shown in the first row of
Figure 11, an increase of the variance with increasing tuned numerosity can be seen in the same–different task and a decrease in steepness with increase of tuned numerosity can be observed in the larger–smaller task. If we consider the same data on a logarithmic scale of the true numerosity, the effects are not observable anymore, as can be seen in the second row. The logarithmic representation with constant uncertainty is one possible representation (Feigenson et al.,
2004) for which neural evidence has been found (Nieder & Miller,
2003). The third row shows the same data plotted on a logarithmic scale of the ratio between the true numerosity and the tuned numerosity of the estimator. For all tuned numerosities, the curves are nearly identical, which implies a behavior relying on the Weber–Fechner law. All data points on the logarithmic scale of the ratio were used to fit one continuous function (black curve). The tuning curves were used to obtain the internal Weber fraction
w, an index to describe the discriminability between two numbers. In both tasks we find internal Weber fractions in the dimension of human behavior: 0.167 in the same–different task and 0.169 in the larger–smaller task on the rectangle data set (R). In comparison, Piazza et al. (
2004) reported corresponding Weber fractions of 0.17 and 0.174.
In summary, the computational results agree very well with behavioral results (Halberda et al.,
2008; Piazza et al.,
2004; Piazza et al.,
2010) and can compete with recent computational results of a neuralnetwork model (Stoianov & Zorzi,
2012;
w = 0.15 in larger–smaller task). Furthermore, the proposed model is able to reproduce the desired behavioral results on data sets covering a broader range of object shapes, as can be seen in
Table 1. Only random shapes for a small image size appear somewhat problematic (for an explanation, see the
Appendix). All other shapes, from rectangular to circular objects, different kinds of concave objects, and even larger randomly shaped objects, result in quite similar Weber fractions.
Table 1 Parameters of the continuous datafitting functions for concave data sets. Notes: In all cases the noise levels were σ_{in} = 0.4 and σ_{lin} = 2. The parameters of the optimal estimators were obtained from data set TR.
Table 1 Parameters of the continuous datafitting functions for concave data sets. Notes: In all cases the noise levels were σ_{in} = 0.4 and σ_{lin} = 2. The parameters of the optimal estimators were obtained from data set TR.
Discussion
In this article, we proposed a general mathematical approach to the computation of numerosity. We suggested formalizing the problem and its solution by considering numerosity as a mathematical invariant. We then developed a model which takes into account the restrictions imposed by the structure of the visual system and known neurophysiology. The proposed model was applied to visual stimuli and finally validated in computational experiments.
We first considered the problem on an abstract level, with the goal of providing a formal mathematical solution without having to consider the specific constraints imposed by the human visual system and the neural computations of the visual cortex. For this, we formalized it as the problem of an assignment of the property number to a configuration
c of objects in 3D space ℜ
^{3} (
Figure 1). We then argued that number specified in this sense has to be regarded as a mathematical invariant (
Figure 2). This simple fact is in our view a central contribution of this article: The understanding and modeling of numerosity should be considered as an investigation of the specific invariance properties that have to be provided by the perceptual processes and neural computations.
The special nature of this invariant—numerosity does not depend on the shape of the objects involved—suggested that we consider the problem in the context of topology. Since the crucial property that constitutes an object is the fact that it occupies a
connected region of space, the appropriate formal invariant turned out to be the zeroth Betti number, which counts the number of connected components of a topological space. It should be noted that this invariant represents the
ideal solution to the problem of numerosity, and should hence be taken into account in future investigations, both in experimental and in model research. However, if we consider its computational properties, it turns out that the Betti number is not compatible with the established properties of numerosity perception: The biological numerosity system acts quite fast (Park et al.,
2015; Roitman et al.,
2007) and is based on parallel computations, since there exists no hint of a dependence of processing time on the number of elements in the set (Mandler & Shebo,
1982). Connectedcomponentlabeling algorithms from digital topology are able to estimate the number of components as a byproduct of the labeling process, but these algorithms are inherently sequential and require multiple passes through the image (e.g., He, Chao, & Suzuki,
2007; Suzuki, Horiba, & Sugie,
2003).
This led us to consider an alternative approach that is suited for parallel implementation. It turned out that such a solution can be achieved if we are willing to accept slightly more idealized conditions regarding the definition of what constitutes an object. If we consider objects as simply connected entities in topological terms, a solution can be found which is fully parallel and also has a high degree of neurophysiological plausibility, since the required invariant then becomes equivalent to the Euler characteristic χ(C). This in turn is computable via the famous Gauss–Bonnet theorem by measuring in a local, pointwise fashion the surface curvature of the objects in the set. By this we have specified numerosity as an invariant that can be computed in parallel by making use of local mathematical operations (derivatives).
Up to here, our analysis was abstracted from the specific conditions of the visual system insofar as the objects and their surfaces are assumed to be defined in 3D space. If we assume that the visual system is able to create some sort of 3D representation of the environment (even a 2½D representation may suffice; Marr,
1982), the problem of numerosity can indeed be solved by direct application of Gauss–Bonnet to this representation. However, we do not want to be logically dependent on this specific assumption, since there also exist theories which assume that the handling of 3D information in the early stages of the visual system (e.g., for 3D object recognition) is based on a representation which consists of multiple views of which each is only a 2D image (Wallis & Bülthoff,
1999).
We hence considered alternatives for how Gauss–Bonnet can be applied directly to the processing of 2D images (optical projections of the 3D configurations to 2D retinal images). It turned out that this is basically possible by interpreting the 2D luminance function as a surface (
Figure 4). It can be shown that this leads to three further model variants that represent different strategies for dealing with deviations from the ideal conditions of the theory (Kluth & Zetzsche,
2014). In this article we use the specific variant of the model which takes the deviations along the object borders explicitly into account by a second term (compare
Equation 4). It can be assumed that this yields the most stable model version, because it combines the robust largearea integration of the surface curvature with a correction term.
Basing our analysis largely on formal mathematical considerations raises the immediate question of empirical support. Regarding neurophysiological plausibility, we have shown that the model can be realized by the neural hardware believed to be available in the visual cortex. The local core operations are first and secondorder derivatives which correspond to oriented mechanisms with odd and even symmetry. It is well known that Gaussian derivatives are well suited for the description of neurons in the primary visual cortex (e.g., Koenderink & van Doorn,
1990; Lindeberg,
2013; Marr & Ullman,
1981; Martens,
1990; Young,
1987; Young & Lesperance,
2001). The computation of the local curvature terms requires an ANDlike multiplication, which can be provided by a number of neurophysiologically plausible mechanisms (Adelson & Bergen,
1985; Koch & Segev,
2000; Mel,
1993; Resnikoff & Wells,
2015; Zetzsche & Barth,
1990a; Zetzsche & Nuding,
2005). Furthermore, it has been shown that certain forms of end stopping, of the hypercomplex property, and of extraclassical receptivefield properties show a close relation to the computation of curvature (Zetzsche & Barth,
1990a,
1990b; Zetzsche & Nuding,
2005; Zetzsche & Roehrbein,
2001). The computation of the curvature terms further requires a ratio operation. This is very similar to the mechanism of cortical gain control which is available at early stages of the visual cortex and is regarded as a “canonical” neural computation of the cortex, which exists in various versions (Carandini & Heeger,
2012).
Essential questions regarding the proposed model are whether it leads to clear testable predictions and how it compares to other existing models. We start with the comparison to other models, which can be considered imageprocessing models. This criterion can range from models which process only binary images to full imageprocessing models, which accept an arbitrary graylevel image as input and compute the corresponding numerosity. As mentioned before, to our knowledge the first numerosity model of this type was suggested by
Zetzsche and Barth (1990b). This model can be regarded as a simplified version of the present one which represents shapes as polygons with ±90° corners and straight segments aligned to the Cartesian grid. Aside from this it is based on the same invariance principle and will lead to similar predictions, such that we will not further consider it as a separate model in the following discussion. The classic model by Dehaene and Changeux (
1993) can be considered as a restricted version of a full imageprocessing model, since it is only suited for 1D images. Of greatest interest for the comparison are two more recent models which match the criteria. These are the “contrast energy model” (Dakin et al.,
2011; Morgan, Raphael, Tibber, & Dakin,
2014) and the neural network of Stoianov and Zorzi (
2012). The latter one exists in two versions, as a full neural network and as an abstracted mathematical model (Cappelletti et al.,
2014; Stoianov & Zorzi,
2012), which will be considered the “network model” in the following comparison.
The model of Dehaene and Changeux (
1993), henceforth designated the “normalization model,” differs from the two other models and from our one not only in being only onedimensional but also with respect to the invariance principle being used. In the normalization model the invariance is achieved not in an implicit fashion by the distributed computation of local features but with an explicit normalization stage in which input objects are mapped to a standard representation that no longer varies in dependence on the shape properties of the input objects. This is achieved by regarding each object as a blob which is represented by a dedicated “blob detector.” This principle can generate a certain size invariance but cannot cope with the different shape variations which become only fully apparent in the 2D case, for example in the form of elongated linelike elements. A single blobmatching system will not be able to bring into one standardized form such different element types as lines, discs, and nonconvex shapes—which can additionally be arranged in quite different spatial patterns—without interference between the invariance and the numbering properties. The existence of a blobmatching stage as opposed to the spatially distributed computation of local features in the contrast energy model, the network model, and our model is a structural property which should also be testable on the neural level. While the model of Dehaene and Changeux (
1993) predicts the existence of local units which represent a single object in an invariant fashion, the invariance in the other models is an emergent property which will only become apparent in the final spatial summation stage, where the contributions from several objects are combined.
A comparison of these models can be based on two methods: consideration of the underlying invariance principle and direct comparison of the detailed computations. In the following we will make use of the first approach and consider how far the contrast energy model and the network model make use of alternative invariance mechanisms to the one proposed here. The second approach is described on a detailed level in the
Appendix.
Two basic shape properties seem somehow to be involved in the computations of the two models: the boundary of objects and their area. In the network model, the cumulative area of all elements is seen to play a crucial role as a covarying factor (Stoianov & Zorzi,
2012); and for the contrast energy model, the contour length of the elements is explicitly mentioned as one crucial factor in its computations (Morgan et al.,
2014). On a closer look, the contrast energy model exists in two variants. In the first variant, the invariance is basically attributed to the highfrequency filtering stage, and the lowfrequency filters are only considered for handling a moderate bias from the size of the stimulus configuration (Dakin et al.,
2011). In this interpretation, the contours of the elements are suggested as a crucial factor in the computation, since the aggregated contour length is directly proportional to the number of objects (Morgan et al.,
2014). However, without an additional compensation mechanism this solution would predict a strong influence of element size, since the size would directly influence the aggregated contour length. Deviating from the interpretation of Dakin et al. (
2011) and Morgan et al. (
2014), we will thus consider a second interpretation of the model, in which the lowfrequency filters could contribute to size invariance by computing a cumulative area estimate. This is of special interest because the invariance properties of the network model are also suggested to be based on a tradeoff term which computes cumulative area (Stoianov & Zorzi,
2012).
Is it possible to compute the desired invariant by making use of contour length and area? It is a wellknown fact that the ratio
with
A being the area of an object and
L its contour length, is a geometric invariant with respect to similarity transformation (which includes size changes as well as position and orientation changes). We can thus use this invariance property of the ratio to compute the number of objects of a configuration based on the aggregated area Σ
A_{i} and the aggregated contour length Σ
L_{i} as
where all objects have the same arbitrary area
A_{i} =
A and share the same characteristic
q. This means that we can scale the elements of a configuration by an arbitrary common factor, and we can arbitrarily rotate and shift the individual elements without causing any influence on the number estimate (compare
Figure 2).
However, although the ratio
q is a geometric invariant in the sense described earlier, it is not the full
topological invariant desired for the computation of number, since it still has a systematic dependency on the
shape of the elements. It has long been known that the normalized variant of
q, the socalled isoperimetric quotient
Q = 4
πA/
L^{2} is a measure of the
compactness of a shape (Kesavan,
2006). This measure attains its maximum of 1 in the case of circles but can easily get much smaller if the object becomes more ragged. For specific patterns, like the wellknown Koch snowflake,
Q can even approach 0. But also for commonplace patterns
q is often much smaller than 1. For Ulike shapes, for example,
Q is around 1/3, which would imply that three Ushaped elements should appear perceptually as numerous as nine disks.
It is not directly evident that the two models are exactly aimed at the computation of this ratio invariance, since only one of its parts is emphasized in each description of the models. But in the contrast energy model the lowfrequency filtering can be seen to bear a certain resemblance to arearelated computations. And in the network model, a contourrelated feature can be seen to be implicitly computed, as explained in detail in the common framework analysis in the
Appendix. However, we can draw the following general conclusion regarding all models that try to somehow make use of a tradeoff between contour length and area to compute the number of objects: If the tradeoff computations in these models deviate from the ratio invariant, they will lose the size invariance. If they manage to approach the ratio invariant, however, they will be subject to a substantial dependency on the shape of the elements. If we want to allow for arbitrary object shapes, making use of the ratio invariant thus seems not to be a viable solution.
This also becomes apparent if we use the second approach for model comparison, consideration of the local features which are extracted by the different models. A detailed analysis can be found in the
Appendix. It provides information about where the two other models differ from our approach, and this can also help in understanding how the different invariance properties are generated. The simplest case for understanding these effects is provided by the processing of binary images. For binary objects the interior region is completely flat, and the Gauss–Bonnet theorem tells us that no contribution should come from these interior points. Hence the only nonvanishing contributions should come from the object boundaries (compare the
Appendix). In this sense it is a reasonable strategy to use only the contour region for the computation, as in the contrast energy model or, in an implicit fashion, in the network model (see the
Appendix). However, it is crucial that not all contour points contribute in the same fashion to the final spatially integrated numerosity variable. Rather, the contribution should depend on the
curvature of the boundary. In particular, there should be a zero contribution from straight contour parts. That the contribution from straight or lowcurvature contours is not appropriately reduced by the contrast energy model and the network model is, in our view, the essential reason that systematic deviations from the ideal invariance properties have to be expected for these models.
In conclusion, models which are based on some tradeoff between area and contour will show a systematic dependency on shape, element size, or both, and the same is true for models which rely only on contour features but provide no appropriate curvaturedependent weighting.
It should be noted, however, that this does not necessarily imply that these models are not suited as models of human numerosity perception. In particular, if they do not realize the full size invariance but instead use some compromise which combines a medium size dependency with a medium shape dependency, it remains to be tested whether perceived numerosity does not exhibit the same form of deviation from perfect invariance. It is well known that the human numerosity system shows a systematic dependency on nonnumerical parameters like the size of the elements (e.g., DeWind, Adams, Platt, & Brannon,
2015; Hurewitz, Gelman, & Schnitzer,
2006; Ross,
2003). However, it has been argued that many studies focus only on finding a statistically significant effect instead of quantifying the exact
quantitative relation between the parameter and the numerosity bias, and that they rather interpret the influence of the parameter as an estimation error (DeWind et al.,
2015). If the quantitative influence of nonnumerical cues is instead explicitly modeled, the results suggest that the quantitative effect of the cues is relatively small for most individuals (DeWind et al.,
2015).
Nevertheless, the considerations from earlier suggest that systematic quantitative measurements of the influence of nonnumerical parameters on perceived numerosity are required in order to draw further conclusions about the different models. We would assume that the explicit consideration of invariance mechanisms and invariance properties, as exemplified in the present analysis, could be one valuable strategy for such a systematic analysis. It is clear, however, that the prediction from our model on the outcome of such investigations would be that the quantitative deviations from invariance should always be relatively small.
Empirical tests of the invariance properties are in our view also essential with respect to the debate about a “true sense of number” as opposed to a texturedensitybased mechanism (Dakin et al.,
2011). It is clear that we should expect to find strong interrelations between density, numerosity, and cumulative area, since formally density and number are completely equivalent because they can be transformed into each other via the area. However, it would be justified in our view to regard numerosity perception as just a byproduct of texture processing if it is derived from some texturerelated processing, like bandpass filtering, and if the quantitative deviations from invariance are predicted by the properties of this mechanism. If, on the other hand, the invariance properties should turn out to be more compatible with our sort of approach, then, irrespective of whether the crucial variables in the system would covary directly or inversely with number, we would consider it adequate to speak of a sense of number.
In this context it is also of interest to consider a major argument for a direct visual sense for number, which is given by the fact that number seems to be a primary visual property like color, orientation, or motion to which the visual system can be adapted by prolonged viewing (Ross & Burr,
2010). If we consider the invariance mechanism of our model, the only stage where adaptation would selectively influence only the number property would be the stage of the final summation units. Adaptation at this stage would cause a systematic influence even if we later test with entirely different elements and spatial configurations as the ones being used during adaptation. However, there should also arise systematic influences if we adapt the basic curvature features (compare, e.g., Bell, Gheorghiu, & Kingdom,
2009; Blakemore & Over,
1974; Hancock & Peirce,
2008). If we adapt to patterns with high local curvatures, for example, the contribution of the curvature mechanisms to the integral should be reduced such that the resulting numerosity perceived in later tests should also be reduced.
The very nature of our approach leads to further testable predictions. The most prominent prediction, which applies to the basic model and all its variants, is based on its intimate relation to topology and in particular to the topological concept of connectivity: On the one hand, many possible nonnumeric changes of a configuration of objects are predicted to have only a small influence, even if they are quite dramatic with respect to basic signallevel properties. These are, for example, drastic changes of the sizes of the objects, or substantial alterations of their shapes (e.g., from a thin, elongated shape to a compact, round one). On the other hand, changes of the topology, and in particular changes of connectivity, should have a strong influence, even if the corresponding signallevel changes are very small. An example of this would be the connection of two big blobs by a thin line, where the line width can be assumed to be represented on a substantially smaller spatial scale than the blob size (
Figure 8). Our model would predict a clear decrease of perceived numerosity in spite of the small signallevel change. In a model using area as an essential variable, the influence should be relatively weak, whereas a mechanism relying on aggregated contour length would predict an increase of perceived numerosity. The predictions of our model are supported by findings which show that visual perception is sensitive to topological quantities (Chen,
2005) and in particular that a change of topological connectivity affects visual numerosity estimation (Franconeri et al.,
2009; He, Zhang, et al.,
2009).
As a last point for empirical tests, it should be remembered that the restriction to simply connected objects has only been caused by plausibility arguments, since for the computation of the invariant zeroth Betti number no parallel algorithm is known. It would thus be of special interest to perform experiments with human subjects in which exactly this property is manipulated (e.g., by determining how the perceived numerosity is influenced by making “holes” into the objects). In this context we again mention that there exists already evidence that human perception in general is sensitive to topological quantities, and this includes a significant influence of holes on perception (Chen,
2005),
As our approach is derived from formal mathematical considerations, it seems to generate an undesired prediction: There seems to be no obvious role for errors, and in particular not for Weberlike behavior. However, it thus can be regarded as important supporting evidence that only on the basis of quite natural assumptions about noise sources, and without any explicit structural support (e.g., by logarithmic transfer functions or gaincontrol mechanisms), does the model exhibit such a Weberlike behavior. This has been confirmed by tests with an extended data set that will be made publicly available. The noisy model combined with a decisionmaking system was able to closely reproduce the numberdiscrimination abilities of human subjects: In the literature, Weber fractions of 0.174 (Piazza et al.,
2004), 0.108 (Halberda & Feigenson,
2008), and 0.15 (Piazza et al.,
2010) have been reported for adults. The presented parametrization of the noise levels results in Weber fractions between 0.150 and 0.197 for the six basic data sets.
In conclusion, we have provided a formal analysis of the problem of numerosity. The essential and most basic result is that numerosity should be regarded as a mathematical invariant. Based on concepts from topology, we have also derived a basic model structure and several specific variants of this model. Its key property is given by the Gauss–Bonnet formula, which provides the desired invariance properties by the parallel integration of local curvature measures. These in turn are based on neurophysiologically highly plausible operations: directional derivatives (oriented receptive fields), nonlinear ANDlike combinations related to extraclassical receptivefield properties, and divisive operations similar to cortical gaincontrol (normalization) mechanisms. The properties that turned up in our analysis are so basic from a mathematical point of view that it seems difficult to believe that there could be any mathematically reasonable model for the sense of number which is based on parallel computations but does not somehow relate to the invariance principles described here. It may thus be hoped that the conceptual framework suggested here can serve as a fruitful basis for future research into the basic cognitive capacity of numerosity perception.
Acknowledgments
The original motivation for this research came in 1989, when Erhardt Barth, at that time a diploma student under the supervision of CZ, returned from a talk by Chen Lin on topological perception. This work was supported by Deutsche Forschungsgemeinschaft (SFB/TR8 Spatial Cognition, project A5[ActionSpace]) and the German Federal Ministry for Economic Affairs and Energy (DLR project “CAUSE,” funding no. 50 NA 1505).
Commercial relationships: none.
Corresponding author: Christoph Zetzsche.
Email: zetzsche@informatik.unibremen.de.
Address: Cognitive Neuroinformatics, University of Bremen, Bremen, Germany.
References
Abbott
L.,
Varela
J.,
Sen
K.,
Nelson
S.
(1997,
Jan 10). Synaptic depression and cortical gain control.
Science,
275,
221–224.
Adelson
E. H.,
Bergen
J. R.
(1985).
Spatiotemporal energy models for the perception of motion.
Journal of the Optical Society of America A,
2
(2),
284–299.
Allik
J.,
Tuulmets
T.,
Vos
P. G.
(1991).
Size invariance in visual number discrimination.
Psychological Research,
53
(4),
290–295,
doi:
10.1007/BF00920482.
Anobile
G.,
Cicchini
G. M.,
Burr
D. C.
(2014).
Separate mechanisms for perception of numerosity and density.
Psychological Science,
25
(1),
265–270,
doi:
10.1177/0956797613501520.
Barth
E.,
Ferraro
M.,
Zetzsche
C.
(2001).
Global topological properties of images derived from local curvature features. In
Arcelli
C.
Cordella
L. P.
Sanniti di Baja
G.
(Eds.)
Visual form
(pp. 285–294). Berlin Heidelberg, Germany: Springer.
Bell,
J.,
Gheorghiu
E.,
Kingdom
F.
(2009).
Orientation tuning of curvature adaptation reveals both curvaturepolarityselective and nonselective mechanisms.
Journal of Vision,
9
(12):
3,
1–11,
doi:10.1167/9.12.3. [
PubMed] [
Article]
Blakemore
C.,
Over
R.
(1974).
Curvature detectors in human vision.
Perception,
3
(1),
3–7.
Bonn
C. D.,
Cantlon
J. F.
(2012).
The origins and structure of quantitative concepts.
Cognitive Neuropsychology,
29
(1–2),
149–173,
doi:
10.1080/02643294.2012.707122.
Brannon
E. M.
(2006).
The representation of numerical magnitude.
Current Opinion in Neurobiology,
16
(2),
222–229,
doi:
10.1016/j.conb.2006.03.002.
Buzsáki
G.,
Mizuseki
K.
(2014).
The logdynamic brain: How skewed distributions affect network operations.
Nature Reviews Neuroscience,
15
(4),
264–278,
doi:
10.1038/nrn3687.
Cappelletti
M.,
Didino
D.,
Stoianov
I.,
Zorzi
M.
(2014).
Number skills are maintained in healthy ageing.
Cognitive Psychology,
69,
25–45.
Carandini
M.,
Heeger
D. J.
(2012).
Normalization as a canonical neural computation.
Nature Reviews Neuroscience,
13
(1),
51–62,
doi:
10.1038/nrn3136.
Chang
C.I.
(2010).
Multiparameter receiver operating characteristic analysis for signal detection and classification.
IEEE Sensors Journal,
10
(3),
423–442,
doi:
10.1109/JSEN.2009.2038120.
Chen
L.
(2005).
The topological approach to perceptual organization.
Visual Cognition,
12
(4),
553–637,
doi:
10.1080/13506280444000256.
Dakin
S. C.,
Tibber
M. S.,
Greenwood
J. A.,
Kingdom
F. A. A.,
Morgan
M. J.
(2011).
A common visual metric for approximate number and density.
Proceedings of the National Academy of Sciences, USA,
108
(49),
19552–19557,
doi:
10.1073/pnas.1113195108.
Daugman
J. G.
(1985).
Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by twodimensional visual cortical filters.
Journal of the Optical Society of America A,
2
(7),
1160–1169,
doi:
10.1364/J0SAA.2.001160.
Dayan
P.,
Abbott
L. F.
(2001).
Theoretical neuroscience: Computational and mathematical modeling of neural systems.
Cambridge, MA:
MIT Press.
Dehaene
S.,
Changeux
J. P.
(1993).
Development of elementary numerical abilities: A neuronal model.
Journal of Cognitive Neuroscience,
5
(4),
390–407,
doi:
10.1162/jocn.1993.5.4.390.
DeWind
N. K.,
Adams
G. K.,
Platt
M. L.,
Brannon
E. M.
(2015).
Modeling the approximate number system to quantify the contribution of visual stimulus features.
Cognition,
142,
247–265,
doi:
10.1016/j.cognition.2015.05.016.
Dobbins
A.,
Zucker
S. W.,
Cynader
M. S.
(1987).
Endstopped neurons in the visual cortex as a substrate for calculating curvature.
Nature,
329
(6138),
438–441.
Durgin
F. H.
(2008).
Texture density adaptation and visual number revisited.
Current Biology,
180
(18),
R855–R856,
doi:
10.1016/j.cub.2008.07.053.
Feigenson
L.,
Dehaene
S.,
Spelke
E.
(2004).
Core systems of number.
Trends in Cognitive Sciences,
8
(7),
307–314,
doi:
10.1016/j.tics.2004.05.002.
Franconeri
S.,
Bemis
D.,
Alvarez
G.
(2009).
Number estimation relies on a set of segmented objects.
Cognition,
113
(1),
1–13.
Gallistel
C. R.,
Gelman
R.
(1992).
Preverbal and verbal counting and computation.
Cognition,
44
(1–2),
43–74,
doi:
10.1016/0010 0277(92)90050R.
Gallistel
C. R.,
Gelman
R.
(2000).
Nonverbal numerical cognition: From reals to integers.
Trends in Cognitive Sciences,
4
(2),
59–65,
doi:
10.1016/S13646613(99)014242.
Geisler
W. S.,
Albrecht
D. G.
(1992).
Cortical neurons: Isolation of contrast gain control.
Vision Research,
32
(8),
1409–1410.
Gray
R. M.,
Neuhoff
D. L.
(1998).
Quantization.
IEEE Transactions on Information Theory,
44
(6),
2325–2383.
Gross
H. J.,
Pahl
M.,
Si
A.,
Zhu
H.,
Tautz
J.,
Zhang
S.
(2009).
Numberbased visual generalisation in the honeybee.
PloS One,
4
(1),
e4263,
doi:
10.1371/journal.pone.0004263.
Grossberg
S.
(1988).
Nonlinear neural networks: Principles, mechanisms, and architectures.
Neural Networks,
1
(1),
17–61,
doi:
10.1016/08936080(88)900214.
Halberda
J.,
Feigenson
L.
(2008).
Developmental change in the acuity of the “number sense”: The approximate number system in 3, 4, 5, and 6yearolds and adults.
Developmental Psychology,
44
(5),
1457–1465,
doi:
10.1037/a0012682.
Halberda
J.,
Mazzocco
M. M.,
Feigenson
L.
(2008).
Individual differences in nonverbal number acuity correlate with maths achievement.
Nature,
455
(7213),
665–668,
doi:
10.1038/nature07246.
Hancock
S.,
Peirce
J. W.
(2008).
Selective mechanisms for simple contours revealed by compound adaptation.
Journal of Vision,
8
(7):
11,
1–10,
doi:10.1167/8.7.11. [
PubMed] [
Article]
HashemiNezhad
M.,
Lyon
D. C.
(2012).
Orientation tuning of the suppressive extraclassical surround depends on intrinsic organization of V1.
Cerebral Cortex,
22
(2),
308–326.
He
L.,
Chao
Y.,
Suzuki
K.
(2007).
A lineartime twoscan labeling algorithm.
In
IEEE international conference on image processing,
Vol. 5 (pp. V241–V244). IEEE.
He
L.,
Chao
Y.,
Suzuki
K.,
Wu
K.
(2009).
Fast connectedcomponent labeling.
Pattern Recognition,
42
(9),
1977–1987,
doi:
10.1016/j.patcog.2008.10.013.
He
L.,
Zhang
J.,
Zhou
T.,
Chen
L.
(2009).
Connectedness affects dot numerosity judgment: Implications for configural processing.
Psychonomic Bulletin & Review,
16
(3),
509–517,
doi:
10.3758/PBR.16.3.509.
Heeger
D. J.
(1992).
Normalization of cell responses in cat striate cortex.
Visual Neuroscience,
9
(2),
181–197,
doi:
10.1017/S0952523800009640.
Hegde
J.,
Felleman
D.
(2007).
Reappraising the functional implications of the primate visual anatomical hierarchy.
The Neuroscientist,
13
(5),
416–421,
doi:
10.1177/1073858407305201.
Hornik
K.,
Stinchcombe
M.,
White
H.
(1989).
Multilayer feedforward networks are universal approximators.
Neural Networks,
2
(5),
359–366,
doi:
10.1016/08936080(89)900208.
Hurewitz
F.,
Gelman
R.,
Schnitzer
B.
(2006).
Sometimes area counts more than number.
Proceedings of the National Academy of Sciences, USA,
103
(51),
19599–19604,
doi:
10.1073/pnas.0609485103.
Ito
M.,
Komatsu
H.
(2004).
Representation of angles embedded within contour stimuli in area V2 of macaque monkeys.
The Journal of Neuroscience,
24
(13),
3313–3324.
Kaski
S.,
Kohonen
T.
(1994).
Winnertakeall networks for physiological models of competitive learning.
Neural Networks,
7
(6–7),
973–984,
doi:
10.1016/S08936080(05)801546.
Kaufman
E. L.,
Lord
M. W.,
Reese
T. W.,
Volkmann
J.
(1949).
The discrimination of visual number.
The American Journal of Psychology,
62
(4),
498–525.
Kesavan
S.
(2006).
Symmetrization & applications (Vol. 3).
Singapore:
World Scientific.
Kluth
T.,
Zetzsche
C.
(2014).
Spatial numerosity: A computational model based on a topological invariant.
In
Freksa
C.
Nebel
B.
Hegarty
M.
Barkowsky
T.
(Eds.)
Spatial cognition IX (Vol. 8684,
pp.
237–252).
Heidelberg, Germany:
Springer International Publishing.
Kluth,
T.,
Zetzsche
C.
(2015).
A topological view on numerosity.
Manuscript in preparation.
Koch
C.,
Segev
I.
(2000).
The role of single neurons in information processing.
Nature Neuroscience,
3,
1171–1177.
Koenderink
J. J.,
van Doorn
A. J.
(1990).
Receptive field families.
Biological Cybernetics,
63
(4),
291–297,
doi:
10.1007/BF00203452.
Koenderink
J. J.,
van Doorn
A. J.
(2003).
Shape and shading.
In
Chalupa
L. M.
Werner
J. S.
(Eds.)
The visual neurosciences
(pp.
1090–1105).
Cambridge, MA:
MIT Press.
Kuffler,
S. W.
(1953).
Discharge patterns and functional organization of mammalian retina.
Journal of Neurophysiology,
16
(1),
37–68.
Lampl
I.,
Ferster
D.,
Poggio
T.,
Riesenhuber
M.
(2004).
Intracellular measurements of spatial integration and the max operation in complex cells of the cat primary visual cortex.
Journal of Neurophysiology,
92
(5),
2704–2713.
Lindeberg
T.
(2013).
A computational theory of visual receptive fields.
Biological Cybernetics,
107
(6),
589–635,
doi:
10.1007/s00422013–0569z.
Mandler
G.,
Shebo
B. J.
(1982).
Subitizing: An analysis of its component processes.
Journal of Experimental Psychology: General,
111
(1),
1–22,
doi:
10.1037/00963445.111.1.1.
Marr
D.
(1982).
Vision: A computational investigation into the human representation and processing of visual information.
San Francisco:
W.H. Freeman and Company.
Marr
D.,
Hildreth
E.
(1980).
Theory of edge detection.
Proceedings of the Royal Society of London B: Biological Sciences,
207
(1167),
187–217.
Marr
D.,
Ullman
S.
(1981).
Directional selectivity and its use in early visual processing.
Proceedings of the Royal Society of London B: Biological Sciences,
211
(1183),
151–180.
Martens
J.B.
(1990).
The Hermite transformtheory.
IEEE Transactions on Acoustics, Speech and Signal Processing,
38
(9),
1595–1606,
doi:
10.1109/29.60086.
Mead
C.,
Ismail
M.
(2012).
Analog VLSI implementation of neural systems.
New York: Springer:
Springer Science & Business Media.
Meck
W. H.,
Church
R. M.
(1983).
A mode control model of counting and timing processes.
Journal of Experimental Psychology: Animal Behavior Processes,
9
(3),
320–334,
doi:
10.1037/00977403.9.3.320.
Mel
B. W.
(1993).
Synaptic integration in an excitable dendritic tree.
Journal of Neurophysiology,
70
(3),
1086–1101.
Morgan
M. J.,
Raphael
S.,
Tibber
M. S.,
Dakin
S. C.
(2014).
A textureprocessing model of the visual sense of number.
Proceedings of the Royal Society B:
Biological Sciences,
281
(1790),
20141137,
doi:
10.1098/rspb.2014.1137.
Nieder
A.,
Freedman
D. J.,
Miller
E. K.
(2002,
Sept 6). Representation of the quantity of visual items in the primate prefrontal cortex.
Science,
297,
1708–1711,
doi:
10.1126/science.1072493.
Nieder
A.,
Miller
E. K.
(2003).
Coding of cognitive magnitude: Compressed scaling of numerical information in the primate prefrontal cortex.
Neuron,
37
(1),
149–157.
Palmer
S.,
Rock
I.
(1994).
Rethinking perceptual organization: The role of uniform connectedness.
Psychonomic Bulletin & Review,
1
(1),
29–55,
doi:
10.3758/BF03200760.
Park
J.,
DeWind
N. K.,
Woldorff
M. G.,
Brannon
E. M.
(2015).
Rapid and direct encoding of numerosity in the visual stream.
Cerebral Cortex,
26
(2),
748–763,
doi:
10.1093/cercor/bhv017.
Phong
B. T.
(1975).
Illumination for computer generated pictures.
Communications of the ACM,
18
(6),
311–317,
doi:
10.1145/360825.360839.
Piazza
M.,
Facoetti
A.,
Trussardi
A. N.,
Berteletti
I.,
Conte
S.,
Lucangeli
D.,
Zorzi
M.
(2010).
Developmental trajectory of number acuity reveals a severe impairment in developmental dyscalculia.
Cognition,
116
(1),
33–41,
doi:
10.1016/j.cognition.2010.03.012.
Piazza
M.,
Izard
V.,
Pinel
P.,
Le Bihan
D.,
Dehaene
S.
(2004).
Tuning curves for approximate numerosity in the human intraparietal sulcus.
Neuron,
44
(3),
547–555,
doi:
10.1016/j.neuron.2004.10.014.
Piazza
M.,
Mechelli
A.,
Butterworth
B.,
Price
C. J.
(2002).
Are subitizing and counting implemented as separate or functionally overlapping processes?
NeuroImage,
15
(2),
435–446,
doi:
10.1006/nimg.2001.0980.
Raphael
S.,
Dillenburger
B.,
Morgan
M.
(2013).
Computation of relative numerosity of circular dot textures.
Journal of Vision,
13
(2):
17,
1–11,
doi:10.1167/13.2.17. [
PubMed] [
Article]
Reich
D. S.,
Mechler
F.,
Purpura
K. P.,
Victor
J. D.
(2000).
Interspike intervals, receptive fields, and information encoding in primary visual cortex.
The Journal of Neuroscience,
20
(5),
1964–1974.
Resnikoff
H. L.,
Wells
R. O.,Jr.
(2015).
Mathematics in civilization.
Mineola, NY:
Courier Dover Publications.
Roitman
J. D.,
Brannon
E. M.,
Platt
M. L.
(2007).
Monotonic coding of numerosity in macaque lateral intraparietal area.
PLoS Biology,
5
(8),
e208,
doi:
10.1371/journal.pbio.0050208.
Ross
J.
(2003).
Visual discrimination of number without counting.
Perception,
32
(7),
867–870,
doi:
10.1068/p5029.
Ross
J.,
Burr
D. C.
(2010).
Vision senses number directly.
Journal of Vision,
10
(2):
10,
1–8,
doi:10.1167/10.2.10. [
PubMed] [
Article]
Saito
H.A.,
Tanaka
K.,
Fukada
Y.,
Oyamada
H.
(1988).
Analysis of discontinuity in visual contours in area 19 of the cat.
The Journal of Neuroscience,
8
(4),
1131–1143.
Salinas
E.,
Thier
P.
(2000).
Gain modulation: A major computational principle of the central nervous system.
Neuron,
27
(1),
15–21,
doi:
10.1016/S08966273(00)000040.
Schlick
C.
(1993).
A customizable reflectance model for everyday rendering.
In
Cohen
M.
Puech
C.
Sillion
F.
(Eds.)
Fourth Eurographics Workshop on Rendering
(pp.
73–83).
Geneva, Switzerland:
European Association for Computer Graphics.
Schwartz,
O.,
Simoncelli
E. P.
(2001).
Natural signal statistics and sensory gain control.
Nature Neuroscience,
4
(8),
819–825.
Serre
T.,
Wolf
L.,
Poggio
T.
(2005).
Object recognition with features inspired by visual cortex. In
Schmid
C.
Soatta
S.
Tomasi
C.
(Eds.) IEEE Computer Society conference on computer vision and pattern recognition, Vol. 2 (pp. 994–1000). IEEE.
Starkey,
P.,
Spelke
E. S.,
Gelman
R.
(1990).
Numerical abstraction by human infants.
Cognition,
36
(2),
97–127,
doi:
10.1016/00100277(90)90001Z.
Stoianov
I.,
Zorzi
M.
(2012).
Emergence of a “visual number sense” in hierarchical generative models.
Nature Neuroscience,
15
(2),
194–196,
doi:
10.1038/nn.2996.
Strauss
M. S.,
Curtis
L. E.
(1981).
Infant perception of numerosity.
Child Development,
52
(4),
1146–1152,
doi:
10.2307/1129500.
Suzuki
K.,
Horiba
I.,
Sugie
N.
(2003).
Lineartime connectedcomponent labeling based on sequential local operations.
Computer Vision and Image Understanding,
89
(1),
1–23,
doi:
10.1016/S10773142(02)000309.
Trick
L. M.,
Pylyshyn
Z. W.
(1994).
Why are small and large numbers enumerated differently? A limitedcapacity preattentive stage in vision.
Psychological Review,
101(1),
80–102, doi:
10.1037/0033–295X.101.1.80.
Umbaugh
S. E.
(2005).
Computer imaging: Digital image analysis and processing.
Boca Raton, FL:
CRC Press.
Verguts
T.,
Fias
W.
(2004).
Representation of number in animals and humans: A neural model.
Journal of Cognitive Neuroscience,
16
(9),
1493–1504,
doi:
10.1162/0898929042568497.
Walsh
V.
(2003).
A theory of magnitude: Common cortical metrics of time, space and quantity.
Trends in Cognitive Sciences,
7
(11),
483–488,
doi:
10.1016/j.tics.2003.09.002.
Whalen
J.,
Gallistel
C. R.,
Gelman
R.
(1999).
Nonverbal counting in humans: The psychophysics of number representation.
Psychological Science,
10
(2),
130–137,
doi:
10.1111/14679280.00120.
Young
R. A.
(1987).
The Gaussian derivative model for spatial vision: I. Retinal mechanisms.
Spatial Vision,
2
(4),
273–293,
doi:
10.1163/156856887X00222.
Young
R. A.,
Lesperance
R. M.
(2001).
The Gaussian derivative model for spatialtemporal vision: II. Cortical data.
Spatial Vision,
14
(3),
321–389,
doi:
10.1163/156856801753253591.
Zahn
C. T.,
Roskies
R. Z.
(1972).
Fourier descriptors for plane closed curves.
IEEE Transactions on Computers,
100
(3),
269–281.
Zetzsche
C.,
Barth
E.
(1990a).
Fundamental limits of linear filters in the visual processing of twodimensional signals.
Vision Research,
30
(7),
1111–1117.
Zetzsche
C.,
Barth
E.
(1990b).
Image surface predicates and the neural encoding of twodimensional signal variations. In
Rogowitz
B. E.
Allebach
J. P.
(Eds.) Human vision and electronic imaging: Models, methods, and applications, Vol. 1249 (pp. 160–177). Bellingham, WA: SPIE.
Zetzsche,
C.,
Gadzicki
K.,
Kluth
T.
(2013).
Statistical invariants of spatial form: From local and to numerosity.
In
Kutz
O.
Bhatt
M.
Borgo
S.
Santos
P.
(Eds.)
Proceedings of the Second Interdisciplinary Workshop The Shape of Things (Vol. 1007,
pp.
163–172).
Aachen, Germany: CEURWS.org.
Zetzsche,
C.,
Nuding
U.
(2005).
Nonlinear and higherorder approaches to the encoding of natural scenes.
Network: Computation in Neural Systems,
16
(2–3),
191–221.
Zetzsche
C.,
Roehrbein
F.
(2001).
Nonlinear and extraclassical receptive field properties and the statistics of natural scenes.
Network: Computation in Neural Systems,
12
(3),
331–350,
doi:
10.1080/net.12.3.331.350.
Appendix
Optimal estimator
In order to relate the internal representation of numerosity to behavioral results, we need models for decision making. These models must connect the analog quantity
n resulting from the proposed noisy model with a decision regarding the specific task. For the same–different task and the larger–smaller task, both requiring a binary decision, we used the receiver operator characteristic (ROC; Chang,
2010; Fawcett,
2006) from signaldetection theory to obtain the optimal parameters for the estimator. Each parameter setup defines one point in the ROC space, which is defined as the space spanned by the true positive rate and the false positive rate. The higher the true positive rate and simultaneously smaller the false positive rate, the better the parametrization of the classifier. This is equivalent to maximizing the area under the curve defined by the classifier in the ROC space.
In the same–different task, the classifier is defined as
where
lb defines the left bound and
d the length of the detection interval.
In the larger–smaller task, the classifier is defined as
where
t defines the threshold to distinguish between smaller and larger.
In both cases the optimal parameters of the estimator were determined in order to maximize the ROC on the data set and one fixed reference number
N_{tuned}. The parameters of the detectors can be found in
Table A4. The optimal estimators on data set TR were then used to obtain the statistics on data set R of the behavioral tasks shown as squares, triangles, and rhombuses in
Figure 11.
Fitting functions
We fitted the behavior of the estimators to a continuous function which is dependent on the internal Weber fraction
w as described by Piazza et al. (
2004). This fitting method was also used by Stoianov and Zorzi (
2012) to analyze their model such that the results are obviously comparable. However, we obtained the conditional relative frequencies of the response of the optimal estimators
D_{different} and
D_{larger} given the true numerosities
N_{true}. For the same–different task we fitted these data points to the function
where erf is the standard error function,
c controls the internal representation of the reference numerosity
N_{tuned},
δ controls the variance of the corresponding probability density function, and
w is the internal Weber fraction.
For the larger–smaller task we fitted the results of the optimal estimator to another function
with the functions and parameters as just described. In order to obtain representative fits for the numberdiscrimination ability (Piazza et al.,
2004), we only used numerosities which are close to the reference numerosity for the fit—i.e., we used all
N with the ratio
In both cases the parameters were estimated by a standard leastsquares minimization of the residual. The fitted curves obtained from data set R are illustrated as continuous lines in
Figure 11. The respective parameters can be found in
Table A3.
Noise analysis
In order to evaluate the noise behavior of the input unit and the linear filter units with respect to their neural plausibility, the peak SNR (PSNR) is derived from theory as described in the following. The input signal is a function
l : [−1, 1] × [−1, 1] → [0, 1] with amplitude
A(
l) = 1, such that the PSNR becomes
The implementation of the derivatives strongly depends on the discretization and thus on the step sizes Δ
x and Δ
y. We thus can derive the differential quotient
The upper bound for
l_{y} can be derived analogously. The second derivative in the
xdirection is bounded by
The upper bounds for
l_{yy} and
l_{xy} can be derived analogously. The linear filter units are computed by a convolution operation with a Gaussian function
g such that we need an upper bound for the filter output. Using Young's inequality, we get for
l_{x} The peaktopeak signal amplitude is then twice the maximum norm of the respective filter unit—i.e.,
The PSNR then becomes
The PSNR formulas for all signals are summarized in
Table A1.
Table A1 PSNR and SNR. Notes: The PSNR and SNR functions and values for the input and linear filter stages are shown for the theoretical setup and the three data sets TR, R, and RC.
Table A1 PSNR and SNR. Notes: The PSNR and SNR functions and values for the input and linear filter stages are shown for the theoretical setup and the three data sets TR, R, and RC.
In contrast to the PSNR, which can be derived from theory easily, we also consider the SNR of the data sets used for the analysis with respect to the Weber fraction. The SNR is determined by the expected integral of the squared signal over each data set. The SNR for the function
l and all images in data set R is given by
The values for the different signal types and the different data sets can also be found in
Table A1.
In order to relate the given SNR values with information encoded by neurons, we determine the rate of a uniform quantizer for the given SNR values. As a good approximation we can use the “6dBperbit rule” (Gray & Neuhoff,
1998), such that the rate
R is given approximately by
for a signal
l.
We went one step further and analyzed various noiseparameter combinations to identify suitable noise parameters. The resulting Weber fractions on the rectangle data set (R) are illustrated in
Figure A1. In both tasks we find a similar distribution of Weber fractions. The only difference can be observed for high noise levels in the upper right corner of the right illustrations of
Figure A1. This could be caused by numerical instability of the fitting algorithm, which relies on the resulting error in the numerosity quantity obtained from the model. However, the Weber fraction for these noise parameters is beyond the reported Weber fractions for humans (children ∼ 0.3). In both tasks the distribution in the vertical direction seems to have a discontinuity for lower linear noise
σ_{lin}. This discontinuity is a result of the noncontinuous luminance function in the data set. The standard deviation of the Gaussian filter is not sufficient to compensate the discontinuity in the stimulus signal function. The regularity required for the model computation (i.e., differentiability) is not guaranteed. This implies that the input noise stabilizes the system to a certain degree. Changing the standard deviation of the Gaussian filter should move the discontinuity in the Weber fraction distribution in
Figure A1; in particular, an increase of the standard deviation shifts this line toward the bottom of the illustration—i.e., toward a smaller
σ_{in} value.
We also highlighted a height line for a constant Weber fraction
w = 0.15 to demonstrate that various parameter combinations can result in the same fraction value. The qualitative difference between two cases (red and blue circles) is illustrated for the larger–smaller task in
Figure A2. A smaller noise parameter at the linear filters in combination with higher noise at the input has quantitatively the same overall Weber fraction, but qualitatively it does not show the same reported lognormally distributed noise behavior for all numbers (see
Figure A2a). This supports the previous suggestion that the noise at the linear filter outputs which are fed into the multiplication is the essential reason for the lognormal behavior of the model, which is in line with the literature (Buzsáki & Mizuseki,
2014).
Model comparison
On the general level, our proposed model differs from other imagebased models (Cappelletti et al.,
2014; Dakin et al.,
2011; Dehaene & Changeux,
1993; Morgan et al.,
2014; Stoianov & Zorzi,
2012) in the motivation behind its design. The central motivation for our model is the idea to find a sound mathematical basis for the provision of the invariance properties required by an ideal numerosity mechanism. As a second step, we then considered how these ideal principles could be implemented and approximated by neurobiologically plausible mechanisms. The other models have different goals or are based on a different type of reasoning. The model of Dehaene and Changeux (
1993) was an early model that tried to model the imageprocessing aspect to a certain degree. However, it is only a onedimensional model and thus cannot deal with twodimensional shapes and the associated invariance properties. We therefore will not include it in the following detailed comparison. The contrast energy model (Dakin et al.,
2011; Morgan et al.,
2014) is motivated by empirical observations in psychophysical experiments which suggested the possibility of a close connection between numerosity and density (Durgin,
2008). The actual model has then been derived from considerations of how a density mechanism can be provided by use of established filterbased texture computations (Dakin et al.,
2011; Morgan et al.,
2014). The model of Stoianov and Zorzi (
2012) is based on a deeplearning neural network. It is then abstracted by a computational analysis to a spatial filter model with point nonlinearities (Cappelletti et al.,
2014; Stoianov & Zorzi,
2012). This model is henceforth designated the network model.
To compare the other models with our model there exist in principle two different strategies. One is to find out the invariance properties which underlie the other models and to compare them to the invariance properties of our model. The other strategy is to map the models to a common framework and to compare the model components within this framework. In the model comparison provided in the
Discussion we made use of the first strategy. Here we will pursue the second strategy, the commonframework approach.
In
Table A2 the three models are split into their operations at different stages or layers. From this representation it becomes apparent that all three models use similar basic computations. They have a similar first layer, consisting of linear filter operations (derivatives and lowpass filtering) and the spatial pooling of local luminance signals into an aggregated cumulative area. The contrast energy model and the network model then use a standard nonlinear transfer function (a sigmoid function and an absolutevalue function). In contrast, our proposed model requires a nonlinear ANDlike (multiplicative) interaction. (As described in the main text, this could be realized in neural hardware by a variety of mechanisms.)
Table A2 Model comparison by operations.
Notes: The different models (left to right) are split into their basic operations (top to bottom). The operations are split into layers, with a linear operation followed by a nonlinear one (NL). The contrast energy model is by Dakin et al. (
2011); the network model is by Stoianov and Zorzi (
2012).
Table A2 Model comparison by operations.
Notes: The different models (left to right) are split into their basic operations (top to bottom). The operations are split into layers, with a linear operation followed by a nonlinear one (NL). The contrast energy model is by Dakin et al. (
2011); the network model is by Stoianov and Zorzi (
2012).
In the second layer, the contrast energy model and the network model perform a spatial integration (the first a global and the second a local integration). Our model also performs such a spatial integration, but only as the last processing step on its top layer. On the second layer, our model performs a linear combination of different local features (different terms of the curvature computation). This combination of different local features is one distinction from the other models, which both use only one type of local spatial feature (a Laplacian bandpass feature or a Gaussian lowpass feature).
The contrast energy model has a ratio operation between the two aggregated contrast energy values as its last operation. However, this operation is considered only for the incorporation of a moderate bias term for the influence of patch size (Dakin et al.,
2011), whereas the basic numerosity variable is argued to be provided by the highfrequency filter output (Dakin et al.,
2011; Morgan et al.,
2014). The network model ends with a linear weighting over the corrected Gaussian filter outputs of the second layer by a linear classifier.
In the following, we will try to compare the models with respect to their general behavior. This will require making some simplifying approximations, but we think that the basic trend will be similar to the behavior of the complete models. For the contrast energy model, the authors suggest that the core computation is the spatial pooling of the local energy from a Laplacian highfrequency filter (Dakin et al.,
2011; Morgan et al.,
2014). The main argument is to use this to measure the amount of contour, since adding more objects to an image amounts to adding more contour (Morgan et al.,
2014). For a given type of element—e.g., for squares of some fixed size—the estimate works perfectly: Changing the number of elements will proportionally change the pooled filter output. But how does this operation behave with respect to the invariance properties, i.e., if we change the shape of the elements? For some changes the influence will be relatively small, particularly if we only use concave elements, as is often done in numerosity experiments. For example, if we replace the squares with disks of the same area, the difference in contour length will be only about 10%. If we also use concave elements, differences can become larger. For the Lshaped element shown in
Figure A3.6, the difference will become as large as 113% in terms of number units (
Figure A4). The differences will become even larger if we also consider elements with different size. For example, if we compare small squares of side length
d/4 with large squares of side length
d, a set of
N large squares will appear to have the same numerosity as a set of 4
N small squares. Thus the contrast energy model makes clear predictions about systematic deviations from the ideal invariance properties. How far these are also present in human bias terms remains to be determined (see
Discussion).
The analysis of the network model is somewhat more complicated. We make the following simplifying assumptions, which we think are valid as far as systematic deviations from invariance are considered: First, the secondlayer operation and the final weighting by the classifier are both linear. The weighting of the secondlayer units at the different positions by the classifier can be expected to be similar, since these units all have basically the same status with respect to the computation of numerosity (any systematic differences between units at different positions would induce positiondependent biases). We thus assume that these two steps can be combined into one global spatial integration Σ_{(}_{m}_{,}_{n}_{)∈}_{X}O^{mn}. Here O^{ij} are the local Gaussianfilter outputs after the sigmoidpoint nonlinearity in the first layer and X is the set of indices of the discretized domain of the input image. We further assume that the aggregated luminance can also be rewritten into such a spatial integration as c = Σ_{(}_{m}_{,}_{n}_{)∈}_{X}kI^{ mn}. For this, we linearize the logarithm (the argument can assume values only between 1 and 2) as
The final sum can then be written as Σ
_{(}_{m}_{,}_{n}_{)∈}_{X}(
O^{mn} −
kI^{mn})—i.e., it is a sum over a difference image between the original image and a nonlinear lowpassfiltered version of it. This analysis suggests that the constant
k should be an absolute constant, which does not depend on further parameters, and that it should be chosen in a way to avoid contributions from the interior object area. This is necessary because otherwise there would remain a systematic dependency of the numerosity estimate on the object area. Formally,
k depends on the normalization constant
c_{max}. This is an absolute constant within an experiment, but it is not quite clear how a subject can have knowledge of it or what this implies for the relation between different types of experiments. In the following, we consider
k as an absolute constant which is chosen to produce a zero difference throughout all areas of constant luminance of the input image. We can then analyze the contributions from the remaining nonzero areas of the difference image. For a class of simple examples, patterns with straight edges and rightangled corners only (compare
Figure A3.5 through
A3.10), we know the correct solution from the Gauss–Bonnet theorem and from the Euler formula: The desired invariant can then be computed by simply summing up the number of (signed) corners as
n = ¼Σ
v_{i}, where each corner
i contributes +1 or −1. For this special case the difference image should thus ideally only generate contributions from the corners, and no contributions from the straight borders. This is also evident from considering the case of enlarging such a figure, since then the number of corners remains constant but the length of the straight contours is increased. Since the lowpass filtering smooths the corners, the network model will indeed provide the desired contribution from the corners. However, there seems also to be a nonvanishing contribution from the straight borders, such that there will be a dependency on the size as well as the shape of the elements. While the shape dependency is moderate for convex shapes, it becomes larger for concave shapes, since for those the total contour length is significantly larger (see
Figures A3.6 and
A4).
In conclusion, both the contrast energy model and the network model can be seen to make use of similar basic local features which result from some version of nonlinear bandpass filtering. This operation produces the basic local features (explicitly represented by the rectified bandpass features in the case of the contrast energy model and implicitly represented by the subtraction of the cumulative area from the lowpassfiltered features in the network model). These local features are then spatially pooled (globally in one step in the contrast energy model and in two steps—first the second Gaussian lowpass filter and then the linear classifier—in the network model). Mapped to this type of architecture, the two models can be directly compared to our model, which can also be seen as consisting of the computation of nonlinear local features and a subsequent spatial pooling. From the mathematical basis of our model we know which local parts of the input image have to play which role, if we want to achieve the desired invariance. We know, for example, that all constant image areas should not generate a systematic contribution. This enforces the mutual cancellation of the lowpass responses and the persample contribution of the cumulative area measurement in the interior object areas in the network model. If this is violated, a systematic influence of the cumulative area is unavoidable. We also know that the contributions from the contour regions (in case of binary images) should systematically vary with the contour curvature. In particular, straight contours should not generate any contribution, since otherwise there will result a systematic dependency on the total contour length and on the size of the elements. This is a problem for both the contrast energy model and the network model, since both produce nonvanishing contributions from straight contours. It should further be noted that it is generally not possible to trade off the false contributions from one class (say area) against the false contributions of the other class (contour), since there exists no shapeindependent relation between the two (see the arguments in the main text regarding the isoperimetric quotient).
Invariance properties of binary objects
Assuming the special case of binary images or objects allows for the derivation of problemspecific computational principles. Here we describe the general rules for the invariant computation of numerosity for this particular signal class. The binary setup can be interpreted from different points of view. In the following we consider two possible interpretations.
(a) We can see it as an inherently onedimensional problem. The objects are bounded twodimensional subsets of the
xyplane such that they can be represented by their onedimensional boundary curve (compare
Figure A5). In this case the onedimensional counterpart of the Gauss–Bonnet theorem is the following standard corollary in differential geometry:
Corollary 3
Let Γ be a closed, regular plane curve. Then we have
where
κ is the curvature of the plane curve and
n is an integer called the rotation index of the curve.
As the rotation index is always 1 for simply connected objects and the integral operator is linear, the integral over the disjunction of multiple boundary curves results in their number. The simple onedimensional variant of the Gauss–Bonnet theorem thus tells us that we have to integrate the curvature of the boundary curve(s). In contrast to the isoperimetric quotient, the integral over the curvature is independent of the shape. In
Figure A5a this quotient decreases from left to right, whereas the integral over the curvature remains constant. This is due to the fact that the increase of positive curvature from left to right in
Figure A5b is compensated by an additional contribution of negative curvature.
If we assume a piecewise constant boundary, the integral over the curvature becomes the discrete sum over the arcs. In particular, this setup includes the special case of objects having rightangled corners and otherwise vertical or horizontal straight boundary segments (see
Figure A3.5 through
A3.10 for examples). The corners can be divided into two classes: We have external corners, where the object area covers one quarter within a circular neighborhood. These corners have an arc of
π/2. The corners where the object area covers three quarters build the second class, the internal corners, with a contribution of −
π/2. If we divide both sides of the corollary by 2
π, we obtain a standard solution to determine the number of simply connected binary objects in computer vision (Umbaugh,
2005). The number of objects then becomes the number of external rectangular corners minus the number of internal rectangular corners divided by 4.
(b) We can consider the objects as twodimensional surfaces such that the approximate invariance requires no contribution from the interior area of the object, no contribution from straight edges, and a contribution from the remaining regions. The following approach, which matches all criteria in the binaryimage case, is motivated by the detection of the contour line of the region with the aid of the Laplace operator. If the contribution is zero on constant luminance regions and the computed quantity at the boundary allows a mapping to the curvature of the boundary curve, the integration over the whole domain should result in an estimate for numerosity. An easy way to detect the contour is to determine the zerocrossing line of the Laplacian filtered image. In order to guarantee the differentiability for the Laplace operator, the binary image must be filtered by an appropriate filter function. One standard filter function is the Gaussian function. This function is not optimal for further consideration, as its support is infinite. We assume a filter function
g with compact circular support Ω around the origin. Furthermore, the Laplace of the function
g is positive in an inner disk Ω
_{in} ⊂ Ω and negative in Ω
_{out} = Ω/Ω
_{in}. The integral of Δ
g over Ω is assumed to be zero. Thus, the Laplace of the convolution with the filter function yields zero for constant luminance regions. Furthermore, if we consider a straight edge with a length greater than or equal to the diameter of Ω, the line integral in the orthogonal (to the edge) direction of the filter output becomes zero for all boundary points where the neighborhood Ω contains the straight edge only. Thus, the integral over the whole domain of the Laplacian filter output already has two of the three desired properties: It causes no contributions at constant luminance regions and no contributions at straight edges. That the spatial integration of the filter output does not result in the desired estimate for numerosity can be seen easily by the following equation:
Note that the following solution does not depend on the Laplace operator. We can replace Δ
g by an arbitrary filter function
h which has the same characteristic behavior on Ω = Ω
_{in} ∪ Ω
_{out}. In order to obtain contributions from the curved regions, a nonlinear function
F : ℜ → ℜ is introduced. This function must be odd symmetric such that the zero contribution on constant regions and on straight edges is still preserved. The problem of the estimation of numerosity
n then becomes the problem of finding an appropriate combination of a nonlinear function
F and a filter function
h such that
Again, the success of this approach is due to its accuracy in approximating the curvature of the boundary curve. We assume that the boundary curve is a piecewise constant line with a minimum distance of
ε between vertices and that the diameter of the region Ω is chosen smaller than
ε (i.e., the radius
δ <
ε/2). In this particular case we know that integrating the linear filter output over the neighborhood with radius
δ around a vertex results in zero. At an edge, the integration domains of positive and negative contributions, which sum up to zero, are equal in size. At a vertex the size of these integration domains is no longer equal, but the overall integral remains zero. Thus a monotonic nonlinear function
F fulfilling the previously formulated constraints is sufficient to guarantee a contribution from the vertices. If this output is proportional to the arc, the integration over the whole domain results in an estimate for numerosity.
Figure A6 illustrates one choice of
h and
F which is sufficient to extract the desired information. We also applied the filter function from
Figure A6a and the modified sigmoid function to the stimuli in
Figure A7b, producing an output of 133.73, 261.58, 399.45, and 533.46, from left to right. Relative to the output of the first stimulus with one object, the responses become 1.00, 1.95, 2.98, and 3.98. This simple choice of
h and
F thus results in the desired proportionality to numerosity.
In conclusion, for binary images, approximating the curvature of the boundary curve and integrating this quantity results in an estimate for numerosity. Furthermore, the fundamental principle which allows the estimation of numerosity from binary images is the invariance property provided by the Gauss–Bonnet theorem.
Data sets
We generated different data sets using the method described by Stoianov and Zorzi (
2012) adapted to image size. The 100 × 100–pixel binary images contained up to 32 randomly placed nonoverlapping objects with variable surface area and shape. The cumulative area of the objects was varied from 288 to 2,304 pixels with a step of 288, resulting in eight levels of cumulative area. For each tuple of number and cumulative area, 200 images were computed. The data sets thus consist of 51,200 100 × 100 binary images, with 32 different numbers and eight levels of cumulative area. The data sets are distinguished by the kind of shape: rectangles (R), rectangles with smoothed corners (RC), circles (C). We generated a training data set (TR) consisting of rectangles to obtain the parameters for the optimal estimators.
Furthermore, we generated additional data sets with rectangular concave objects differing in their ratio q = A/L^{2}, where A is the area of a single object and L is the length of its contour. The cumulative area and the number are controlled as described for data sets TR, R, RC, and C. Given the ratio q_{square} = 1/16, which is the ratio of an arbitrary square, the data sets are generated with ratios q = 90% q_{square} (C1), 80% q_{square} (C2), and 70% q_{square} (C3).
We also generated data sets with randomly shaped objects by adding a randomly determined linear combination of sinus functions with different frequencies to the radius of circles (Zahn & Roskies,
1972). A single object
O ⊂ ℜ
^{2} around the origin with radius
r_{0} is thus defined by
where
θ is randomly chosen from the interval [0, 2
π], and
d ∈ ℕ determines the maximum frequency added to the boundary. The weights
ω_{k} are drawn randomly from a unique distribution over [0, 1]. Subsequently, they are scaled such that their sum becomes 50%
r_{0}. We generated data sets with different image sizes containing up to 32 randomly placed objects and two levels of cumulative area, 288 and 576 pixels, for images of size 100 × 100 pixels. The cumulative area was scaled with respect to image size. The data sets covered randomly shaped objects (
d = 5) in images of size 100 × 100 (RA), 200× 200 (RA200), and 400 × 400 (RA400) pixels. Sample images for all data sets are illustrated in
Figure A8. All data sets can be found online (
http://www.informatik.unibremen.de/cog_neuroinf/en/datasets).
Approximation errors
In this section we discuss different error causes in the model computation that can have an effect on the Weber fraction. This analysis is done by the model outputs in the various data sets, by the parameters of the optimal estimators, by distinction of cumulative areas, and by outliers in the overall Weber evaluation (see
Table 1).
Figures A9 and
A10 show a comparison of the outputs of the noisefree and noiseaffected models (
Figure A9: datasets TR, R, RC, C;
Figure A10: datasets C1, C2, C3, RA, RA200, RA400). This comparison also comprises the full number range up to the maximum of 32 tested in this study. For large numbers the standard deviation increases, which can have multiple reasons. First, in our data sets the mean object size decreases with a higher number of objects such that the Gaussian filter could just smooth them below the threshold. Second, with a higher number of objects the probability increases that two objects will be so close that they are counted as one. In both cases, we expect a smaller model output, which is confirmed by
Figure A9, where the mean model output (blue line) falls below identity (black line) for larger numbers of objects in all data sets. The increase of the standard deviations is in line with the increased probability that one of the previously described cases occurs for larger numbers. We can also observe larger deviations from identity for more curved objects (see top line from left to right in
Figure A9). This is due to the higher degree of regularity required for the computation of the discretized derivatives—i.e., a larger
σ is required. The introduction of noise to the system stabilizes the mean responses of the model but also causes an increase in standard deviation, as can be seen in the bottom row of
Figure A9.
As a second analysis, we compare Weber fractions regarding different reference numbers. From rectangular to circular objects (datasets R, RC, and C) we found slightly different but similar Weber fractions (see
Table A3). In the same–different task we observe better numerosity discrimination in data set RC for the numbers 8 and 16 and worse discrimination for the number 16. These differences may be explained by the optimal estimator parameters in
Table A4. The parameters for the evaluation were obtained from data set TR and describe an optimum of the ROC. Comparing these parameters with the parameters obtained from data sets R and RC should provide explanations for the observed differences. The parameters are determined by maximizing the true positive rate and minimizing the false negative rate. For number 8 in the same–different task, the parameters of R nearly match the parameters of the training set, such that the Weber fraction of R is a good baseline. Compared to the interval of the training set, the interval of RC is a subset with a smaller length. This directly implies that the used parameter setting differs from the optimum of the ROC in data set RC. As the interval is a subset, the true positive rate is at least the same or higher. This can be interpreted as a constant or a lower valley in the curve in
Figure 11c, for example. If the true positive rate is constant, a better Weber fraction would emerge only if the false positive rate is smaller. For the curve in
Figure 11c this would imply that the values next to the tuned numerosity 8 must be higher—i.e., the curve becomes steeper. If the true positive rate is higher, the false positive rate can stay constant or decrease to explain the observed effect of a smaller Weber fraction. But a smaller false positive rate is not reasonable in this setup, because the interval of the estimator is larger than the optimal interval for data set RC. This implies that in the worst case, more different stimuli are classified as the same. The most reasonable case is that the true positive rate increased and the false positive rate is constant. For number 12 the intervals are similar for all data sets, resulting in similar Weber fractions. For number 16 we find a similar parameter setup compared to number 8 (the optimal interval of RC is a subset of estimator interval TR), but the Weber fraction is worse for RC. Here it is reasonable that the false positive rate increased, resulting in a worse Weber fraction.
Table A3 Parameters of the continuous datafitting functions.
Notes: In all cases the noise levels were
σ_{in} = 0.4 and
σ_{lin} = 2. The parameters of the optimal estimators were obtained from data set TR (see
Table A4).
Table A3 Parameters of the continuous datafitting functions.
Notes: In all cases the noise levels were
σ_{in} = 0.4 and
σ_{lin} = 2. The parameters of the optimal estimators were obtained from data set TR (see
Table A4).
Table A4 Parameters of optimal estimators. Notes: In all cases the noise levels were σ_{in} = 0.4 and σ_{lin} = 2.
Table A4 Parameters of optimal estimators. Notes: In all cases the noise levels were σ_{in} = 0.4 and σ_{lin} = 2.
In the larger–smaller task the same arguments hold true for number 8, but numbers 12 and 16 are more complicated. The threshold of RC does not differ much from the estimator threshold (TR) for number 16, but the difference in the Weber fraction is quite high. In this task the similar threshold also implies a constant or higher true positive rate, but here the distinction is done between two sets both consisting of multiple numbers. Even if the true positive rate stays constant, which is reasonable for number 16, the distribution over the whole set of larger numbers can change dramatically. This could result in a worse Weber fraction. Analogous arguments apply to data set C.
As a next step we investigated invariance to cumulative area, as can be seen in
Figure A11. The increase in Weber fraction is a result of the decreased mean distance between the single objects. Increasing the minimum distance between the objects in the data set would resolve this issue.
In
Table 1 we find strong deviation for data set RA. High frequencies were used to generate the random shape of the objects on a small image size. If the frequency is too high, the discrete approximation of the line integral in the numerosity computation causes high errors, which explain the high Weber fraction for data set RA.
We are thus able to identify the following important parameters which can cause model errors:

Regularity: The approximation of the differential operators applied to the luminance function and the boundary curve must be sufficient. This can be guaranteed by a sufficiently high σ in the Gaussian filter function. High regularity is necessary to allow arbitrary shape variations.

Minimum distance: There is a strong dependence between the minimum distance between objects and the standard deviation σ. The higher the σ value, the higher the minimum distance must be to guarantee a clear distinction between separated objects.

Resolution: The resolution can restrict the kind of stimuli with respect to combinations of cumulative area, number, and shape. Assume the resolution is small and the cumulative area is high; increasing the number increases the probability that two objects are nearby. The standard deviation σ is then a tradeoff between fulfilling regularity and satisfying minimum distance. In both cases, errors occur.
To summarize, with the resolution and simultaneously the standard deviation σ increased in combination with a controlled minimum distance between objects, the model becomes a perfect enumerator.