We compared two classes of models to subjects’ responses: RDT and Bayesian. Both model classes shared the same (optimal) decision rule but differed in how they calculated the sensory response. The Bayesian models assumed sensory responses were drawn from a von Mises (circular Gaussian) likelihood given the stimulus, while the RDT models assumed sensory responses were the outputs of an optimal lossy information channel (see
Sims, 2016). We assumed subjects had exact knowledge of how their own sensory responses were produced given a stimulus and that they had accurate knowledge of the stimulus prior distribution in the task when making a decision.
Decision rule: For both RDT and Bayesian models, the decision rule is given by
\begin{equation}
p(y_{\theta }, y_{loc}|\hat{x}) = \sum _{x} p(y_{\theta }, y_{loc}, x, \hat{x}) / p(\hat{x})
\end{equation}
where the joint distribution factorizes as \(p(y_{\theta }, y_{loc}, x, \hat{x}) = p(y_{\theta })p(y_{loc})p(x|y_{\theta },y_{loc})p(\hat{x}|x)\). Note that \(p(x|y_{\theta },y_{loc})\) is deterministic because the stimulus is fully determined by the target angle(s) and location(s).
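Writing \(x(y_{\theta },y_{loc})\) for the unique display determined by the target angle(s) and location(s) (notation introduced here only for exposition), the sum over \(x\) collapses and the decision rule can be written equivalently as
\[
p(y_{\theta }, y_{loc}|\hat{x}) = \frac{p(y_{\theta })\,p(y_{loc})\,p\big(\hat{x}\,|\,x(y_{\theta },y_{loc})\big)}{\sum _{y_{\theta }^{\prime }, y_{loc}^{\prime }} p(y_{\theta }^{\prime })\,p(y_{loc}^{\prime })\,p\big(\hat{x}\,|\,x(y_{\theta }^{\prime },y_{loc}^{\prime })\big)}.
\]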
RDT models: For the RDT models,
\(p(\hat{x}|x)\) was given by the solution to the RDT-constrained optimization problem defined above (
Equation 1). Optimal solutions were found using the RateDistortion package in R (see
Sims, 2016). The exact form of the loss function (penalizing mismatches between
\(x\) and
\(\hat{x}\)) is described below.
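As a concrete reference point, the sketch below illustrates the kind of computation involved: a basic Blahut-Arimoto iteration that returns a channel \(p(\hat{x}|x)\) for a fixed trade-off parameter \(\beta\). It is a minimal illustration written in base R, not the RateDistortion package's API; in practice \(\beta\) would be adjusted until the channel's mutual information matches the capacity \(\mathcal {C}\) in Equation 1.
\begin{verbatim}
# Minimal Blahut-Arimoto sketch (illustrative; not the RateDistortion package API).
# p_x:  prior over the discretized displays x (length K)
# loss: K x K matrix with loss[i, j] = L(x_i, xhat_j)
# beta: trade-off parameter; larger beta yields a higher-capacity channel
blahut_arimoto <- function(p_x, loss, beta, n_iter = 200) {
  K <- length(p_x)
  q_xhat  <- rep(1 / K, K)                       # marginal over channel outputs
  channel <- matrix(1 / K, nrow = K, ncol = K)   # p(xhat | x); rows index x
  for (iter in seq_len(n_iter)) {
    channel <- sweep(exp(-beta * loss), 2, q_xhat, `*`)  # reweight by output marginal
    channel <- channel / rowSums(channel)                # normalize each row
    q_xhat  <- as.vector(p_x %*% channel)                # update output marginal
  }
  channel
}
\end{verbatim}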
Bayesian models: For the Bayesian models,
\(p(\hat{x}|x)\) was given by
\begin{equation}
p(x_s|x) = \frac{\prod _{i=1}^{N} e^{\frac{1}{\sigma }\cos (x_s^{(i)} - x^{(i)})}}{ \sum _{x^{\prime }} \prod _{i=1}^{N} e^{\frac{1}{\sigma }\cos (x^{\prime (i)} - x^{(i)})}}
\end{equation}
where
\(i\) indexes over items in the display, and
\begin{equation}
\hat{x} = \mathop{\arg\min}\limits_{x^{\prime }} \mathbb {E}_{p(x|x_s)} \mathcal {L}(x, x^{\prime }).
\end{equation}
That is, in the Bayesian models, sensory measurement
\(x_s\) has a discretized von Mises distribution (again, for tractability and for consistency with the RDT models) in which noise is i.i.d. across items
\(i\), and sensory response
\(\hat{x}\) is chosen to minimize the expected loss given
\(x_s\).
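As a concrete illustration of this sensory stage, the sketch below draws a sensory measurement for a single item from the discretized von Mises distribution in Equation 4 and then chooses \(\hat{x}\) by minimizing expected loss (Equation 5). The 32-point discretization, the uniform prior over angles, and the wrapped squared-error loss used here are simplifying assumptions made for the example.
\begin{verbatim}
# Illustrative Bayesian sensory stage for a single item (assumptions: 32-point
# discretization of angle, uniform prior over angles, wrapped squared-error loss).
angles <- seq(0, 2 * pi, length.out = 33)[-33]   # discretized angle space
sigma  <- 0.5                                    # von Mises width parameter
x_true <- angles[5]                              # true item angle

# Discretized von Mises likelihood p(x_s | x) (Equation 4, single item)
lik <- exp(cos(angles - x_true) / sigma)
lik <- lik / sum(lik)

# Draw a sensory measurement, then form the posterior p(x | x_s)
x_s  <- sample(angles, size = 1, prob = lik)
post <- exp(cos(x_s - angles) / sigma)           # likelihood of each candidate x
post <- post / sum(post)                         # uniform prior => normalize directly

# Sensory response: minimize expected loss over candidate responses (Equation 5)
wrap_se  <- function(a, b) ((a - b + pi) %% (2 * pi) - pi)^2
exp_loss <- sapply(angles, function(xp) sum(post * wrap_se(angles, xp)))
x_hat    <- angles[which.min(exp_loss)]
\end{verbatim}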
One-parameter models: We first tried modeling the experimental data with simple, single-parameter models: capacity
\(\mathcal {C}\) for RDT models (
Equation 1) and
\(\sigma\) for Bayesian models (
Equation 4). The loss function for both was given by
\begin{equation}
\mathcal {L}(x, \hat{x}) = \Vert \hat{x} - x \Vert ^2.
\end{equation}
However, neither of these models provided good fits to our experimental data. Consequently, we extended the models with two additional free parameters.
Full (three-parameter) models: First, in the two-target condition, it seems plausible that subjects cognitively understood that the two targets were 180
\(^\circ\) apart but that this understanding did not influence their low-level sensory responses. In the models, we implemented this intuition by using the 180
\(^\circ\)-apart constraint in the decision-making part of a model (
Equation 3; for example, the constraint was used when calculating
\(p(y_{loc})\)). However, the full (three-parameter) models did not use this constraint in the sensory part of a model. Calculating
Equations 1 and
5 requires specifying a prior distribution over sensory displays. A “legal” display is one in which the two targets are 180
\(^\circ\) apart, and an “illegal” display violates this constraint. In the full models, the prior probability of an illegal display, \(p_{illegal}(x)\), was controlled by a free parameter \(\tau\), implemented so that if
\(\tau = 0\), then no probability mass was assigned to illegal values (corresponding to use of the 180
\(^\circ\)-apart constraint), and if
\(\tau = 1\), then the distribution over all displays (illegal and legal) was a uniform distribution.
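One simple way to implement such a prior, consistent with the two endpoint conditions above, is to give every legal display weight 1 and every illegal display weight \(\tau\) and then normalize. The weighting in the sketch below is an assumption made for illustration, not necessarily the exact parameterization used.
\begin{verbatim}
# Illustrative tau-weighted prior over displays. is_legal is a logical vector
# indicating, for each candidate display, whether its two targets are 180 degrees
# apart. tau = 0 gives illegal displays zero mass; tau = 1 gives a uniform prior.
display_prior <- function(is_legal, tau) {
  w <- ifelse(is_legal, 1, tau)
  w / sum(w)
}
\end{verbatim}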
Second, recall that subjects in our experiment indicated both the target location(s) and direction(s) of tilt on each trial. It seems plausible that subjects may have regarded either target location or tilt direction as more important than the other. In particular, our data indicated that subjects were more accurate at identifying target location. We define two loss functions, denoted
\(\mathcal {L}_{SE}\) and
\(\mathcal {L}_{loc}\), as follows:
\begin{eqnarray}
\mathcal {L}_{SE}(x, \hat{x}) & = & \frac{\Vert \hat{x} - x \Vert ^2}{\max _{x^{\prime }} \Vert x^{\prime } - x\Vert ^2}
\end{eqnarray}
\begin{eqnarray}
\mathcal {L}_{loc}(x, \hat{x}) & = & \frac{ \sum _{n=1}^{N} \mathbb {1}(x_n, \hat{x}_n)}{N_{targets}}
\end{eqnarray}
where
\(n\) indexes over target locations,
\(N_{targets}\) is the number of targets, and
\(\mathbb {1}(x_n, \hat{x}_n)\) is an indicator function that equals 1 when a subject’s response incorrectly identifies the Gabor at location
\(n\) as a target (and 0 otherwise).
\(\mathcal {L}_{SE}\) is the (normalized) squared-error loss between
\(x\) and
\(\hat{x}\), whereas
\(\mathcal {L}_{loc}\) measures error based solely on subjects’ estimates of target location. The full models used the loss function
\begin{equation}
\mathcal {L} = (1 - \alpha ) \mathcal {L}_{SE} + \alpha \mathcal {L}_{loc}
\end{equation}
where
\(\alpha\) is a parameter governing how much the loss is based on target location alone (\(\mathcal {L}_{loc}\), weighted by \(\alpha\)) versus both target location and tilt direction (\(\mathcal {L}_{SE}\), weighted by \(1-\alpha\)).
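Putting the pieces together, the sketch below evaluates this combined loss for a single display. The representation of a display as a vector of item angles plus a logical vector of target locations, and the wrapped angular distance, are assumptions made for the example rather than the exact encoding used in the models.
\begin{verbatim}
# Illustrative combined loss (assumed display encoding: item angles plus a logical
# vector marking target locations; wrapped angular distance assumed for L_SE).
loss_SE <- function(ang, ang_hat) {
  d2 <- ((ang_hat - ang + pi) %% (2 * pi) - pi)^2   # wrapped squared differences
  sum(d2) / (length(ang) * pi^2)                    # normalized by the worst case
}
loss_loc <- function(is_target, is_target_hat) {
  sum(is_target_hat & !is_target) / sum(is_target)  # incorrect target flags / N_targets
}
combined_loss <- function(ang, ang_hat, is_target, is_target_hat, alpha) {
  (1 - alpha) * loss_SE(ang, ang_hat) + alpha * loss_loc(is_target, is_target_hat)
}
\end{verbatim}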
In summary, the full RDT models have three parameters (\(\mathcal {C}\), \(\tau\), and \(\alpha\)), and the full Bayesian models also have three parameters (\(\sigma\), \(\tau\), and \(\alpha\)).
Parameter fitting: For each model, we estimated its maximum likelihood parameter values based on trials from (i) the one-target condition, (ii) the two-target condition, and (iii) both conditions combined, using the optim function in the R programming environment. The likelihood of a model was given by
\begin{equation}
L(\phi ) = \prod _t p_{y_{\theta }, y_{loc}|x}\left(x_{resp}^{(t)}|x^{(t)}\right)
\end{equation}
where
\(\phi\) is the set of model parameters,
\(t\) indexes over trials, and
\(x_{resp}^{(t)}\) is a subject’s response on trial
\(t\). The probability
\(p_{y_{\theta }, y_{loc}|x}\) is the probability of the decision under a model and was given by a probability-matching rule (i.e., responses were chosen with frequency proportional to the probability that they are correct;
Da Silva et al., 2017;
Wozny et al., 2010;
Craig, 1976).
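As a sketch of this fitting procedure, the code below assembles the negative log-likelihood and passes it to optim. The function model_response_prob is a placeholder for whatever routine computes a model's decision probabilities for one trial, and the starting values are illustrative.
\begin{verbatim}
# Illustrative maximum-likelihood fitting with optim. model_response_prob is a
# placeholder returning the vector of decision probabilities p(y_theta, y_loc | x)
# for one trial under parameters phi; responses[[t]] indexes the subject's response.
neg_log_lik <- function(phi, stimuli, responses) {
  ll <- 0
  for (t in seq_along(stimuli)) {
    p_resp <- model_response_prob(phi, stimuli[[t]])
    ll <- ll + log(p_resp[responses[[t]]])
  }
  -ll
}
# fit <- optim(par = c(C = 2, tau = 0.5, alpha = 0.5), fn = neg_log_lik,
#              stimuli = stimuli, responses = responses)
\end{verbatim}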
Visualizing the optimal noisy channel: Figures 2 and
3 qualitatively visualize basic predictions of the optimal lossy channel.
Figure 2 shows how channel output probabilities
\(p(\hat{x}|x)\) vary as a function of capacity. In particular, more probability mass is concentrated on
\(x=\hat{x}\) as channel capacity is increased.
Figure 3 shows how the same probabilities vary as a function of what information is emphasized in the loss function for a fixed capacity. The top row corresponds to
\(\alpha =0\), which penalizes the squared distance between the stimulus and response in angle space and thus assumes subjects care about both tilt and location. The bottom row corresponds to
\(\alpha =1\), which only penalizes location errors and thus assumes subjects only care about location. Intermediate values of
\(\alpha\) would interpolate between the top and bottom rows.