The original integrated reweighting theory (IRT; Dosher et al., 2013) was developed to account for perceptual learning and transfer across changes in the stimulus (e.g., a change in orientation) or in spatial location in two-alternative pattern discrimination tasks. The I-IRT model (Liu, Lu, & Dosher,
submitted) extends predictions to identification by introducing
n mini-decision units, one for each possible response; the final response on a trial is determined by the mini-decision unit with the highest activation. As in the original IRT, the input stimulus is encoded as a pattern of activity in spatial-frequency- and orientation-tuned representation units at both location-specific and location-independent levels. The model uses hybrid learning rules: unsupervised Hebbian learning augmented by feedback supervision when available (from the augmented Hebbian reweighting model, or AHRM; Petrov et al., 2005, 2006), which accounts for learning outcomes both with and without feedback. The simulations exactly mimic the details of the experiments, using the same program to generate the stimulus images and the same numbers of trials and randomization, and the simulated data are then processed as in the behavioral experiment (here, as proportion correct or threshold, and as confusion matrices). The model is implemented in Matlab (The MathWorks, Inc., Natick, MA, USA). We briefly summarize the model here, including equations and descriptions found in the original IRT papers (
Dosher et al., 2013;
Liu et al., 2014;
Liu, Dosher, & Lu, 2015). The descriptions of the model equations below are necessarily similar to their treatment in Dosher et al. (2013) and other papers using the two-alternative IRT.
The I-IRT, like the IRT, has a representation module, a decision module, and a learning module. The representation module processes the stimulus images from the experiment to compute the activities in location-specific and location-invariant representations (sometimes cited as analogous to visual areas V1 and V4 or IT). The input image, I(x, y), is defined as the sum of the signal and noise images for a given trial, corresponding to the visual system's temporal integration of the signal and noise frames. The input image is then convolved with the filter characterizing each spatial-frequency/orientation-tuned representation unit using a fast Fourier transform, followed by half-squaring rectification, to produce phase-sensitive activation maps analogous to “simple cells”:
\begin{eqnarray*}S\left( {x,y,\theta ,f,\phi } \right) = [R{F_{\theta ,f,\phi }}\left( {x,y} \right) \otimes I\left( {x,y} \right)]_ + ^2.\end{eqnarray*}
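The filtering and rectification step can be sketched as follows. This is an illustrative Python stand-in (the published implementation is in Matlab), with hypothetical array shapes; `image` and `filt` play the roles of I(x, y) and one filter RF.

```python
import numpy as np

def half_squaring_filter_response(image, filt):
    """Convolve the input image I(x, y) with one representation unit's
    filter RF via the FFT, then apply half-squaring rectification,
    S = [RF * I]_+^2. `image` and `filt` are same-sized 2-D arrays;
    both are placeholders for this sketch."""
    # Circular convolution via the convolution theorem.
    conv = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(filt)))
    # Half-squaring: keep only positive responses, then square them.
    return np.maximum(conv, 0.0) ** 2
```

In the full model this computation is repeated for every combination of orientation, spatial frequency, and spatial phase to build the maps S(x, y, θ, f, ϕ).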
In this implementation the orientation/spatial-frequency filters at each spatial point sample 12 spatial-frequency bands (every 1/2 octave) centered at [0.22, 0.31, 0.43, 0.62, 0.88, 1.23, 1.75, 2.46, 3.51, 4.92, 7.01, and 9.85 cycles/degree] × 12 orientation bands (every 15 degrees) centered at [0°, ±15°, ±30°, ±45°, ±60°, ±75°, and +90° (=−90°)] × four spatial phases [0°, 90°, 180°, and 270°]. In the location-specific representations, the spatial-frequency tuning is set at hf = 1 octave and the orientation tuning at hθ = 30 degrees (half-amplitude full bandwidth), based on cellular physiology in primary visual cortex. In the location-invariant representations, the bandwidths were set at twice those of the location-specific units, since cells in higher visual areas are more broadly tuned to spatial frequency and orientation; they may also have more internal noise. (These representation-module parameter values have been used in many model applications in the IRT framework.) The phase-sensitive maps S(x, y, θ, f, ϕ) are pooled over spatial phases to create phase-invariant energy maps: E(x, y, θ, f) = ∑S(x, y, θ, f, ϕ) + ε1, where ε1 is internal Gaussian noise (mean 0, standard deviation σ1). These maps then undergo nonlinear inhibitory normalization: \(C( {x,y,\theta ,f} ) = \frac{{aE( {x,y,\theta ,f} )}}{{k + N( f )}}\). The normalization pool N(f) sums over all orientations, with slight tuning for similar spatial frequencies, consistent with physiological and psychophysical evidence. The saturation constant k prevents division by zero at very low contrasts, when the normalization pool is very small; it can be set to 0 in the current experiment, which uses medium-to-high contrasts. The parameter a is a scaling factor that can shift the range of the final activation values.
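The phase pooling and normalization steps can be sketched as follows, assuming the phase-sensitive maps S have already been computed. This simplified sketch sums the normalization pool N(f) over all orientations at each frequency, omitting the slight frequency tuning described above; the array shapes and the helper name are hypothetical.

```python
import numpy as np

def energy_and_normalize(S, a=1.0, k=1e-6, sigma1=0.1, rng=None):
    """Pool phase-sensitive maps S(x, y, theta, f, phi) over the four
    spatial phases into energy maps E, add Gaussian noise eps_1, and
    apply divisive normalization C = a * E / (k + N(f)).

    S has shape (x, y, n_theta, n_f, n_phi). Here N(f) simply sums E
    over orientations at each frequency (an illustrative simplification)."""
    rng = np.random.default_rng() if rng is None else rng
    # Phase-invariant energy maps with additive internal noise.
    E = S.sum(axis=-1) + rng.normal(0.0, sigma1, S.shape[:-1])
    # Normalization pool: sum over all orientations, per spatial frequency.
    N = E.sum(axis=2, keepdims=True)
    return a * E / (k + N)
```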
To compress the number of representations, the normalized phase-insensitive maps C(x, y, θ, f) are pooled over space around the target stimulus with a Gaussian kernel of radius Wr, and Gaussian additive noise (mean 0, standard deviation σ2) is added to the system: \(A^{\prime}( {\theta ,f} ) = \sum_{x,y} {W_r}( {x,y} ) C( {x,y,\theta ,f} ) + {\varepsilon _2}\). Then, a nonlinear function with a gain parameter γ limits the activation of each representation to the range (0, Amax):
\begin{equation*}
A( {\theta ,f} ) = \left\{ \begin{array}{@{}c@{\quad}l@{}}
\frac{{1 - {e^{ - \gamma A^{\prime}}}}}{{1 + {e^{ - \gamma A^{\prime}}}}}{A_{max}}, & \text{if}\;A^{\prime} \ge 0\\
0,& \text{otherwise.} \end{array}\right.
\end{equation*}
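The spatial pooling and compressive nonlinearity can be sketched together, assuming the normalized maps C from the previous step. The kernel below stands in for the Gaussian weighting W_r; its exact form and the parameter values are placeholders.

```python
import numpy as np

def pooled_activation(C, kernel, sigma2=0.1, gamma=1.0, a_max=1.0, rng=None):
    """Pool normalized maps C(x, y, theta, f) over space with a weighting
    kernel W_r(x, y), add Gaussian noise eps_2, and compress through the
    sigmoid so each activation A(theta, f) lies in (0, A_max).

    C: (x, y, n_theta, n_f); kernel: (x, y) nonnegative weights."""
    rng = np.random.default_rng() if rng is None else rng
    # A'(theta, f) = sum_{x,y} W_r(x, y) * C(x, y, theta, f) + eps_2
    A_prime = np.tensordot(kernel, C, axes=([0, 1], [0, 1]))
    A_prime += rng.normal(0.0, sigma2, A_prime.shape)
    # A = A_max * (1 - e^{-gamma A'}) / (1 + e^{-gamma A'}) if A' >= 0, else 0.
    A = a_max * (1 - np.exp(-gamma * A_prime)) / (1 + np.exp(-gamma * A_prime))
    return np.where(A_prime >= 0, A, 0.0)
```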
Large caches of activation patterns over these representations are computed for different contrasts and samples of external noise for use in the trial-by-trial simulations of the experiments.
The decision module uses eight mini-decision units, one for each spatial-frequency response. On every trial, each mini-decision unit is driven by the weighted activations from the representation units, input from a bias unit, and internal noise, leading to a noisy decision variable: \({u_i} = \sum_{j = 1}^{96} {w_{ji}}A( {{\theta _{ji}},{f_{ji}}} ) - {w_b}{b_i} + {\varepsilon _d}\). The wji are the current weights connecting the representation units to mini-decision unit i, bi is a bias term weighted by wb, and εd (Gaussian, mean 0, standard deviation σd) is the (same) decision noise for each mini-decision unit i. A sigmoidal function with parameter γ transforms this into the “early” post-synaptic decision activation: \({o_i}^{\prime} = G( {{u_i}} ) = \frac{{1 - {e^{ - \gamma {u_i}}}}}{{1 + {e^{ - \gamma {u_i}}}}}{A_{max}}\). A maximum rule then selects the final response.
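The decision step can be sketched as follows, with the 96 representation activations flattened into a vector. Function and variable names are placeholders for this illustration, not the model code.

```python
import numpy as np

def decide(A, W, w_b, bias, sigma_d=0.1, gamma=1.0, a_max=1.0, rng=None):
    """Sketch of the decision module: each mini-decision unit i forms
    u_i = sum_j w_ji * A_j - w_b * b_i + eps_d, passes it through the
    sigmoid G, and the unit with the highest activation determines the
    response (the maximum rule).

    A: flattened representation activations (e.g., 96 values);
    W: (n_units, n_representations) weight matrix; bias: per-unit b_i."""
    rng = np.random.default_rng() if rng is None else rng
    u = W @ A - w_b * bias + rng.normal(0.0, sigma_d, W.shape[0])
    # "Early" post-synaptic activation o'_i = G(u_i).
    o_early = a_max * (1 - np.exp(-gamma * u)) / (1 + np.exp(-gamma * u))
    return int(np.argmax(o_early)), u, o_early
```

Because G is monotonic, taking the maximum over o′ is equivalent to taking it over u; the decision noise εd is what makes the response stochastic from trial to trial.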
The learning module updates the weights between the representation units and the mini-decision units on every trial. The decision variable of each mini-decision unit, ui, is shifted toward the correct response (provided by the response feedback in the experiments) to generate the “late” post-synaptic activation, oi = G(ui + wfF), which moves the weights in the right direction. With a high feedback weight wf, the “late” decision activation approaches the correct output (±Amax = ±1), which in turn improves learning. (If no feedback signal is available, which never occurred in the experiments here, F = 0 and learning relies on the unsupervised early decision value, o = o′, which is often less efficient.) Feedback was implemented at each of the mini-decision units. Because the feedback provides the correct response, the algorithm always sets F = 1 for the mini-decision unit corresponding to the correct response and F = −1 for all other mini-decision units, regardless of whether the observer's response was correct.
Weight changes are determined by Δwi = (wi − wmin)[δi]− + (wmax − wi)[δi]+, where \({\delta _i} = \eta A( {{\theta _i},{f_i}} )( {o - \bar o} )\). Here A(θ, f) is the pre-synaptic activation and \(( {o - \bar o} )\) is the difference between the post-synaptic activation and its long-term average \(\bar o\), weighted exponentially over approximately the last 50 trials: \(\bar o( {t + 1} ) = \rho \,o( t ) + ( {1 - \rho } )\bar o( t )\), with ρ = 0.02; wmin and wmax are the lower and upper bounds on the weights (to prevent the weights from exploding). Each mini-decision unit also receives input from a bias term b that balances the response frequencies, also exponentially time-weighted with the same time constant ρ = 0.02: \(\bar r( {t + 1} ) = \rho \,R( t ) + ( {1 - \rho } )\bar r( t )\), with b(t + 1) = \(\bar r( t )\). Here, R(t) is 1 for the actual response and −1/7 for the other potential responses. The bias input works against unbalanced response frequencies.
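The bounded Hebbian weight update above can be sketched as follows, assuming the feedback-shifted late activations o and their running averages \(\bar o\) have already been computed for each mini-decision unit. This is a hedged illustration following the equations in the text, not the published implementation.

```python
import numpy as np

def update_weights(W, A, o_late, o_bar, eta=0.001, w_min=-1.0, w_max=1.0):
    """Augmented Hebbian update for the weights into each mini-decision
    unit i: delta_ij = eta * A_j * (o_i - o_bar_i), with the soft bounds
    dW = (W - w_min) * [delta]_- + (w_max - W) * [delta]_+ so negative
    deltas shrink weights toward w_min and positive deltas grow them
    toward w_max.

    W: (n_units, n_reps); A: pre-synaptic activations A(theta, f);
    o_late, o_bar: per-unit late activation and its running average."""
    delta = eta * np.outer(o_late - o_bar, A)
    dW = (W - w_min) * np.minimum(delta, 0.0) \
        + (w_max - W) * np.maximum(delta, 0.0)
    # Clip as a safeguard; the soft bounds already keep W in range.
    return np.clip(W + dW, w_min, w_max)
```

The running averages themselves would be updated each trial as `o_bar = rho * o_late + (1 - rho) * o_bar` with ρ = 0.02, matching the exponential weighting in the text.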
The I-IRT model was fit to the data, whether proportion correct or threshold, by varying key parameters of the model, simulating the model 100 times, carrying out the same data analysis as for the behavioral data, and then comparing the mean simulated outcomes with the data. Most of the parameters were set a priori from prior applications of the IRT model, originally motivated by the physiology. A grid of parameter values (noise terms, scaling factor, model learning rate, and initial weights) was evaluated, centered on values from previously fitted applications of the model and spanning roughly a 10- to 100-fold range. The parameter space was then searched heuristically on a finer grid in the regions yielding a higher quality of fit (least squared error). Occasionally, when multiple parameter combinations yielded satisfactory fits (equivalent r2), the one most consistent with prior applications was selected. The simulated results are shown as a region of ±1 standard deviation around the mean prediction, computed from the 100 learning curves simulated with the set of best-fitting parameter values, with the quality of the fit summarized by r2.