Perceptual multistability refers to the phenomenon of spontaneous perceptual switching between two or more likely interpretations of an image. Although frequently explained by processes of adaptation or hysteresis, we show that perceptual switching can arise as a natural byproduct of perceptual decision making based on probabilistic (Bayesian) inference, which interprets images by combining probabilistic models of image formation with knowledge of scene regularities. Empirically, we investigated the effect of introducing scene regularities on Necker cube bistability by flanking the Necker cube with fields of unambiguous cubes that are oriented to coincide with one of the Necker cube percepts. We show that background cubes increase the time spent in percepts most similar to the background. To characterize changes in the temporal dynamics of the perceptual alternations beyond percept durations, we introduce Markov Renewal Processes (MRPs). MRPs provide a general mathematical framework for describing probabilistic switching behavior in finite state processes. Additionally, we introduce a simple theoretical model consistent with Bayesian models of vision that involves searching for good interpretations of an image by sampling a posterior distribution coupled with a decay process that favors recent over old interpretations. The model has the same quantitative characteristics as our human data, and variation in model parameters can capture between-subject variation. Because the model produces the same kind of stochastic process found in human perceptual behavior, we conclude that multistability may represent an unavoidable byproduct of normal perceptual (Bayesian) decision making with ambiguous images.

“*V*_{A}” indicates the percept was consistent with the *viewpoint* of the subject being *above* the cube in 3D space (resulting in the top surface perceived as in front), and “*V*_{B}” indicates that the percept was consistent with a *viewpoint* *below* the cube and hence viewing the bottom surface. At the sound of a beep, if the user was currently experiencing the *V*_{A} percept, they were instructed to press the “up” arrow key, and if they were experiencing the *V*_{B} percept, they were to press the “down” arrow key. If any percept other than *V*_{A} or *V*_{B} appeared, they were instructed not to press any key.

*x*-axis and 55 degrees away from the *y*-axis such that vertically and horizontally it subtended a visual angle of 3.16 degrees. Participants were directed to maintain fixation on a tiny green dot rendered in the center of the cube. The user's chin rested on a chin rest that was 71 cm away from the center of the 24.2-cm-high monitor.

- 40 repetitions of *V*_{A}/*V*_{B} key association practice, in which a single unambiguous cube was presented in the middle of the screen and the subject was asked to indicate with a key press which percept they were experiencing.
- 40 repetitions of a task designed to help participants discriminate between orientations around the *x*-axis. We refer to this as the “orientation discrimination” task. Two columns of 100 unambiguously oriented cubes appeared as shown in Figures 2 and 3. The two columns differed in orientation from the Necker cube around the *x*-axis by random offsets with opposite (randomly chosen) signs (e.g., +6 and −3.5 degrees). The offsets were sampled from a Gaussian distribution with a mean of 7 degrees and a standard deviation of 1 degree. The user was asked to indicate which column appeared more fronto-parallel.
- The Necker cube was presented for 1 minute. The user was asked to passively focus on the green dot in the middle of the screen. No beeps were sounded and no key presses were required.
- The Necker cube was presented in the middle of the screen for 20 s. Each beep followed the registration of the previous response after a delay drawn from a Gaussian distribution with a mean of 0.5 s and a standard deviation of 100 ms. The subject was asked to respond quickly with a key press at the sound of the beep, which resulted in an average inter-response interval of 1.2 s. Five 20-s blocks were collected from each subject, which constituted the *base-rate* data for the *no-context* condition.

*blocks* where each block consisted of 10 *trials*. During a block, the Necker cube appeared in the middle of the screen and was flanked by two columns of 100 unambiguously oriented cubes on either side of the cube as shown in Figures 4 and 5, under the same conditions as the orientation discrimination task described in the training session. The orientations of the cubes in the two columns were either both *V*_{A} or both *V*_{B}. The orientation of the cubes for a block was randomly permuted such that at the end of 100 blocks, 50 of them were *V*_{A} oriented and 50 of them were *V*_{B} oriented. In a block, the beep delays were drawn from a Gaussian distribution with a mean of 0.5 s and a standard deviation of 100 ms (as in the training session). It took an average of 0.5 s for the user to respond to a beep (by pressing an arrow key). Therefore, each block lasted 10–11 s, as 10 arrow-key presses were recorded per block. If the subject took longer than 1.2 s to respond to a beep, a double beep was sounded and the block was restarted. A 2- to 3-minute break was provided at the end of 20 blocks.
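The block design above can be illustrated with a minimal sketch. The function names are hypothetical, and clipping negative delays at zero is an assumption, since the text does not say how such (rare) draws were handled.

```python
import random

def make_block_schedule(n_blocks=100, rng=None):
    """Shuffled list with equal numbers of 'VA' and 'VB' context blocks."""
    rng = rng or random.Random()
    schedule = ["VA"] * (n_blocks // 2) + ["VB"] * (n_blocks // 2)
    rng.shuffle(schedule)
    return schedule

def beep_delays(n_trials=10, mean_s=0.5, sd_s=0.1, rng=None):
    """Delays (s) between a registered response and the next beep."""
    rng = rng or random.Random()
    # Assumption: negative Gaussian draws are clipped to zero.
    return [max(0.0, rng.gauss(mean_s, sd_s)) for _ in range(n_trials)]

def expected_block_duration(n_trials=10, mean_delay_s=0.5, mean_rt_s=0.5):
    """Expected block length: each trial is one beep delay plus one response time."""
    return n_trials * (mean_delay_s + mean_rt_s)
```

With the reported means (0.5 s delay, 0.5 s response time, 10 trials), the expected duration is 10 s, consistent with the 10–11 s block length stated above.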

*k*th block of responses, let *D*^{k} = {(*r*_{1}, *t*_{1}), (*r*_{2}, *t*_{2}), …, (*r*_{N}, *t*_{N})}, where *r*_{j} denotes responses, encoded as 0 for *V*_{A} and 1 for *V*_{B}, and *t*_{j} represents a recorded response time measured from the start of the response period. We also use superscripts to indicate the block number of a measurement (e.g., *t*_{j}^{k} represents the *j*th measurement from the *k*th block). All of our measures were conditioned on *starting times,* defined for each sequence as:

*i* after an elapsed time *t*:

*t,* there may have been any number of state transitions away from *i,* as long as the state returns to *i* by time *t*.

*t,* we interpret the response/time sequences as discrete samples from the continuous survival probability function. We classify response events for each block based on whether the initial state was *V*_{A} or *V*_{B} and linearly interpolate between response times. Let *S*^{k}(*t*) represent the interpolated indicator function for the *k*th block. For each *t* falling between two response times, *S*^{k}(*t*) is given by (let the subscripts 0 and 1 in *S*^{k}(*t*) represent states *V*_{A} and *V*_{B}, respectively):

*V*_{A}. The interpolated indicator function for initial response of type *V*_{B} is similar.

*t* greater than any of the response times. Instead, times exceeding the last response are treated as missing data for that block. To keep track of which time points are valid after *τ*_{0}^{k}, we compute a second indicator variable that marks valid time points:

*P*_{S}) for state *i* is computed as:

*P*_{R}) are similar but simpler to estimate. The recorded response sequences for each block were interpolated and averaged, incorporating only valid times as above.
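The interpolate-then-average procedure can be sketched as follows. This is an illustration under stated assumptions, not the estimator used in the study: time is measured from each block's first response, only blocks whose first response matches the state of interest are included, and the function names are hypothetical.

```python
def interpolated_indicator(block, state, t):
    """Linearly interpolated indicator that `state` is reported at time t.

    `block` is a list of (response, time) pairs sorted by time. Time t is
    measured from the block's first response (an assumption). Returns None
    when t lies beyond the last response, which is treated as missing data.
    """
    t0 = block[0][1]
    for (r_a, t_a), (r_b, t_b) in zip(block, block[1:]):
        a, b = t_a - t0, t_b - t0
        if a <= t <= b:
            s_a = 1.0 if r_a == state else 0.0
            s_b = 1.0 if r_b == state else 0.0
            if b == a:
                return s_b
            return s_a + (s_b - s_a) * (t - a) / (b - a)
    return None

def survival_probability(blocks, state, t):
    """Average the indicator over blocks that start in `state` and are valid at t."""
    values = []
    for block in blocks:
        if block[0][0] != state:
            continue  # condition on the initial state
        s = interpolated_indicator(block, state, t)
        if s is not None:  # skip blocks with no valid data at time t
            values.append(s)
    return sum(values) / len(values) if values else None
```

Returning `None` for invalid times plays the role of the second indicator variable described above: a block contributes to the average only at time points it actually covers.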

*τ*_{0}^{k} and *τ*_{1}^{k} (where subscripts 0 and 1 represent states *V*_{A} and *V*_{B}). Note that the time spent in a state is the same as the time observed before a transition. To construct an indicator function for a transition from a state, *V*_{A} or *V*_{B}, we classify the responses subsequent to the first response in each block as preceding or following the first observed state transition. Beginning with the initial state, subsequent responses are classified as 0 until the first state-change response is observed. The remaining responses in the block are classified as 1. The classification produces an indicator variable *Z*^{k}(*t*). Let *Z*_{0}^{k} denote the transition sequence described above for state *V*_{A}. Linear interpolation is used to construct indicator functions for cumulative probability until transition. For the cumulative transition probability of state *V*_{A}, the indicator function *T*_{0}^{k}(*t*) is given by:

*P*_{T}) by averaging indicator functions across blocks:

*N*_{i}(*t*) is the number of transitions from state *i* up to time *t* after the start time. Note that there is an event equivalence between this event and the time *D*_{i} continuously spent in a state. Thus, another way of expressing Equation 7a is:

*P*_{T} computes the cumulative probability that the duration of a percept is less than *t*. Phase duration means and variances were computed directly from numerical estimates of the moments of the cumulative transition probability curves.
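One standard way to obtain moments from a sampled cumulative distribution, shown here as a sketch rather than as the procedure actually used, relies on the identities E[D] = ∫(1 − F(t)) dt and E[D²] = ∫2t(1 − F(t)) dt for non-negative durations. The trapezoid rule and the function name are assumptions.

```python
import math

def moments_from_cdf(times, cdf):
    """Mean and variance of a non-negative duration from its sampled CDF.

    Uses E[D] = integral of (1 - F) and E[D^2] = integral of 2 t (1 - F),
    both evaluated with the trapezoid rule on the supplied time grid.
    """
    mean = 0.0
    second = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        s0, s1 = 1.0 - cdf[i - 1], 1.0 - cdf[i]
        mean += 0.5 * (s0 + s1) * dt
        second += 0.5 * (2.0 * times[i - 1] * s0 + 2.0 * times[i] * s1) * dt
    return mean, second - mean ** 2

# Usage: an exponential CDF with rate 0.5 has mean 2 and variance 4.
grid = [0.01 * i for i in range(6001)]
exp_cdf = [1.0 - math.exp(-0.5 * t) for t in grid]
mean, var = moments_from_cdf(grid, exp_cdf)
```

The estimate is only as good as the coverage of the grid: if the curve has not approached 1 by the last time point, both moments are underestimated.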

*τ*_{0}^{k} and *τ*_{1}^{k} occurred in the first 1–3 s would change the estimates but also found no significant effect.

*V*_{A} percept durations but may also decrease *V*_{B} percept durations. We would also expect to see an initial bias toward *V*_{A} in the first few responses.

*P*_{R}) as a function of time for view-from-below responses *V*_{B} (i.e., the percentage of time participants will respond *V*_{B} at the sound of the beep). The base-rate/no-context data (blue curve) show that participants are quite strongly biased toward *V*_{A} percepts (viewpoint from above) for initial responses made early in a response block, and that the curve asymptotes at approximately 40% after 5–6 s. The fact that the asymptotic value is not 50% shows that percepts were biased toward viewpoint from above on average without context. The red curve (*V*_{A} context) shows increased *V*_{A} responses to *V*_{A} context. The green curve shows increased *V*_{B} responses to *V*_{B} context. However, while *V*_{B} context strongly biased initial responses toward view from below, *V*_{A} context stimuli do not significantly change initial bias over baseline. Overall, *V*_{B} context produces a larger change in response behavior than *V*_{A} context. *V*_{A} context may produce a smaller shift because of the bias toward *V*_{A} responses in the no-context data.

*P*_{T}) and survival probability curves (*P*_{S}), where both are conditioned on the initial perceptual state. Based on response probability results that show increased responses to percepts that match the context, we would expect context to increase time until transition and increase survival probabilities for context-consistent perceptual states.

*t*”) (*P*_{T}) and survival probability (*P*_{S}) data are shown in Figures 7 and 8. In both figures, panel a shows results for initial *V*_{A} percepts and panel b shows results for initial *V*_{B} percepts. Clear effects of context for both types of percept are observed. When context is consistent with the initial percept, this initial percept takes longer to transition (i.e., consistent percepts are *less likely* to transition with consistent context) and survival probabilities increase. In addition, inconsistent context has the effect of suppressing the unsupported percept, increasing the likelihood of transition. However, suppressive effects are only large for view-from-below initial perceptual states *V*_{A}. This might be related to biases toward *V*_{A} perceptual states found in the baseline conditions. For example, because *V*_{A} percepts are more frequent in baseline viewing (in terms of durations and initial frequencies), there may be a ceiling effect for suppression. The relationship between the two measures is shown in the Supplementary data, “Markov Renewal Process for Bistability.”

*inferences* about the world given varying levels of uncertain retinal input. Inferences resulting from Bayesian statistics have been widely employed to describe visual processing from the level of neuronal behavior (Pouget, Dayan, & Zemel, 2003) to high-level visual processing (Yuille & Kersten, 2006). The purpose of this section is to show how bistable switching behavior might also arise from probabilistic models of perceptual inference and decision making. By providing a simple method for generating spontaneous switching in a Bayesian framework, we offer an alternative to the idea that explanations of perceptual bistability require specialized neural mechanisms whose job is to implement switching. At minimum, theories of bistability should provide explanations of (1) the existence of multiple interpretations, (2) the awareness of only one interpretation at a time, and (3) spontaneous switching between percepts.

*θ* from which the image *I* of the cube was taken (we do not address the other part, which involves an inference about the shape of the wireframe object). Given that the object's shape is inferred as cuboidal, the posterior probability *P*(*θ*∣*I*) will be bimodal (see Figure 10), with equal peaks given a uniform prior on viewpoint. Bayesian inference can also incorporate viewpoint information (gathered from experience or background context) to bias perception toward more probable views of the cube. Previous work by Mamassian and Landy (1998) shows that human observers are biased toward a viewpoint from above (our “no context” condition data suggest this as well). In Bayesian models, percepts result from selecting a good interpretation, typically by choosing the viewpoint with maximal posterior probability. When the maximum is not unique, Bayesian decision theory provides no mechanism for choosing between equally good interpretations. Thus, while criteria 1 and 2 are natural byproducts of using Bayesian decision theory, criterion 3 (the spontaneous switching of percepts) is not.

*neurally implemented*. In particular, Lee and Mumford (2003) and Pouget et al. (2003) point out that probabilistic population coding models of cortex can be thought of as representing probability distributions in terms of samples from the distribution. In this approach, receptive fields of neurons in the population encode regions of the interpretation parameter space, and firing rates encode the probability of a sample. Whether or not this interpretation of neural coding is an accurate reflection of biology, sampling constitutes a fundamental method for implementing Bayesian decisions, and the brain's computations may perform something equivalent. In the next section, we explain how switching can spontaneously arise in Bayesian decisions based on sampling.

- Posterior probabilities are represented by a set of samples, updated across time. In particular, at a sequence of times *t*_{i} (on a neural scale), the brain explores the posterior distribution by collecting a new set of samples and discarding the oldest samples. Samples consist of both the parameter values *θ*(*t*_{i}) and the associated posterior probabilities *w*(*t*_{i}), i.e., {(*θ*(*t*_{1}), *w*(*t*_{1})), …, (*θ*(*t*_{N}), *w*(*t*_{N}))}, where *w*(*t*_{i}) = *P*(*θ*(*t*_{i})∣*I*) are *weights* that represent the posterior probabilities when first sampled and *t*_{i} are consecutive time points. Weights represent the quality of the interpretation associated with a sample, and they are discounted by a memory decay process.
- Memory decay expresses the idea that the quality of an old interpretation decreases with time. We assume simple exponential discounting of a memory sample's weight by the age of the sample.
- Perceptual decisions result from choosing the sample with the highest discounted weight. This sample's parameter value is chosen as the current interpretation and brought into memory. The memory sample's discounted posterior probability *w*(*t*_{i}) represents the quality of the interpretation; this is also stored in memory. The interpretation process is illustrated in Figure 11. These basic assumptions produce spontaneous perceptual switching similar to human behavior.

*I* of a Necker cube is theoretically compatible with the projection of an infinite number of polyhedra, depending on the depths of its 8 vertices. However, human perception is dominated by two interpretations, both cuboidal in shape, that involve two different poses or viewpoints on the object. Therefore, we simplified the 8-dimensional Necker cube vertex-depth space into a 1D viewpoint (elevation) parameter space *θ* of the observer (we will henceforth refer to the viewpoint elevation parameter as just the viewpoint parameter for brevity). The viewpoint likelihood defined by *P*(*I*∣*θ*) is modeled as the sum of two Gaussians centered on each of the two Necker cube interpretation modes, *θ*_{VA} and *θ*_{VB}. Assuming a Gaussian prior on viewpoint centered on *θ*_{M}, the posterior *P*(*θ*∣*I*) is proportional to *P*(*I*∣*θ*)*P*(*θ*).
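A minimal sketch of this posterior, up to normalization, is given below. The mode locations (−1 and +1) and the likelihood standard deviation (0.25) follow the rescaled values reported later in the text; the prior standard deviation of 1.0 is a placeholder assumption, and the function names are hypothetical.

```python
import math

def gaussian(x, mu, sd):
    """Gaussian probability density at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def likelihood(theta, theta_va=-1.0, theta_vb=1.0, sd=0.25):
    """P(I | theta): sum of two Gaussians centered on the two percept modes."""
    return gaussian(theta, theta_va, sd) + gaussian(theta, theta_vb, sd)

def unnormalized_posterior(theta, theta_m=0.0, prior_sd=1.0):
    """P(theta | I) up to a constant: likelihood times a Gaussian prior."""
    return likelihood(theta) * gaussian(theta, theta_m, prior_sd)
```

With the prior centered between the modes (`theta_m=0.0`) the two posterior peaks are equal; shifting `theta_m` toward −1 raises the *V*_{A} peak relative to the *V*_{B} peak.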

*e*^{−t/τ} that discounts the weights attached to samples as a function of their age *t*. The rate of decay is determined by the parameter *τ*. The sampling rate was set to 2 samples/s.

*n*” samples are drawn i.i.d. from the posterior distribution. The number “*n*” is set to 2*τ*. The order in which we process the samples is indicative of the temporal nature of the sampling. The posterior probability of each sample is multiplied by the memory decay function (*e*^{−t/τ}), where “*t*” indexes the samples. The sample with the largest weight is selected as the current percept; the orientation of this sample represents the perceived orientation of the Necker cube. Subsequently, new samples are drawn in the same way and added to the current set of samples, and samples older than 2*τ* seconds are discarded.
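The sampling-with-decay procedure can be sketched as a short simulation. Several choices here are assumptions made for illustration and are not taken from the text: time is counted in sample steps, one new sample is drawn per step, the posterior is sampled on a discrete grid, the prior standard deviation is 1.0, and a sample is labeled *V*_{A} when its viewpoint is negative.

```python
import math
import random

def posterior(theta, theta_m, prior_sd, lik_sd=0.25):
    """Unnormalized posterior: two-Gaussian likelihood times Gaussian prior."""
    lik = sum(math.exp(-0.5 * ((theta - m) / lik_sd) ** 2) for m in (-1.0, 1.0))
    prior = math.exp(-0.5 * ((theta - theta_m) / prior_sd) ** 2)
    return lik * prior

def simulate_percepts(n_steps, tau, theta_m=-0.1, prior_sd=1.0, seed=0):
    """Return the percept ('VA' or 'VB') selected at each sample step."""
    rng = random.Random(seed)
    grid = [-2.0 + 0.01 * i for i in range(401)]
    probs = [posterior(th, theta_m, prior_sd) for th in grid]
    window = 2.0 * tau          # samples older than 2*tau steps are discarded
    memory = []                 # entries: (birth_step, theta, initial_weight)
    percepts = []
    for step in range(n_steps):
        theta = rng.choices(grid, weights=probs, k=1)[0]
        memory.append((step, theta, posterior(theta, theta_m, prior_sd)))
        memory = [m for m in memory if step - m[0] <= window]
        # Discount each stored weight by exp(-age / tau) and keep the best.
        best = max(memory, key=lambda m: m[2] * math.exp(-(step - m[0]) / tau))
        percepts.append("VA" if best[1] < 0 else "VB")
    return percepts

# Usage: count percept switches in a short run.
run = simulate_percepts(2000, tau=5.0)
switches = sum(a != b for a, b in zip(run, run[1:]))
```

A switch occurs whenever a fresh sample from the other mode outweighs the decayed weight of the currently selected sample, so no dedicated switching mechanism is needed.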

*θ*_{M} fixed to reflect either context *V*_{A} or *V*_{B}, the relative heights of the two posterior peaks can be controlled by varying the prior variance *σ*^{2}, which adjusts the *ratio* of the prior probability values at the location of the two peaks of the likelihood:

*θ*_{VA} is the mean parameter value that represents the “view from above” percept and *θ*_{VB} is the mean that represents the “view from below” percept.

*ω* between 0.1 and 10 for each context condition. For simplicity, we rescaled the viewpoint axis so that the two Necker cube percepts corresponded to −1 (*V*_{A} percept at mode *θ*_{VA}) and +1 (*V*_{B} percept at mode *θ*_{VB}), and the likelihood standard deviations were set to 0.25. The mean of the prior, *θ*_{M}, was fixed at 0.1 or at −0.1, depending on which context condition we were simulating.
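For a Gaussian prior, the ratio of prior densities at the two modes has a closed form, which makes the link between the prior variance and *ω* explicit. The sketch below assumes the ratio is taken as the *V*_{A} value over the *V*_{B} value; the direction of the ratio and the function names are assumptions.

```python
import math

def prior_peak_ratio(sigma2, theta_m, theta_va=-1.0, theta_vb=1.0):
    """Gaussian prior density at theta_va divided by that at theta_vb."""
    num = (theta_vb - theta_m) ** 2 - (theta_va - theta_m) ** 2
    return math.exp(num / (2.0 * sigma2))

def variance_for_ratio(omega, theta_m, theta_va=-1.0, theta_vb=1.0):
    """Prior variance giving a target ratio omega (requires omega != 1)."""
    num = (theta_vb - theta_m) ** 2 - (theta_va - theta_m) ** 2
    return num / (2.0 * math.log(omega))
```

With the modes at −1 and +1 and the prior mean at −0.1, the ratio reduces to exp(0.2/*σ*^{2}), so smaller prior variances give larger ratios. A target ratio on the wrong side of 1 for a given prior mean yields a negative (invalid) variance.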

*τ,* which determines how quickly a sample's posterior probability decays. Assuming that 2 samples are generated every second, we varied the memory decay constant (*τ*) from 20 s to 0.0198 s.

*Q*_{ij}(*t*) = *π*_{ij}*F*_{ij}(*t*), which represent the probability that the process makes a transition to state *j* in time less than or equal to *t,* given the process just entered state *i*. MRPs transition between states via a Markov chain generated by the transition matrix *π*_{ij} = P(*s*_{N} = *j*∣*s*_{N−1} = *i*). The times between state transitions are described by temporal distribution functions *F*_{ij}(*t*) that depend on the current state and the next state entered.
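A two-state MRP of this form can be simulated in a few lines. The sketch assumes exponential temporal distributions, which is an illustrative choice and not part of the general MRP definition; the function and parameter names are hypothetical.

```python
import random

def simulate_mrp(pi, mean_time, n_events, start=0, seed=0):
    """Simulate a two-state Markov renewal process.

    pi[i][j]        : probability that the next state is j given current state i
    mean_time[i][j] : mean of the (assumed exponential) waiting time F_ij
    Returns a list of (state, entry_time) pairs, including self-transitions.
    """
    rng = random.Random(seed)
    state, t = start, 0.0
    events = [(state, t)]
    for _ in range(n_events):
        nxt = 0 if rng.random() < pi[state][0] else 1
        # The waiting time depends on both the current and the next state.
        t += rng.expovariate(1.0 / mean_time[state][nxt])
        state = nxt
        events.append((state, t))
    return events

# Usage: 0 -> 0 and 1 -> 0 twice as likely as 0 -> 1 and 1 -> 1.
events = simulate_mrp([[2 / 3, 1 / 3], [2 / 3, 1 / 3]], [[0.5, 0.5], [0.5, 0.5]], 1000)
```

Self-transitions (*i* → *i*) are kept as events, so the time spent in a percept is the sum of the waiting times between consecutive state changes.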

*π*_{ij} are a function of the relative probability mass under the posterior peaks. Because the sampling we used was independent (i.e., the transition to a new state is not dependent on the previous state), the rows in the transition matrix of the corresponding MRP model are equal to each other. The MRP temporal distribution functions *F*_{ij} correspond to how long a maximally weighted sample in the Bayesian model remains the maximum before replacement.

*π*_{ij}) and the probability of duration in each state given by entries of the temporal distribution matrix *F*_{ij}.

*F*_{ij} that are approximately exponential and the same for all four state transitions. In addition, because we used independent sampling, the transition probability is independent of the previous state: i.e., *π*_{ij} = *π*_{jj}. Thus, the MRP for bistability reduces to just 2 parameters (the probability of transition to *V*_{A} and the mean of the exponential temporal function).

*k*” sample times from the exponential distribution, where “*k*” is the number of self-state transitions that occur before a state-change transition. Because means and variances of i.i.d. random variables sum, we predict a correlation between the means and variances of percept durations. Moreover, the probability of self-state transitions plays the critical role in determining perceptual duration distributions. Each self-state transition contributes an exponentially distributed random time. If the number of self-state transitions were fixed rather than stochastic, percept durations would be gamma-distributed (the sum of “*k*” exponentials with rate parameter *λ* is gamma-distributed with mean *k*/*λ* and variance *k*/*λ*^{2}). Since *k* is random, the duration distributions are mixtures of exponentials, and the probability of self-state transition determines this mixture. Because the relative number of self-state transitions is determined by the posterior probabilities, changing the posterior probabilities will cause specific, predictable changes in the shape of the percept duration distributions, the survival probability functions, and the means and variances of the percept durations. We test these predictions in the results section below.
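The dependence of percept durations on the self-transition probability can be checked with a small Monte Carlo sketch. It assumes that each wait is exponential with a common rate and that after each wait the percept repeats with a fixed probability `p_self`; under these assumptions the number of waits is geometric and, by Wald's identity, the mean duration is 1/(rate × (1 − `p_self`)). Names and parameter values are illustrative.

```python
import random

def percept_duration(p_self, rate, rng):
    """One percept duration: exponential waits summed until a state change."""
    total = rng.expovariate(rate)
    # After each wait, the percept repeats with probability p_self.
    while rng.random() < p_self:
        total += rng.expovariate(rate)
    return total

def duration_stats(p_self, rate, n=20000, seed=0):
    """Sample mean and (unbiased) sample variance of n simulated durations."""
    rng = random.Random(seed)
    xs = [percept_duration(p_self, rate, rng) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var

# Usage: raising the self-transition probability lengthens percepts.
short_mean, short_var = duration_stats(p_self=0.5, rate=2.0)
long_mean, long_var = duration_stats(p_self=0.8, rate=2.0)
```

In this sketch the mean and the variance rise together as `p_self` increases, which is the kind of coupling between duration means and variances described above.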

*V*_{A} are posterior peak ratio *ω* = 1.1015 ± 0.0165 (1 *SD*) and *τ* = 0.0846 ± 0.02 s (1 *SD*). The mean estimate obtained is the 0.632+ estimate (Efron & Tibshirani, 1997). The mean maximum likelihood parameters for context *V*_{B} are posterior peak ratio *ω* = 0.9645 ± 0.0127 (1 *SD*) and *τ* = 0.0652 ± 0.01 s (1 *SD*). Since the memory decay parameter *τ* is similar between contexts (within 1 standard deviation), it does not explain the difference in behavior between contexts. Instead, the single parameter *ω* (the ratio of the posterior peak heights) explains the difference in cumulative transition and survival probability functions induced by context. We also fit the mean durations of both the subject data and the simulated data with gamma curves; none of the fits could be rejected.

*V*_{A}. From the best-matching simulations, we computed *π* from the percentage of memory update events that were of each transition type. The resulting transition matrix for *V*_{A} context was:

*π*_{00}) occurred at almost twice the rate of 0 → 1 (*π*_{01}), and similarly the second row shows that 1 → 0 (*π*_{10}) occurred at almost twice the rate of 1 → 1 transitions (*π*_{11}).

*F*_{ij}(*t*) for context *V*_{B} are shown in Figure 14. These curves were obtained by computing histograms of the amount of time between memory updates. Notice that they are similar to exponential functions and to each other, implying that they share the same temporal distribution mean. The temporal distribution functions for *V*_{A} context show similar behavior.

*V*_{A} and *V*_{B} percepts, effectively by changing the probability of self-state transitions. The simplicity of this model is surprising given the alternatives. For example, adaptation or fatigue models of bistability make no clear predictions about the effect of context: the distribution of percept durations could be almost anything, and the distributions could change in many ways.

*P*(*I*∣*θ*). A more realistic representation would incorporate likelihood functions that implement perspective projection constraints in the inference of 3D objects from 2D, the effects of eye movements on the image data, and priors embodying the typical biases found in interpreting line drawings, like preference for symmetry between 3D edges, or compactness. Additionally, the current system could be embedded in a hierarchical recurrent inference system similar to the one proposed by Lee and Mumford (2003). However, the simple simulations were sufficient to illustrate the idea that implementing Bayesian decision making via sampling can create human-like switching behavior.

*T*. Moreover, the rate of convergence conveys important information about the strength of dependence between subsequent states. By definition, survival probability functions capture exactly the information needed to assess both stationarity and the rate of convergence. For percept *V*_{B}, stationarity implies there is an elapsed time *T* such that the probability of the present state is independent of the past:

*T* is longer). This condition is the same as requiring that survival probability asymptotes agree, which follows by expressing the right-hand side of Equation 11 in terms of a survival probability function. By laws of conditional probability,

*T* such that survival probabilities agree according to the equation:

*T,* the survival probability functions should converge to the response probabilities, *if the process is stationary*. In contrast, there exist non-stationary processes (e.g., periodic processes) that can express long-range dependence between states and will not pass this test.
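The convergence test can be illustrated on a toy process. The sketch below uses a discrete-time two-state Markov chain, which is an assumption made purely for illustration (the data are modeled in continuous time), and checks that the two survival curves approach the same stationary response probability. The transition values are arbitrary.

```python
def n_step_survival(pi, state, n):
    """P(state at step n | state at step 0) for a two-state Markov chain."""
    dist = [0.0, 0.0]
    dist[state] = 1.0
    for _ in range(n):
        # Propagate the state distribution one step forward.
        dist = [dist[0] * pi[0][j] + dist[1] * pi[1][j] for j in (0, 1)]
    return dist[state]

def stationary_probability(pi, state):
    """Stationary probability of `state` for a two-state chain."""
    p01, p10 = pi[0][1], pi[1][0]
    stationary = [p10 / (p01 + p10), p01 / (p01 + p10)]
    return stationary[state]

# Usage: both curves converge to the stationary probability of state 0.
pi = [[0.9, 0.1], [0.2, 0.8]]
from_va = n_step_survival(pi, 0, 200)        # P(VA at n | VA at 0)
from_vb = 1.0 - n_step_survival(pi, 1, 200)  # 1 - P(VB at n | VB at 0)
```

For a stationary chain, `from_va` and `from_vb` agree with `stationary_probability(pi, 0)` once enough steps have elapsed; the number of steps needed reflects how strongly the present state depends on the initial one.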

*P*(*V*_{A}(*t* + *τ*)∣*V*_{A}(*τ*)) and the light curves show 1 − *P*(*V*_{B}(*t* + *τ*)∣*V*_{B}(*τ*)). Both context and baseline data appear stationary and converge within the data collection period, within measurement error. The rapid convergence observed in our Necker cube data contrasts with the results of Mamassian and Goutcher (2005) for binocular rivalry data. Using survival probability functions, they found rivalry states were strongly dependent on the initial state, and this dependence persisted for at least an order of magnitude longer than we found in our data.