Open Access
Article  |   July 2024
TORONTO: A trial-oriented multidimensional psychometric testing algorithm
Author Affiliations
  • Runjie Bill Shi
    Institute of Biomedical Engineering, University of Toronto, Toronto, Canada
    Temerty Faculty of Medicine, University of Toronto, Toronto, Canada
    bill.shi@mail.utoronto.ca
  • Moshe Eizenman
    Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Canada
    moshe.eizenman@utoronto.ca
  • Leo Yan Li-Han
    Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
    yyaann.li@mail.utoronto.ca
  • Willy Wong
    Institute of Biomedical Engineering, University of Toronto, Toronto, Canada
    Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
    willy.wong@utoronto.ca
Journal of Vision July 2024, Vol.24, 2. doi:https://doi.org/10.1167/jov.24.7.2
Abstract

Bayesian adaptive methods for sensory threshold determination were conceived originally to track a single threshold. When applied to the testing of vision, they do not exploit the spatial patterns that underlie thresholds at different locations in the visual field. Exploiting these patterns has been recognized as key to further improving visual field test efficiency. We present a new approach (TORONTO) that outperforms other existing methods in terms of speed and accuracy. TORONTO generalizes the QUEST/ZEST algorithm to estimate multiple thresholds simultaneously. After each trial, without waiting for a fully determined threshold, the trial-oriented approach updates not only the location currently tested but also all other locations based on patterns in a reference data set. Since the availability of reference data can be limited, techniques are developed to overcome this limitation. TORONTO was evaluated using computer-simulated visual field tests: In the reliable condition (false positive [FP] = false negative [FN] = 3%), the median test length and root mean square error (RMSE) of TORONTO were 153 trials and 2.0 dB, twice as fast as ZEST with equal accuracy. In the FP = FN = 15% condition, TORONTO terminated in 151 trials and was 2.2 times faster than ZEST with better RMSE (2.6 vs. 3.7 dB). In the FP = FN = 30% condition, TORONTO achieved 4.2 dB RMSE in 148 trials, while all other techniques had > 6.5 dB RMSE and terminated much more slowly. In conclusion, TORONTO is a fast and accurate algorithm for determining multiple thresholds under a wide range of reliability and subject conditions.

Introduction
The determination of sensory thresholds in psychophysical testing is time-consuming. Nonadaptive methods like the method of constant stimuli or the method of limits require a large number of trials, making the process inefficient. Staircase methods automate the process but are limited by predetermined stimulus step sizes. In contrast, Bayesian adaptive methods offer an efficient and accurate alternative: starting from an initial prior, they track a probability function over the threshold that is updated by Bayesian inference after each response. The original method is referred to as QUEST (Watson & Pelli, 1983), with later modifications made to the placement rule (i.e., ZEST) (King-Smith, Grigsby, Vingrys, Benes, & Supowit, 1994). Subsequently, extensions were made to QUEST to allow for greater flexibility in its use (Bak & Pillow, 2018; Lesmes et al., 2015). For example, the Ψ method of Kontsevich and Tyler extends the Bayesian algorithm to estimate both threshold and slope (Kontsevich & Tyler, 1999). More recently, Watson extended the QUEST method to permit the estimation of an arbitrary number of psychometric function parameters, among other modifications (Watson, 2017). 
However, these methods do not address the situation where multiple thresholds from different psychometric functions are tested. Examples include the visual field test (perimetry), which requires the determination of approximately 54 to 76 thresholds per eye, and pure-tone audiometry, which requires the determination of thresholds at 6 to 10 frequencies. Traditionally, each threshold is treated independently, leading to separate testing for each condition (e.g., a location in perimetry and a frequency in audiometry). However, thresholds at different conditions are often related. In the visual field, neighboring locations’ thresholds in the same hemifield tend to be strongly correlated due to the anatomical proximity of neighboring retinal nerve fiber bundles and their optic nerve head locations (Garway-Heath, Poinoosawmy, Fitzke, & Hitchings, 2000). In cases of hearing loss, presbycusis manifests as a gradual decline in sensitivity (elevated thresholds) at higher frequencies, producing characteristic audiogram configurations (Demeester et al., 2009). In both cases, results from testing at one condition (i.e., location or frequency) can increase the knowledge of thresholds at other conditions. Exploiting this relationship will increase testing efficiency while improving result accuracy. The Ψ and QUEST+ methods do not address the practical issue of how this interlocation dependency can be modeled or how the Bayesian prior information can be derived empirically in the case where there are many thresholds (e.g., 54) to be determined. 
In this article, we propose a new method that exploits the spatial correlation between different visual field locations’ thresholds in a reference data set, thereby enabling the estimation of thresholds from multiple psychometric functions simultaneously. Visual field tests are crucial for diagnosing and monitoring conditions like glaucoma. However, they can be time-consuming, and so there is a strong practical desire to increase their efficiency to improve patient experience and clinical workflow. We will examine the most commonly used pattern, 24-2, which assesses the differential light sensitivity thresholds measured in decibels (dB) across 54 locations in the visual field. However, the same method can be applied to other configurations like the 10-2, 24-2C, and 30-2 visual field patterns or even audiometric testing. 
One can think of two paradigms for exploiting the information contained in spatial correlations. For “threshold-oriented” methods, the classic approach is quadrant seeding (Turpin, McKendrick, Johnson, & Vingrys, 2003), which first establishes the estimates in the center of each quadrant (i.e., four parameters) and then propagates this initial estimate to neighboring locations in the same quadrant. Quick visual field map (qVFM) first estimates the general shape of the hill of vision and then switches to local testing based on a “switch algorithm” (Xu, Lesmes, Yu, & Lu, 2019). Gaussian processes can be used to approximate both global and local visual field patterns, but they tend to be computationally expensive and are not suitable for real-time use (Chesley & Barbour, 2020; Song et al., 2015). Sequentially optimized reconstruction strategy (SORS) involves the full determination of only a subset of all thresholds, with the rest of the visual field reconstructed from a data-driven reconstruction model without the need for additional testing (Kucur & Sznitman, 2017; Kucur, Márquez-Neila, Abegg, & Sznitman, 2019). All of the methods mentioned previously involve training a parametric (or semi-parametric) model of the visual field pattern and either fitting the stimulus responses to this model or using this model to predict the most likely shape of the visual field. A second paradigm is one we term “trial oriented.” In this case, the result of even a single trial is propagated to other locations, as there is no need to wait for the full determination of a threshold before utilizing this information. Spatially weighted likelihoods in ZEST (SWeLZ) is an earlier trial-oriented attempt that updates likelihood functions based on a model of the correlation strength between different locations. However, its benefits in terms of reduced testing time were shown to be limited (Rubinstein, McKendrick, & Turpin, 2016). 
This article proposes a new trial-oriented approach called TORONTO (Trial-Oriented Reconstruction on Tree Optimization). TORONTO employs a nonparametric data-driven method to utilize the spatial correlation between test points for updating other locations after a single trial at one location. The approach is a multidimensional generalization of QUEST or ZEST in that measurements at one location are used to update not only the tested location but all other locations as well. The process can alternatively be thought of as a binary decision tree using the locations’ thresholds as features but employs customized decision boundaries similar to soft-margin support vector machines to allow for finite errors (Noble, 2006). TORONTO terminates significantly faster than traditional algorithms with better accuracy. Next, we will provide a detailed description of the algorithm and comparisons to other previously established methods using computer simulations. 
Materials and methods
Overview of approach
The TORONTO method can be applied to any psychometric test involving the determination of multiple thresholds when there is an existing database of representative thresholds. Figure 1 provides an illustrative example of the algorithm when applied to visual field testing. The algorithm operates by testing, at each step, the location with the highest measurement uncertainty. Each trial generates a binary decision, represented by the inequality signs in Figure 1, which then guides the sequential pruning of the tree using an existing database of threshold results. With each pruning step, certain conditions are imposed on the remaining locations, and this iterative process continues until the termination criteria are met. While Figure 1 depicts the subject's response in terms of binary outcomes (i.e., inequalities) for simplicity of illustration, the algorithm does not in fact assume a reliable subject response and instead tracks a likelihood function similar to other Bayesian algorithms. Further clarification of this matter is provided in the Methods below and in the Discussion. 
Figure 1.
 
An example of a simulated visual field test using TORONTO's trial-oriented reconstruction. The numerical values, shown in units of decibels (dB), represent the differential light sensitivity threshold at each of the 24-2 visual field locations. Following visual field convention, the higher numerical values represent better sensitivity. (A) The input, ground-truth visual field used as the simulated patient. (B–D) The information sampled from the first 10, 20, and 30 trials, represented as inequality signs or ranges. For example, “<17” denotes that there has been a stimulus at this location with intensity of 17 dB with a response of NO (did not see); 12–18 denotes that there is a 12 dB stimulus with response YES and an 18 dB stimulus with response NO. (E–G) The estimated/predicted thresholds after the first 10, 20, and 30 trials. The evolving pattern provides a “bird's-eye view” of the landscape of the visual field. As more and more trials are added, local details in the field are refined through the updates in the visual field. After just 30 trials, the estimate is already quite close to the defect pattern in the actual visual field shown in (A) (point-wise RMSE = 3.9 dB). Full testing continues until the termination criteria are reached (not shown). While this illustrative example shows the subject making consistent responses, the algorithm makes no such assumptions. Please see main text for more details.
Detailed approach
The algorithm consists of several steps, including initialization, stimulus placement, update, and termination. Each of these steps is described in the context of 24-2 visual field testing. 
Initialization
The TORONTO algorithm uses a reference data set  
\begin{equation}
T = \begin{bmatrix}
T_{1,1} & T_{1,2} & \cdots & T_{1,54}\\
T_{2,1} & T_{2,2} & \cdots & T_{2,54}\\
\vdots & \vdots & \ddots & \vdots \\
T_{N,1} & T_{N,2} & \cdots & T_{N,54}
\end{bmatrix}
\tag{1}
\end{equation}
that consists of N visual fields, each with 54 locations. Entry Ti,j represents the threshold at the jth of the 54 locations in the 24-2 pattern (j = 1, 2, …, 54) of the ith visual field (i = 1, 2, …, N) in the reference data set. The rows of the reference matrix need not be ordered and may in theory contain multiple identical rows. The order of the columns corresponds to the predefined order of the 24-2 locations. 
A probability matrix P with the same dimension as T is initialized with a uniform a priori probability mass, that is,  
\begin{equation}
{\rm Initialization:}\quad
P = \begin{bmatrix}
p\left(T_{1,1}\right) & p\left(T_{1,2}\right) & \cdots & p\left(T_{1,54}\right)\\
p\left(T_{2,1}\right) & p\left(T_{2,2}\right) & \cdots & p\left(T_{2,54}\right)\\
\vdots & \vdots & \ddots & \vdots \\
p\left(T_{N,1}\right) & p\left(T_{N,2}\right) & \cdots & p\left(T_{N,54}\right)
\end{bmatrix}
= \begin{bmatrix}
\frac{1}{N} & \frac{1}{N} & \cdots & \frac{1}{N}\\
\frac{1}{N} & \frac{1}{N} & \cdots & \frac{1}{N}\\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{N} & \frac{1}{N} & \cdots & \frac{1}{N}
\end{bmatrix}
\tag{2}
\end{equation}
 
Each entry in the initialized P describes the initial probability assigned to the threshold in T. By using a uniform distribution, the visual fields are weighted equally. This assumption is valid when the reference data set is representative of the testing data set. Since each column of T gives the ensemble of possible thresholds at a single location, the prior distribution used in classical one-dimensional QUEST/ZEST can be recovered by constructing the histogram from the threshold values at a single location (i.e., along a single column). However, the problem we wish to tackle here is the determination of all 54 thresholds. Since these values are assumed to be correlated, we do not collapse the data set and instead maintain a full matrix with P being updated in the Bayesian process. 
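As a concrete illustration, the initialization above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the reference thresholds here are randomly generated placeholders standing in for a real visual field data set, and all variable names are ours.

```python
import numpy as np

# Hypothetical reference data set: N visual fields x 54 locations (Equation 1).
# Real usage would load measured 24-2 thresholds; random integers in the
# 0-35 dB range are used here purely as placeholders.
rng = np.random.default_rng(0)
N = 200
T = rng.integers(0, 36, size=(N, 54)).astype(float)  # reference thresholds (dB)

# Uniform prior (Equation 2): every reference field is initially equally likely,
# so each column of P is a probability mass function over the N candidate
# thresholds at that location.
P = np.full((N, 54), 1.0 / N)
```

Collapsing a single column of T into a histogram recovers the one-dimensional QUEST/ZEST prior for that location; keeping the full matrix preserves the interlocation correlations.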
Stimulus placement
To determine at what intensity and which location the stimulus should be placed next, the mean and variance in each of the 54 columns are calculated, and the location with the highest uncertainty is chosen for testing with stimulus intensity equal to the calculated mean. 
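The placement rule above can be sketched as follows; `place_stimulus` is a hypothetical helper name, and `T` and `P` are assumed to be the reference and probability matrices defined in the initialization step.

```python
import numpy as np

def place_stimulus(T, P):
    """Pick the next test location and stimulus intensity: the location
    with the largest P-weighted variance, tested at its P-weighted mean."""
    mean = (P * T).sum(axis=0)               # weighted mean threshold per location
    var = (P * T ** 2).sum(axis=0) - mean ** 2  # weighted variance per location
    j = int(np.argmax(var))                  # most uncertain location
    return j, float(mean[j])                 # (location index, intensity in dB)
```

For example, a location whose candidate thresholds still disagree widely under P will be selected before one whose remaining candidates have converged.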
Update rule
A trial at location j with intensity x dB and response as r is denoted as {r, x}j. Watson and Pelli (1983) described the Bayesian update rule for QUEST thresholds (t) at a single tested location (j) as  
\begin{equation}
p\left( t \,\middle|\, \left\{ r,x \right\}_j \right) = \frac{p\left( \left\{ r,x \right\}_j \,\middle|\, t \right) p\left( t \right)}{p\left( \left\{ r,x \right\}_j \right)}
\tag{3}
\end{equation}
 
Next, we generalize the equation for multiple locations. The subscript k is introduced here to denote the location updated, which is not necessarily equal to j, to generalize the equation above and obtain  
\begin{equation}
p\left( t = T_{i,k} \,\middle|\, \left\{ r,x \right\}_j \right) = \frac{p\left( \left\{ r,x \right\}_j \,\middle|\, t = T_{i,k} \right) p\left( t = T_{i,k} \right)}{p\left( \left\{ r,x \right\}_j \right)}
\tag{4}
\end{equation}
 
The remaining task is to specify an appropriate likelihood function, p({r, x}j|Ti,k), that reflects the information gained from each trial. 
The likelihood term is a conditional probability. It can be derived from the probability of seeing function, denoted as Ψ, and the empirical data in T. Let Ψ(xt) = p({r = 1, x}|t) describe the probability of seeing a stimulus of log intensity x given that the true threshold is t. From here, the update rule is defined to be  
\begin{equation}
{\rm Update:}\quad
P_{i,k} = p\left( T_{i,k} \,\middle|\, \left\{ r,x \right\}_j \right) \propto
\begin{cases}
\Psi_{\epsilon_1}\!\left( x - T_{i,j} \right) p\left( T_{i,k} \right), & \text{if } r = 1 \wedge k = j\\
\left( 1 - \Psi_{\epsilon_1}\!\left( x - T_{i,j} \right) \right) p\left( T_{i,k} \right), & \text{if } r = 0 \wedge k = j\\
\Psi_{\epsilon_2}\!\left( x - T_{i,j} \right) p\left( T_{i,k} \right), & \text{if } r = 1 \wedge k \ne j\\
\left( 1 - \Psi_{\epsilon_2}\!\left( x - T_{i,j} \right) \right) p\left( T_{i,k} \right), & \text{if } r = 0 \wedge k \ne j
\end{cases}
\quad \forall \left( i,k \right) \in \left[ 1..N \right] \times \left[ 1..54 \right]
\tag{5}
\end{equation}
where the psychometric function Ψ is assumed to have a baseline false-positive/false-negative rate (ε1) when updating the location of testing (k = j). A different (and higher) false-positive/false-negative rate (ε2 > ε1) is used when updating other locations (kj), as illustrated in Figure 2. The higher ε2 (false-positive/false-negative rate) indicates a decreased confidence in updating the threshold probabilities in the nontested locations. The key aspect of TORONTO is the use of a soft psychometric function that accounts for the imperfections in the 54-dimensional joint probability function as inferred from the reference data set and is similar to the upper asymptote used in QUEST (Watson & Pelli, 1983). After the update, the values in the probability matrix P are normalized across each column to sum to 1. 
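A minimal Python sketch of this update rule follows, assuming the matrices `T` and `P` from the initialization step. The function names are ours; `psi` implements the complement of a cumulative Gaussian compressed into [ε, 1 − ε] so that ε acts as the baseline false-positive/false-negative rate, consistent with the soft psychometric function described above.

```python
import numpy as np
from math import erf, sqrt

SIGMA = 1.0              # psychometric slope in dB, as in the article
_erf = np.vectorize(erf)  # elementwise error function over numpy arrays

def psi(d, eps):
    """Probability of seeing a stimulus d = x - t dB relative to threshold t:
    complement of a cumulative Gaussian, with floor/ceiling error rate eps."""
    phi = 0.5 * (1.0 + _erf(np.asarray(d, float) / (SIGMA * sqrt(2.0))))
    return eps + (1.0 - 2.0 * eps) * (1.0 - phi)

def update(T, P, j, x, r, eps1=0.05, eps2=0.30):
    """One Bayesian update of every column of P after trial {r, x} at
    location j (Equation 5); r = 1 means the stimulus was seen.
    Note the likelihood always depends on the tested column T[:, j]."""
    d = x - T[:, j]
    for k in range(T.shape[1]):
        like = psi(d, eps1 if k == j else eps2)   # softer update off-location
        P[:, k] *= like if r == 1 else 1.0 - like
    P /= P.sum(axis=0, keepdims=True)             # renormalize each column
    return P
```

A "seen" response to a dim (high-dB) stimulus raises the probability of reference fields with high sensitivity at the tested location, and, more gently (via ε2), of the corresponding values at every other location.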
Figure 2.
 
Forms of psychometric functions used in Equation 5. When ε = 50%, the function becomes a flat line.
Termination rule and final estimate
The algorithm terminates when the uncertainties as quantified by the standard deviations of all columns of T weighted by P are less than the prespecified value of σterm, that is,  
\begin{equation}
\sqrt{ {\rm Var}_{P_j}\!\left[ T_j \right] } < \sigma_{\rm term} \quad \forall\, j \in \left[ 1,54 \right]
\tag{6}
\end{equation}
where  
\begin{equation}
{\rm Var}_{P_j}\!\left[ T_j \right] = \sum_{i = 1}^{N} P_{i,j} T_{i,j}^2 - \left( \sum_{i = 1}^{N} P_{i,j} T_{i,j} \right)^2
\tag{7}
\end{equation}
is the variance of the jth column of T under the distribution described by the corresponding column in P. This is identical to computing the variance in the probability mass function of a single location in classic QUEST/ZEST. The final estimate of each threshold is returned as the mean of T weighted by P, that is,  
\begin{equation}
\hat{t}_j = \sum_{i = 1}^{N} P_{i,j} T_{i,j}
\tag{8}
\end{equation}
 
To guarantee convergence, we maintain an additional ZEST algorithm (without seeding), which keeps track of the TORONTO trials running in parallel and independently for each location. In the case where the ZEST algorithm terminates first, the estimate from ZEST is the one that is returned as the final value. The reason for running ZEST concurrently is discussed later. 
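The termination check and final estimate can be sketched as below, assuming the `T` and `P` matrices from the steps above. This is an illustrative sketch only; the parallel fallback ZEST run described in the text is not included.

```python
import numpy as np

def terminated(T, P, sigma_term=1.5):
    """Equations 6-7: stop once every location's P-weighted standard
    deviation has fallen below sigma_term (1.5 dB by convention)."""
    mean = (P * T).sum(axis=0)
    var = (P * T ** 2).sum(axis=0) - mean ** 2
    return bool(np.all(np.sqrt(np.maximum(var, 0.0)) < sigma_term))

def final_estimate(T, P):
    """Equation 8: the P-weighted mean threshold at each location."""
    return (P * T).sum(axis=0)
```

This is the same mean/variance computation used for stimulus placement, applied as a stopping rule across all 54 columns at once.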
Implementation of TORONTO
For TORONTO, Ψ is selected as the complement of a cumulative Gaussian function with 1.0 dB standard deviation, the same as in earlier work (Rubinstein et al., 2016). ε1 is chosen to be 5% (following Turpin et al., 2003), and ε2 is set to 30%. The effect of varying ε2 is shown and discussed later. The shapes of the psychometric functions are shown in Figure 2. 
Examples of how TORONTO operates with actual data are shown in Appendices A and B. 
Implementation of simulation
To evaluate the performance of TORONTO, we compare it against other popular strategies. The first is ZEST with quadrant seeding: Testing was initiated from the central points of the four quadrants, and these estimates were subsequently utilized to generate offsets for the priors of the remaining locations within each quadrant (Turpin et al., 2003). Second, SORS was implemented using a linear regression model with batched testing of four locations at a time. The first 36 locations were tested using the ZEST method while the remaining 18 locations were reconstructed using the trained model (Hoehn et al., 2019; Kucur & Sznitman, 2017). Third, the SWeLZ algorithm was implemented using the “correlation model” and “all interconnected” spatial graphs trained using the Rotterdam data set (Rubinstein et al., 2016). For all Bayesian methods, σterm was set to the conventional 1.5 dB (Rubinstein et al., 2016; Turpin et al., 2003), but we also explored values of 2.0, 2.5, and 3.0 dB to generate different speed–accuracy trade-offs. Lastly, the classic 4-2 double staircase method (similar to full threshold) was implemented with 4 dB steps up to the first reversal, followed by 2 dB steps until termination. No retesting was implemented, but quadrant seeding was used (Turpin et al., 2003). 
The 278 eyes’ 24-2 visual fields in the Rotterdam longitudinal glaucomatous visual field data set were used in the simulations with 10-fold cross validation (Bryan et al., 2013). To evaluate each of the methods fairly and systematically (i.e., all methods take into account the training data), the priors were chosen from the empirically derived histograms generated from each training fold of data, and the staircase algorithm was started at the normal hill of vision defined by the mode of the initial histogram. 
In all cases, the simulated responder follows the equation described in the SWeLZ simulation with the psychometric function width calculated as a function of the true sensitivity (Rubinstein et al., 2016). To evaluate the accuracy of the visual field test estimate, the point-wise root mean square error (RMSE), as given by  
\begin{equation}
\text{Point-wise RMSE} = \sqrt{ \frac{1}{54} \sum_{i = 1}^{54} \left( \hat{t}_i - t_i \right)^2 }
\tag{9}
\end{equation}
was calculated between the estimated visual field \({\boldsymbol{\hat{t}}}\) and the true input visual field t. 
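The point-wise RMSE of Equation 9 is straightforward to compute; a short sketch (function name ours, any vector length accepted, with 54 locations in the 24-2 case):

```python
import numpy as np

def pointwise_rmse(t_hat, t_true):
    """Point-wise RMSE (Equation 9) between an estimated visual field
    t_hat and the true field t_true, given as same-length vectors in dB."""
    t_hat = np.asarray(t_hat, dtype=float)
    t_true = np.asarray(t_true, dtype=float)
    return float(np.sqrt(np.mean((t_hat - t_true) ** 2)))
```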
Results
Next we present the results comparing the accuracy and test duration of TORONTO against other algorithms. 
Figure 3 shows a comparison between TORONTO, 4-2 staircase, ZEST (with quadrant seeding), SORS, and SWeLZ (with “correlation model” and “all interconnected” spatial graphs). Results are shown for three levels of simulated reliability at 3%, 15%, and 30% false-positive (FP) and false-negative (FN) error (rows) and separated by severity of visual field defect (columns). With the exception of the staircase algorithm, the termination criterion σterm was varied between 1.5, 2.0, 2.5, and 3.0 dB to generate four operating points with different test duration versus error trade-offs. The results show that TORONTO is much faster and has lower errors in visual field estimates than all other existing algorithms. Notably, TORONTO returns much more accurate results even in very unreliable FP = FN = 30% conditions when compared to the other strategies (e.g., 2.8 dB of point-wise RMSE for mild/healthy visual fields while other algorithms have ≥ 4.7 dB of RMSE). 
Figure 3.
 
Error versus test duration plot of TORONTO compared to 4-2 staircase, ZEST, and SORS testing 36 locations in 10-fold cross validation of the Rotterdam data set. Each Bayesian strategy was simulated with four termination levels: 1.5 (slowest but lowest error), 2.0, 2.5, and 3.0 (fastest but highest error) dB. Data points show the median point-wise RMSE and number of trials. Data points closer to the origin (bottom left) of the axes are better (faster and more accurate). Three reliability conditions are simulated: (top) FP = FN = 3%, (middle) FP = FN = 15%, and (bottom) FP = FN = 30%. The left column shows mild eyes, the middle column shows moderate eyes, and the right column shows severe eyes.
In terms of disease severity, for mild to healthy visual fields (mean deviation [MD] > −6 dB), TORONTO terminates faster. Under reliable conditions, ZEST takes about three times longer to reach the same RMSE, and under unreliable conditions, ZEST shows greater RMSE. TORONTO also outperforms SWeLZ, the second-best algorithm, terminating nearly twice as fast while maintaining higher accuracy. In eyes with moderate to severe disease (MD < −6 dB), TORONTO is consistently faster, albeit with a smaller improvement than in the mild case. These results demonstrate that TORONTO is better able to exploit spatial patterns in defective visual fields, resulting in more efficient and accurate estimates. 
The performance of TORONTO versus ZEST is further compared on an individual-eye basis. Figure 4 shows the paired differences in the point-wise visual field estimates’ RMSE and the paired differences between the number of trials when testing the same eye using both algorithms. Data points in the bottom left quadrant indicate that TORONTO is faster and has lower errors than ZEST. Under the reliable condition (FP = FN = 3%), TORONTO generally achieves the same error as ZEST (−0.2, +0.1, +0.2 dB differences on average for point-wise RMSE in mild, moderate, and severe eyes) but is much faster (−190, −133, −85 trials on average). In the FP = FN = 15% and 30% cases, TORONTO performs significantly better than ZEST in terms of error and duration, with almost all data points in the bottom left quadrant. 
Figure 4.
 
Scatterplots of paired (per-eye) difference between the performance of TORONTO with σterm = 1.5 dB and ZEST with σterm = 1.5 dB in 10-fold cross-validation of the Rotterdam data set. Data points closer to the bottom left indicate TORONTO has a lower error and lower test duration. For almost all conditions and eyes, TORONTO is faster and at least as accurate as ZEST.
Relative to ZEST, TORONTO introduces one additional parameter ε2, which governs the degree of confidence when updating the probability functions from trials conducted at other locations. This value was set to 30% in Figures 3 and 4. Next, we explored the impact of varying this parameter in Figure 5. When ε2 = 50%, the method reduces to the special case of one-dimensional ZEST (without seeding), whereby a trial at one location has no effect on any other location. At the other extreme, ε2 = ε1 = 5% assumes that the reference database completely describes the relationship between one location and all others. This can be likened to a binary decision tree that expedites termination but sacrifices accuracy when the tested visual field does not match the values in the database. Figure 5 shows that values between 20% and 40% yield the best results with the Rotterdam data set. 
Figure 5.
 
Error versus test duration plot of TORONTO (σterm = 1.5 dB) with different values of ε2, the error rate assumed in the psychometric function for updating other nontested locations in the field. The data points shown in each plot correspond, from left to right, to ε2 = 5%, 10%, 20%, 30% (default setting used above), 40%, and 50%. A lower ε2 value represents a more aggressive algorithm (faster but less accurate), relying more on the reference data set, and is more similar to a binary decision tree. A higher ε2 represents a less aggressive algorithm with smaller sensitivity to spatial patterns in the reference data set and is more similar to independent ZEST testing at each location. The algorithm performs best with ε2 between 20% and 40%.
Discussion
In this article, we propose an entirely new trial-oriented paradigm for multiple-threshold estimation that we refer to as TORONTO. This data-driven method uses the correlations in a reference data set to perform a multidimensional adaptive Bayesian updating process to determine the thresholds. Results with 24-2 visual field simulations demonstrate that the TORONTO algorithm outperforms existing methods in terms of speed and accuracy across all conditions. Specifically, in eyes with mild defects, the TORONTO algorithm is more than twice as fast as ZEST. Such patients often form the largest population in clinical and screening settings, so faster testing can result in significant time savings. Furthermore, TORONTO performs well even under extremely noisy conditions (FP = 30% and FN = 30%) where other existing methods fail. By using point-wise correlations within the visual field, we also found that TORONTO keeps the thresholds and MD score unbiased under all conditions, unlike the ZEST and staircase methods, which tend to regress toward their prior assumptions under noisy conditions. 
Table 1 presents a comprehensive summary comparing the performance of common visual field algorithms. The staircase method with fixed step size (commonly referred to as full threshold) was historically the first adaptive threshold estimation procedure. Bayesian methods such as QUEST and ZEST later enabled real-time calculation of probability functions and provided optimal estimates for single thresholds. Until recently, however, thresholds were tested under the assumption of statistical independence: any paradigm requiring thresholds at multiple locations estimated them individually and independently. The spatial correlations present in visual fields suggest that efficiency can be improved by exploiting this structure. Previous efforts have incorporated spatial patterns through threshold-oriented seeding (Turpin et al., 2003), reconstruction (Kucur & Sznitman, 2017; Kucur et al., 2019), or spatial graphs (Rubinstein et al., 2016) derived from heuristics or statistical correlations between locations, with varying degrees of success. 
Table 1.
 
Brief overview of the history of common visual field testing algorithms. Full Threshold, SITA Standard, and SITA Fast (italicized) are proprietary algorithms used by the Humphrey Field Analyzer; their performance is estimated from available information. The other algorithms were simulated in this article, with the median of results shown in parentheses. "Reliable" refers to FP = FN = 3% and "unreliable" to FP = FN = 15%.
In contrast, TORONTO adopts a nonparametric, trial-oriented approach that bypasses the intermediate step of fully determining each threshold. Instead, it extends the one-dimensional adaptive Bayesian process to higher dimensions, updating all locations with every single trial rather than only the tested site. A similar concept was explored in SWeLZ, which employed a spatial graph based on correlations among visual field locations or anatomical patterns. SWeLZ converted these patterns into a graph with spatial weights between 0 and 1; locations with higher weights are updated with a steeper likelihood function. Unlike SWeLZ, TORONTO is nonparametric and derives the likelihood function from the empirical data, directly capturing the structure within the visual field reference data without fitting the data to a model. 
It is worth noting that the likelihood functions in TORONTO are determined by both Ψ and the data in the reference data set. This means that although the same Ψε2 is used to update all visual field locations other than the currently tested location, the resulting likelihood functions depend on the degree of correlation within the reference data. When the reference data at the update location are highly correlated with the data at the test location, the likelihood function (e.g., approximated by the orange scatter points in Figures 8 and 9 in Appendices A and B) assumes a steep slope within a narrow transition zone. Conversely, when the reference data at the update location are less correlated with the data at the test location, the likelihood function assumes a wider transition region. This data-driven, nonparametric approach makes TORONTO both effective and simpler to implement, because it requires no explicit modeling of the correlation patterns in the visual field. 
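This dependence on correlation can be made concrete with a small numerical sketch (Python with NumPy). The reference data, bin width, and ε value below are illustrative assumptions, not the study's actual parameters: the induced likelihood at an update location is taken as the mean trial likelihood among reference entries whose update-location threshold falls in each bin, and it comes out steep when the two locations are correlated and nearly flat when they are not.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(stim_db, threshold_db, eps):
    # Soft psychometric function: P(YES) for a stimulus given a threshold,
    # a stand-in for a step function softened by error rate eps.
    return np.where(stim_db <= threshold_db, 1.0 - eps, eps)

# Toy reference thresholds at a tested location j and an update location k,
# once strongly correlated with j and once independent of it.
n = 5000
t_j = rng.uniform(10, 35, n)
t_k_corr = t_j + rng.normal(0, 1.5, n)
t_k_indep = rng.uniform(10, 35, n)

def effective_likelihood(stim, seen, t_tested, t_update, eps=0.3):
    # Induced likelihood at the update location: mean trial likelihood of
    # the reference entries in each update-threshold bin (conceptually the
    # orange scatter points of Figures 8 and 9).
    like = psi(stim, t_tested, eps)
    if not seen:
        like = 1.0 - like
    edges = np.arange(10, 36, 2.0)
    idx = np.digitize(t_update, edges)
    return np.array([like[idx == b].mean() for b in range(1, len(edges))])

L_corr = effective_likelihood(22.0, True, t_j, t_k_corr)
L_indep = effective_likelihood(22.0, True, t_j, t_k_indep)
# Correlated: a steep transition spanning nearly [eps, 1 - eps].
# Independent: flat, near the overall YES probability.
print(L_corr.round(2))
print(L_indep.round(2))
```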
TORONTO can also be viewed as a decision tree algorithm. When the psychometric functions Ψε1 and Ψε2 are both formulated as step functions with hard decision boundaries (i.e., taking values of either 0 or 1), TORONTO becomes a decision tree classifier. The stimulus placement rule mirrors a split in the tree, with the split occurring at the stimulus location and the threshold serving as the splitting variable. Initially, the tree's training data set assigns equal probability to each entry (Breiman, Friedman, Olshen, & Stone, 2017). After each split (stimulus), the possible outcomes are pruned, leaving only the entries consistent with the response; this is equivalent to multiplying the probability by a likelihood function that is either 0 or 1. Through sequential binary splits, the tree determines which data set entry is most consistent with the visual field test data. In a real visual field test, adjustments are needed to reflect the uncertainty in the process. Most crucially, branching errors must be accounted for by using the soft psychometric functions shown in Figure 2. In this case, however, the tree no longer reaches a single leaf node with a definitive probability of 0 or 1 but attains infinite depth. A termination rule becomes essential (equivalent to limiting the tree depth in a decision tree regressor), and the split/stimulus placements are calculated on demand rather than pretrained. Fortunately, existing psychometric procedures can be directly applied to address these two issues (Breiman et al., 2017). 
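The degenerate hard-boundary case can be sketched in a few lines. The toy matrix T below is illustrative, and we assume the sign convention that a YES response at a given stimulus intensity implies the threshold is at least that intensity; each trial then simply zeroes out the inconsistent reference entries.

```python
import numpy as np

# Toy reference data set T: rows are example fields, columns are 3 locations.
T = np.array([
    [30, 31, 29],
    [28, 27, 26],
    [22, 24, 23],
    [15, 17, 16],
    [ 5,  8,  6],
], dtype=float)

# With hard (0/1) psychometric functions, the Bayesian update degenerates
# into pruning: entries inconsistent with the trial outcome get weight 0.
def hard_update(weights, T, loc, stim_db, seen):
    consistent = (T[:, loc] >= stim_db) if seen else (T[:, loc] < stim_db)
    return weights * consistent

w = np.ones(len(T))
w = hard_update(w, T, loc=0, stim_db=20, seen=True)   # keeps rows 0-2
w = hard_update(w, T, loc=1, stim_db=26, seen=False)  # keeps row 2 only
print(w)  # -> [0. 0. 1. 0. 0.]
```

Each surviving entry corresponds to a leaf still reachable after the binary splits performed so far.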
One potential concern with the TORONTO approach is that the visual field of the tested subject may not resemble any existing entry in the reference database, which could prevent the algorithm from converging or cause it to return erroneous results. TORONTO addresses this problem in two ways. First, recognizing that the likelihood value obtained from the empirical conditional probability of the reference database can be inaccurate, TORONTO allows the psychometric function to be modified to account for false positives and false negatives (see Figure 5). Second, in cases where the tested eye deviates from the entries in the reference data, an independent ZEST algorithm (which does not use spatial correlations) runs in tandem with TORONTO. The one-dimensional ZEST thus serves as a backup and permits a graceful fallback whenever termination is difficult or convergence is slow. Together, the selection of ε2 and the parallel ZEST ensure that TORONTO continues to work even with databases of limited size. 
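The fallback step might look like the following sketch. This is a hypothetical rule of our own devising, not the paper's exact criterion: per location, keep the TORONTO estimate when its posterior spread has converged below a tolerance, and otherwise fall back to the parallel ZEST estimate.

```python
def choose_estimates(toronto_est, toronto_sd, zest_est, tol_db=1.5):
    # Hypothetical per-location fallback: trust TORONTO where its posterior
    # standard deviation is below tol_db; otherwise use the independent
    # ZEST estimate, which does not rely on the reference data set.
    return [t if sd <= tol_db else z
            for t, sd, z in zip(toronto_est, toronto_sd, zest_est)]

# An eye whose third location does not resemble the reference data:
est = choose_estimates([24.1, 26.1, 9.0], [0.8, 1.2, 4.0], [23.5, 25.9, 14.2])
print(est)  # -> [24.1, 26.1, 14.2]
```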
Calculating the testing time per eye requires tabulating the total number of trials needed to estimate all 54 thresholds in 24-2 visual field testing. This total is the product of two factors: (1) the average number of trials per location and (2) the number of locations. When the second factor is fixed at 54, the only possible improvement is to reduce the first (i.e., the expected number of trials per threshold). The goal has therefore traditionally been to develop more efficient single-threshold algorithms, such as the move from fixed-step-size staircases to adaptive-step-size Bayesian algorithms and the development of SITA Fast from SITA Standard (Bengtsson & Heijl, 1998). In recent years, threshold-oriented reconstruction algorithms such as SORS (Kucur & Sznitman, 2017; Kucur, Márquez-Neila, Abegg, & Sznitman, 2019) introduced the idea that only a portion of the 24-2 visual field needs to be estimated while the rest can be reconstructed using a machine learning model, thereby reducing the second factor (the number of tested locations). However, this approach is not without its own concerns. SORS tests only a subset of the locations while reconstructing the values of the remainder, effectively reducing the full 24-2 pattern to a subset, and its testing sequence is predetermined by a linear regression model trained on an existing database, so it does not adapt to the eye being tested. These are limitations of the SORS method. 
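The trade-off between the two factors can be put in numbers (all values below are purely illustrative, not measurements from this study):

```python
# Total test length as the product of two factors: trials per location
# and number of locations tested.
def total_trials(trials_per_location, n_locations):
    return trials_per_location * n_locations

baseline = total_trials(5.0, 54)  # per-location strategy on the full 24-2
factor1 = total_trials(3.0, 54)   # improve (1): a faster single-threshold algorithm
factor2 = total_trials(5.0, 30)   # improve (2): test a subset, reconstruct the rest
print(baseline, factor1, factor2)  # -> 270.0 162.0 150.0
```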
While TORONTO can be thought of as building on the idea of reconstruction, it avoids these limitations by using no predetermined subset or testing sequence. With its trial-oriented approach, TORONTO is entirely data-driven, and each trial updates the entire visual field. This also makes TORONTO extremely flexible: it is not limited to the 24-2 pattern and can be applied to any visual field pattern, or to any other psychometric test with an existing database of thresholds. For future work, we aim to validate the algorithm in a patient population with a wide range of glaucomatous and nonglaucomatous visual field defect patterns, as well as in other psychometric applications requiring the assessment of multiple thresholds of the same or different modalities. 
Acknowledgments
Supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (458039), an NSERC CGS-D Scholarship, and a Vision Science Research Program (VSRP) Scholarship. The authors of this manuscript, together with the University of Toronto, have submitted a patent application for TORONTO. 
Commercial relationships: none. 
Corresponding author: Willy Wong. 
Email: willy.wong@utoronto.ca. 
Address: Dept. of Electrical and Computer Engineering, University of Toronto, 10 King's College Rd., Toronto, ON M5S3G4, Canada. 
References
Bak, J. H., & Pillow, J. W. (2018). Adaptive stimulus selection for multi-alternative psychometric functions with lapses. Journal of Vision, 18(12), 1–25, https://doi.org/10.1167/18.12.4.
Bengtsson, B., & Heijl, A. (1998). SITA Fast, a new rapid perimetric threshold test. Description of methods and evaluation in patients with manifest and suspect glaucoma. Acta Ophthalmologica Scandinavica, 76(4), 431–437, https://doi.org/10.1034/j.1600-0420.1998.760408.x. [PubMed]
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. New York: Routledge, https://doi.org/10.1201/9781315139470.
Bryan, S. R., Vermeer, K. A., Eilers, P. H. C., Lemij, H. G., & Lesaffre, E. M. E. H. (2013). Robust and censored modeling and prediction of progression in glaucomatous visual fields. Investigative Ophthalmology & Visual Science, 54(10), 6694, https://doi.org/10.1167/iovs.12-11185.
Chesley, B., & Barbour, D. L. (2020). Visual field estimation by probabilistic classification. IEEE Journal of Biomedical and Health Informatics, 24(12), 3499–3506, https://doi.org/10.1109/JBHI.2020.2999567. [PubMed]
Demeester, K., van Wieringen, A., Hendrickx, J., Topsakal, V., Fransen, E., van Laer, L., ... Van de Heyning, P. (2009). Audiometric shape and presbycusis. International Journal of Audiology, 48(4), 222–232, https://doi.org/10.1080/14992020802441799. [PubMed]
Garway-Heath, D. F., Poinoosawmy, D., Fitzke, F. W., & Hitchings, R. A. (2000). Mapping the visual field to the optic disc in normal tension glaucoma eyes. Ophthalmology, 107(10), 1809–1815, https://doi.org/10.1016/s0161-6420(00)00284-0. [PubMed]
Hoehn, R., Häckel, S., Kucur, S., Iliev, M. E., Abegg, M., & Sznitman, R. (2019). Evaluation of sequentially optimized reconstruction strategy in visual field testing in normal subjects and glaucoma patients. Investigative Ophthalmology & Visual Science, 60(9), 2477.
King-Smith, P. E., Grigsby, S. S., Vingrys, A. J., Benes, S. C., & Supowit, A. (1994). Efficient and unbiased modifications of the QUEST threshold method: Theory, simulations, experimental evaluation and practical implementation. Vision Research, 34(7), 885–912, https://doi.org/10.1016/0042-6989(94)90039-6. [PubMed]
Kontsevich, L. L., & Tyler, C. W. (1999). Bayesian adaptive estimation of psychometric slope and threshold. Vision Research, 39(16), 2729–2737, https://doi.org/10.1016/S0042-6989(98)00285-5.
Kucur, Ş. S., Márquez-Neila, P., Abegg, M., & Sznitman, R. (2019). Patient-attentive sequential strategy for perimetry-based visual field acquisition. Medical Image Analysis, 54, 179–192, https://doi.org/10.1016/j.media.2019.03.002. [PubMed]
Kucur, Ş. S., & Sznitman, R. (2017). Sequentially optimized reconstruction strategy: A meta-strategy for perimetry testing. PLoS ONE, 12(10), e0185049, https://doi.org/10.1371/journal.pone.0185049. [PubMed]
Lesmes, L. A., Lu, Z.-L., Baek, J., Tran, N., Dosher, B. A., & Albright, T. D. (2015). Developing Bayesian adaptive methods for estimating sensitivity thresholds (d′) in Yes-No and forced-choice tasks. Frontiers in Psychology, 6, 1070, https://doi.org/10.3389/fpsyg.2015.01070. [PubMed]
Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12), 1565–1567, https://doi.org/10.1038/nbt1206-1565. [PubMed]
Rubinstein, N. J., McKendrick, A. M., & Turpin, A. (2016). Incorporating spatial models in visual field test procedures. Translational Vision Science & Technology, 5(2), 7, https://doi.org/10.1167/tvst.5.2.7. [PubMed]
Song, X. D., Wallace, B. M., Gardner, J. R., Ledbetter, N. M., Weinberger, K. Q., & Barbour, D. L. (2015). Fast, continuous audiogram estimation using machine learning. Ear & Hearing, 36(6), e326–e335, https://doi.org/10.1097/AUD.0000000000000186.
Turpin, A., McKendrick, A. M., Johnson, C. A., & Vingrys, A. J. (2003). Properties of perimetric threshold estimates from full threshold, ZEST, and SITA-like strategies, as determined by computer simulation. Investigative Ophthalmology & Visual Science, 44(11), 4787, https://doi.org/10.1167/iovs.03-0023.
Watson, A. B. (2017). QUEST+: A general multidimensional Bayesian adaptive psychometric method. Journal of Vision, 17(3), 10, https://doi.org/10.1167/17.3.10. [PubMed]
Watson, A. B., & Pelli, D. G. (1983). Quest: A Bayesian adaptive psychometric method. Perception & Psychophysics, 33(2), 113–120, https://doi.org/10.3758/BF03202828. [PubMed]
Xu, P., Lesmes, L. A., Yu, D., & Lu, Z. L. (2019). A novel Bayesian adaptive method for mapping the visual field. Journal of Vision, 19(14), 1–32, https://doi.org/10.1167/19.14.16. [PubMed]
Appendix A: An example of testing at three correlated locations with reliable responses
We present a pared-down example illustrating the operation of TORONTO restricted to three locations in the nasal near-periphery of the 24-2 visual field pattern: 18 (−27°, +3°), 19 (−21°, +3°), and 20 (−15°, +3°). A tolerance of σterm = 2.5 dB was chosen for rapid termination. TORONTO requires a reference data set T, illustrated in Figure 6. To make visualization and comprehension easier, we have sorted the rows of T, although this is not usually done with TORONTO. Lighter values in the matrix indicate better sensitivity. Despite some differences, locations 18, 19, and 20 share similar threshold values: when location 18 has high sensitivity, locations 19 and 20 also tend to show high sensitivity, and vice versa. Figure 7 shows the bivariate probability mass between locations 18 and 19 and between locations 18 and 20. Again, these plots indicate that the threshold sensitivity values are correlated. 
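A toy stand-in for such a reference set can be generated by giving each simulated eye a shared overall sensitivity level. The row count of 278 matches this example, but the generative model and noise level below are our own illustrative assumptions, not the actual data:

```python
import numpy as np

rng = np.random.default_rng(1)

n_eyes = 278                       # row count used in this example
base = rng.uniform(5, 32, n_eyes)  # overall per-eye sensitivity (dB)
# Columns stand in for locations 18, 19, and 20; correlation between
# locations is induced by the shared per-eye base level.
T = np.clip(base[:, None] + rng.normal(0.0, 2.0, (n_eyes, 3)), 0.0, 35.0)

# Thresholds at neighboring locations come out strongly correlated:
r_18_19 = np.corrcoef(T[:, 0], T[:, 1])[0, 1]
r_18_20 = np.corrcoef(T[:, 0], T[:, 2])[0, 1]
print(round(r_18_19, 2), round(r_18_20, 2))  # both strongly positive
```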
Figure 6.
 
Visualization of T, the reference data set matrix, which consists of three locations (18, 19, and 20 of the 24-2 pattern) as described in the main text. Lighter values indicate better sensitivity. While in general the reference data set can be left unsorted, in this case, to aid in visualization and interpretation of P, reference examples with better sensitivity are generally sorted toward the top and ones with worse sensitivity are generally sorted toward the bottom, resulting in the pattern shown.
Figure 7.
 
(A) Plot of joint probability density between locations 18 and 19. (B) Plot of joint probability density between locations 18 and 20. Darker color represents higher joint density. Both density plots indicate that these locations are correlated. The marginal histograms of the thresholds at each location are plotted at the top and to the right of each plot.
The true thresholds for locations 18, 19, and 20 were set to 23, 25, and 27 dB, respectively, with no false-positive or false-negative responses (an example with false responses is shown in Appendix B). After five trials, TORONTO's output for the three locations was 24.1, 26.1, and 28.0 dB. A point-wise ZEST routine with the same termination criteria took eight trials (i.e., a 60% longer test) and produced estimates of similar accuracy: 23.2, 25.8, and 28.1 dB. 
TORONTO iterates the Bayesian adaptive procedure on P, which contains the probabilities assigned to the threshold values in T. Figure 8 shows the evolution of P tracked by the TORONTO algorithm (left) together with the posterior probability mass functions (PMFs) after each trial (right). The PMF is calculated as a weighted histogram of the values in T. The orange dots overlaid on the PMFs show the likelihood functions used to update P after each trial and provide a visualization of TORONTO's update procedure, akin to the classic ZEST Bayesian update illustrated in Figure 1 of King-Smith et al. (1994). In this case, however, the two nontested locations are updated as well. Their likelihood functions have the same general shape as the likelihood function at the tested location, but because there is more uncertainty, the orange dots scatter near the threshold and the slope is shallower. 
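The update itself is compact. The sketch below is our own minimal implementation, using an illustrative correlated reference set and the sign convention that a YES at x dB implies a threshold of at least x dB: each trial multiplies P by the likelihood evaluated at the tested column of T, and the PMF at any location is then a weighted histogram of the corresponding column. Note how a single trial at one location also shifts the posterior at a nontested location.

```python
import numpy as np

rng = np.random.default_rng(2)

def psi(stim_db, thresholds, eps):
    # P(YES): step psychometric function softened by an error rate eps.
    return np.where(stim_db <= thresholds, 1.0 - eps, eps)

def update(P, T, loc, stim_db, seen, eps=0.03):
    # One trial's Bayesian update of the weights P over the rows of T.
    like = psi(stim_db, T[:, loc], eps)
    if not seen:
        like = 1.0 - like
    P = P * like
    return P / P.sum()

def pmf(P, T, loc, edges):
    # Posterior PMF at any location: a weighted histogram of T's column.
    hist, _ = np.histogram(T[:, loc], bins=edges, weights=P)
    return hist

# Illustrative correlated reference set: 278 rows (as in this example),
# 3 locations, correlation induced by a shared per-eye sensitivity.
n = 278
base = rng.uniform(5, 32, n)
T = np.clip(base[:, None] + rng.normal(0.0, 2.0, (n, 3)), 0.0, 35.0)

P = np.full(n, 1.0 / n)
P = update(P, T, loc=2, stim_db=27.0, seen=True)  # one YES at location 2

prior_mean = T[:, 0].mean()             # location 0, before any trial
post_mean = float((P * T[:, 0]).sum())  # location 0, after the trial
post_pmf = pmf(P, T, loc=0, edges=np.arange(0.0, 36.0, 5.0))
print(round(prior_mean, 1), round(post_mean, 1))  # the nontested location shifts up
```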
Figure 8.
 
Step-by-step visualization of the TORONTO algorithm. Left column is the probability matrix (P) associated with the thresholds contained in reference data set matrix (T; see Figure 6). Each row in P describes the probability weights of the corresponding row in T (e.g., the first row in P describes the weights associated with the first row in T) after each trial, with a total of 278 rows in T in this example. Right column visualizes the threshold distribution in histogram form (blue), as well as the likelihood function used for updating (orange).
In this particular simulation, the first stimulus was presented at location 20. Following the response, all three locations were updated using the likelihood function (orange dots). Notably, testing at location 20 also improved the threshold estimates at locations 18 and 19. Proceeding step by step, the algorithm sampled all three locations and refined the estimates at all of them, as is evident from the increasing contrast in the three columns of P. As the test progresses, the weights assigned to the top portion of P (which corresponds to the top portion of T) increase, and the PMFs converge toward the true thresholds. 
Increasing the number of correlated locations in TORONTO results in even faster convergence and more accurate estimates. When two additional neighboring 24-2 locations are added (10: (−21°, +6°) and 11: (−15°, +6°)) and with ground truth set to 25, 27, 23, 25, and 27 dB for the five locations, TORONTO took seven trials to estimate the thresholds to be 25.5, 26.8, 24.2, 25.6, and 28.0 dB. In comparison, an equivalent ZEST procedure took 12 trials (71% longer) to yield estimates of 25.1, 26.8, 23.2, 25.8, and 28.1 dB. 
Appendix B: An example of testing three correlated locations with false-negative responses
TORONTO demonstrates robustness against errors by using the intrinsic correlation within the reference data. When there is an erroneous response at a particular location, TORONTO is less susceptible to its impact than ZEST due to its capacity to cross-reference information with correct responses from other locations. 
This advantage can be demonstrated by repeating the same three-location experiment in Appendix A at locations 18, 19, and 20 but introducing two false negatives at location 18 (not seeing stimulus at 12 dB and 22 dB). Figure 9 illustrates this scenario. 
Figure 9.
 
Step-by-step visualization of the TORONTO algorithm continuing from the example of Appendix A but with two false-negative trials at location 18. Left: Probability matrix (P) for thresholds in the reference data set (T). Right: Histogram of thresholds with likelihood function (orange) for updating P. See main text for more details.
TORONTO converges to values of 20.8, 25.6, and 27.5 dB after nine trials (true thresholds: 23, 25, and 27 dB). Compared to the reliable condition in Appendix A (five trials), the increase to nine trials reflects both the additional data collected and the further refinement of the estimates. Even when location 18 is not directly tested on a given trial (e.g., trial 9), the algorithm uses the correlations captured in the reference data set matrix T to refine its estimate for location 18. This robustness mitigates the influence of the false negatives on the lower tail of the probability distribution and yields an estimate closer to the true threshold. ZEST, under the same condition with two false negatives at location 18 (at 12 dB and 22 dB), requires one additional trial (10 in total) to achieve comparable accuracy (20.5, 25.8, and 28.1 dB). Without this additional trial, ZEST's estimate for location 18 falls to 17.9 dB. 
Figure 1.
 
An example of a simulated visual field test using TORONTO's trial-oriented reconstruction. The numerical values, shown in units of decibels (dB), represent the differential light sensitivity threshold at each of the 24-2 visual field locations. Following visual field convention, higher numerical values represent better sensitivity. (A) The input, ground-truth visual field used as the simulated patient. (B–D) The information sampled from the first 10, 20, and 30 trials, represented as inequality signs or ranges. For example, "<17" denotes that a stimulus of 17 dB was presented at this location with a response of NO (did not see); "12–18" denotes a 12 dB stimulus with response YES and an 18 dB stimulus with response NO. (E–G) The estimated/predicted thresholds after the first 10, 20, and 30 trials. The evolving pattern provides a "bird's-eye view" of the landscape of the visual field. As more trials are added, local details in the field are refined through the updates in the visual field. After just 30 trials, the estimate is already quite close to the defect pattern of the actual visual field shown in (A) (point-wise RMSE = 3.9 dB). Full testing continues until the termination criteria are reached (not shown). While this illustrative example shows the subject making consistent responses, the algorithm makes no such assumptions. Please see main text for more details.
Figure 2.
 
Forms of psychometric functions used in Equation 5. When ε = 50%, the function becomes a flat line.
Figure 3.
 
Error versus test duration plot of TORONTO compared to 4-2 staircase, ZEST, and SORS testing 36 locations in 10-fold cross validation of the Rotterdam data set. Each Bayesian strategy was simulated with four termination levels: 1.5 (slowest but lowest error), 2.0, 2.5, and 3.0 (fastest but highest error) dB. Data points show the median point-wise RMSE and number of trials. Data points closer to the origin (bottom left) of the axes are better (faster and more accurate). Three reliability conditions are simulated: (top) FP = FN = 3%, (middle) FP = FN = 15%, and (bottom) FP = FN = 30%. The left column shows mild eyes, the middle column shows moderate eyes, and the right column shows severe eyes.
Figure 4.
 
Scatterplots of paired (per-eye) difference between the performance of TORONTO with σterm = 1.5 dB and ZEST with σterm = 1.5 dB in 10-fold cross-validation of the Rotterdam data set. Data points closer to the bottom left indicate TORONTO has a lower error and lower test duration. For almost all conditions and eyes, TORONTO is faster and at least as accurate as ZEST.
Figure 4.
 
Scatterplots of paired (per-eye) difference between the performance of TORONTO with σterm = 1.5 dB and ZEST with σterm = 1.5 dB in 10-fold cross-validation of the Rotterdam data set. Data points closer to the bottom left indicate TORONTO has a lower error and lower test duration. For almost all conditions and eyes, TORONTO is faster and at least as accurate as ZEST.
Figure 5.
 
Error versus test duration plot of TORONTO (σterm = 1.5 dB) with different values of ε2, the error rate assumed in the psychometric function for updating other nontested locations in the field. The data points shown in each plot correspond, from left to right, to ε2 = 5%, 10%, 20%, 30% (default setting used above), 40%, and 50%. A lower ε2 value represents a more aggressive algorithm (faster but less accurate), relying more on the reference data set, and is more similar to a binary decision tree. A higher ε2 represents a less aggressive algorithm with smaller sensitivity to spatial patterns in the reference data set and is more similar to independent ZEST testing at each location. The algorithm performs best with ε2 between 20% and 40%.
Figure 5.
 
Error versus test duration plot of TORONTO (σterm = 1.5 dB) with different values of ε2, the error rate assumed in the psychometric function for updating other nontested locations in the field. The data points shown in each plot correspond, from left to right, to ε2 = 5%, 10%, 20%, 30% (default setting used above), 40%, and 50%. A lower ε2 value represents a more aggressive algorithm (faster but less accurate), relying more on the reference data set, and is more similar to a binary decision tree. A higher ε2 represents a less aggressive algorithm with smaller sensitivity to spatial patterns in the reference data set and is more similar to independent ZEST testing at each location. The algorithm performs best with ε2 between 20% and 40%.
Figure 6.
 
Visualization of T, the reference data set matrix, which consists of three locations (18, 19, and 20 of the 24-2 pattern) as described in the main text. Lighter values indicate better sensitivity. While in general the reference data set can be left unsorted, in this case, to aid in visualization and interpretation of P, reference examples with better sensitivity are generally sorted toward the top and ones with worse sensitivity are generally sorted toward the bottom, resulting in the pattern shown.
Figure 6.
 
Visualization of T, the reference data set matrix, which consists of three locations (18, 19, and 20 of the 24-2 pattern) as described in the main text. Lighter values indicate better sensitivity. While in general the reference data set can be left unsorted, in this case, to aid in visualization and interpretation of P, reference examples with better sensitivity are generally sorted toward the top and ones with worse sensitivity are generally sorted toward the bottom, resulting in the pattern shown.
Figure 7.
 
(A) Plot of joint probability density between locations 18 and 19. (B) Plot of joint probability density between locations 18 and 20. Darker color represents higher joint density. Both density plots indicate that these locations are correlated. The marginal histograms of the thresholds at each location are plotted at the top and to the right of each plot.
Figure 7.
 
(A) Plot of joint probability density between locations 18 and 19. (B) Plot of joint probability density between locations 18 and 20. Darker color represents higher joint density. Both density plots indicate that these locations are correlated. The marginal histograms of the thresholds at each location are plotted at the top and to the right of each plot.
Figure 8.
 
Step-by-step visualization of the TORONTO algorithm. Left column is the probability matrix (P) associated with the thresholds contained in reference data set matrix (T; see Figure 6). Each row in P describes the probability weights of the corresponding row in T (e.g., the first row in P describes the weights associated with the first row in T) after each trial, with a total of 278 rows in T in this example. Right column visualizes the threshold distribution in histogram form (blue), as well as the likelihood function used for updating (orange).
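The trial-oriented update the caption describes can be sketched as follows: each row of the reference data set T carries a weight, and after every trial the weights of all reference fields are multiplied by the likelihood of the observed response at the tested location, so threshold estimates at every location are recomputed from a single trial. All names, shapes, and the Gaussian psychometric form are assumptions for illustration; the toy T below has independent columns, whereas the real reference data set is spatially correlated (see Figure 7).

```python
import numpy as np
from math import erf, sqrt

def p_seen(stimulus_db, threshold_db, slope_db=1.0, eps=0.30):
    """Probability of a 'seen' response; dB is attenuation, so a stimulus
    below the threshold value is brighter and more likely to be seen."""
    return eps + (1.0 - 2.0 * eps) * 0.5 * (
        1.0 + erf((threshold_db - stimulus_db) / (slope_db * sqrt(2.0))))

def update_weights(weights, T, location, stimulus_db, seen, eps=0.30):
    """One trial-oriented Bayesian update: reweight every reference field
    by the likelihood of the observed response at the tested location."""
    p = np.array([p_seen(stimulus_db, t, eps=eps) for t in T[:, location]])
    posterior = weights * (p if seen else 1.0 - p)
    return posterior / posterior.sum()

rng = np.random.default_rng(0)
T = rng.normal(25.0, 5.0, size=(278, 3))         # 278 reference fields, 3 locations
weights = np.full(T.shape[0], 1.0 / T.shape[0])  # uniform prior over reference fields

weights = update_weights(weights, T, location=0, stimulus_db=25.0, seen=True)
estimates = weights @ T  # estimates at all locations recomputed after one trial
```

A 'seen' response at 25 dB shifts weight toward reference fields with higher (better) thresholds at the tested location; with correlated reference data, that shift would also pull the estimates at the other locations.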
Figure 9.
 
Step-by-step visualization of the TORONTO algorithm continuing from the example of Appendix A but with two false-negative trials at location 18. Left: Probability matrix (P) for thresholds in the reference data set (T). Right: Histogram of thresholds with likelihood function (orange) for updating P. See main text for more details.
Table 1.
 
Brief overview of the history of common visual field testing algorithms. Full Threshold, SITA Standard, and SITA Fast (italicized) are proprietary algorithms used by the Humphrey Field Analyzer; their performance is estimated from available information. The other algorithms were simulated in this article, with the median of results shown in parentheses. “Reliable” refers to FP = FN = 3% and “unreliable” refers to FP = FN = 15%.