Bayes’ rule is applied to compute updated posteriors upon observation of data {y, X}. Because of the Bernoulli likelihood, the posteriors cannot be computed in closed form. They are estimated in this case via variational inference (Hensman, Matthews, & Ghahramani, 2015; Titsias, 2009). Variational inference finds the best approximation of the true posterior distribution from a family of simpler distributions by minimizing the Kullback-Leibler divergence between the approximate and true posteriors (Gardner, Pleiss, Weinberger, Bindel, & Wilson, 2018).
We can estimate the posterior distribution of the GP model efficiently; when the model is trained with all existing data, it can compute updates for any new sample x* ∈ X* defined over spatial frequency and visual contrast. Therefore, the new sample x* that, upon observation, maximizes some utility function U(x*) is optimal under that function:
\begin{eqnarray}
A\left( {{\bf x}^{*}} \right) = \mathop{\rm argmax}_{{{\bf x}^{*}} \in {{\bf X}^{*}}} U\left( {{\bf x}^{*}} \mid {{\bf X}}, {{\bf y}} \right), \quad
\end{eqnarray}
where A( · ) represents the acquisition function and U( · ) is a utility function reflecting model quality. The previous implementation of the first-generation acquisition function prioritized uncertainty sampling by defining the utility function as the differential entropy calculated using the predictive mean μ and variance σ² of the model observed through the likelihood (Marticorena et al., 2024):
\begin{eqnarray}
DE\left( {{\bf x}^{*}} \right) &=& H\left( \Phi\left( \frac{\mu\left( {{\bf x}^{*}} \right)}{\sigma\left( {{\bf x}^{*}} \right)} \right) \right) \nonumber \\
&& -\, \frac{C}{\sqrt{\sigma^{2}\left( {{\bf x}^{*}} \right) + C^{2}}} \exp\left( \frac{-\mu^{2}\left( {{\bf x}^{*}} \right)}{2\left( \sigma^{2}\left( {{\bf x}^{*}} \right) + C^{2} \right)} \right), \quad
\end{eqnarray}
where H is the binary entropy function H(x) = −x log₂(x) − (1 − x) log₂(1 − x), Φ is the CDF of a standard normal distribution, and C is a normalizing factor \( C = \sqrt{\frac{\pi \ln (2)}{2}} \), which affords the approximation of the second term in closed form (Houlsby, Huszár, Ghahramani, & Lengyel, 2011). This acquisition function finds the next best sample point x* that maximizes the differential entropy, which is a proxy for information gain.
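As a concrete illustration, the first-generation acquisition amounts to evaluating DE(x*) from the predictive mean and variance at each candidate point and taking the argmax. The following is a minimal sketch assuming the GP predictive moments are already available; the function and variable names (`predict`, `acquire`) are illustrative, not taken from the MLCRF implementation.

```python
import math

# Normalizing factor C = sqrt(pi * ln(2) / 2) from Houlsby et al. (2011)
C = math.sqrt(math.pi * math.log(2) / 2.0)

def binary_entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def differential_entropy(mu, sigma2):
    """DE(x*) from the predictive mean mu and variance sigma2 at x*."""
    # First term: binary entropy of the predictive probability,
    # where Phi is the standard normal CDF.
    phi = 0.5 * (1.0 + math.erf(mu / math.sqrt(sigma2) / math.sqrt(2.0)))
    first = binary_entropy(phi)
    # Second term: closed-form approximation (Houlsby et al., 2011).
    second = (C / math.sqrt(sigma2 + C**2)) * math.exp(
        -mu**2 / (2.0 * (sigma2 + C**2))
    )
    return first - second

def acquire(candidates, predict):
    """First-generation acquisition: argmax of DE over candidates x* in X*.

    `predict` maps a candidate point to its predictive (mu, sigma2); it is a
    hypothetical stand-in for the trained GP model's posterior moments.
    """
    return max(candidates, key=lambda x: differential_entropy(*predict(x)))
```

Differential entropy peaks where the model is most uncertain (predictive probability near 0.5) and shrinks where the outcome is nearly determined, which is what makes it a useful proxy for information gain.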
The gradients of differential entropy for the first-generation MLCRF estimator were very similar over contrast and very shallow over spatial frequency. Imposing a hard nonlinearity under these conditions can produce degenerate behavior in which the updated model changes little, resulting in multiple samples in close proximity. We hypothesized that implementing a sampling density penalty would provide a better balance between exploration and exploitation than the original acquisition function allowed.
The second-generation acquisition function, rather than taking the maximum of U(x*|X,y), prioritizes a subset, Xhi(n), containing the points whose utility values fall within the highest n%. Next, it applies a nearest-neighbor search to identify the point x* within Xhi(n) that is furthest from any point in the existing sample set X. This sampling density penalty allows for more effective exploration initially, preventing oversampling of nearby points. As data acquisition continues, the new method enhances exploitation, ensuring more uniform sampling across spatial frequencies.
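The two-step selection rule above can be sketched as follows. This assumes NumPy arrays of candidate points and precomputed utility values; the function name and `top_pct` parameter are hypothetical, not drawn from the MLCRF codebase.

```python
import numpy as np

def acquire_with_density_penalty(candidates, utilities, existing, top_pct=10.0):
    """Second-generation acquisition sketch.

    Step 1: keep the candidates whose utility lies in the highest `top_pct`
            percent, forming the subset Xhi(n).
    Step 2: among that subset, return the candidate farthest from its nearest
            neighbor in the existing sample set X (the sampling density penalty).
    """
    cutoff = np.percentile(utilities, 100.0 - top_pct)
    hi = candidates[utilities >= cutoff]
    # Pairwise distances from each high-utility candidate to every existing sample
    dists = np.linalg.norm(hi[:, None, :] - existing[None, :, :], axis=-1)
    nearest = dists.min(axis=1)  # distance to the closest existing sample
    return hi[np.argmax(nearest)]
```

Early in acquisition, when few samples exist, nearest-neighbor distances are large almost everywhere and the rule favors exploration; as the sample set fills in, the utility cutoff increasingly dominates and the rule shifts toward exploitation.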