Learning rather than explicit modeling is the current preferred approach for many vision problems (
Saxena et al., 2005), often using deep neural networks (DNNs) (
Lindsay, 2021). Instead of postulating a specific image formation model, DNNs are trained in (semi)supervised fashion, using training data given as pairs,
\((x_n, y_n), n = 1, 2, \ldots, N\). The deep network can be viewed as a function
\(f\) that “learns” the relationship
\(y_n \approx f(x_n)\); that is, it learns to interpolate between images and surfaces by, for example, solving a variational problem of the form:
\begin{equation}
\min_f \sum_{n=1}^{N} \mathit{loss}(f(x_n), y_n) + \lambda \|f\|^2
\end{equation}
Note that, again, there are two terms, one for data fidelity and another for regularization (Parhi & Nowak, 2021).
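As an illustrative sketch (not drawn from the cited works) of how this objective is typically realized in practice: the sum over training pairs becomes a mini-batch loss, and the \(\lambda \|f\|^2\) penalty on the function is commonly implemented as weight decay on the network parameters. The architecture and hyperparameters below are hypothetical placeholders.
\begin{verbatim}
import torch
import torch.nn as nn

# Hypothetical network f: RGB image -> per-pixel depth map.
f = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),  # one channel: depth
)

data_loss = nn.MSELoss()  # the data-fidelity ("loss") term
# weight_decay plays the role of the lambda * ||f||^2 regularizer,
# penalizing the squared norm of the network weights.
optimizer = torch.optim.SGD(f.parameters(), lr=1e-3, weight_decay=1e-4)

def training_step(x_batch, y_batch):
    # One step of: min_f sum_n loss(f(x_n), y_n) + lambda ||f||^2
    optimizer.zero_grad()
    loss = data_loss(f(x_batch), y_batch)  # compare f(x_n) to y_n
    loss.backward()
    optimizer.step()  # weight decay is applied in this update
    return loss.item()
\end{verbatim}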
For shape-from-shading networks, \(x_n\) might be an (RGB) input image and
\(y_n\) the associated depth map (
Eigen et al., 2014) or surface normal map (
Tang et al., 2012;
Wang et al., 2020). Although this follows the traditional model of seeking surface normals at every point, now it is the function
\(f\) that is being solved for. Some loss functions add Lambertian terms (Tang et al., 2012) or other physical rendering models (Wu et al., 2015).
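As a minimal sketch of such a term, assuming the network outputs a per-pixel normal map: render a Lambertian image \(\max(0, n \cdot \ell)\) from the predicted normals under an assumed light direction \(\ell\), and penalize its disagreement with the observed intensities. The light direction, grayscale conversion, and weighting here are illustrative assumptions, not the formulations of the cited papers.
\begin{verbatim}
import torch
import torch.nn.functional as F

def lambertian_loss(pred_normals, image_gray, light_dir):
    # pred_normals: (B, 3, H, W) raw network output
    # image_gray:   (B, 1, H, W) observed intensities in [0, 1]
    # light_dir:    (3,) assumed (hypothetical) light direction
    n = F.normalize(pred_normals, dim=1)  # unit normal at each pixel
    l = F.normalize(light_dir, dim=0).view(1, 3, 1, 1)
    # Lambertian shading I = max(0, n . l); clamp self-shadowed pixels.
    rendered = (n * l).sum(dim=1, keepdim=True).clamp(min=0)
    return F.mse_loss(rendered, image_gray)

# Hypothetical usage, added to the data-fidelity term:
#   total = data_loss + mu * lambertian_loss(f(x), x_gray, light_dir)
\end{verbatim}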
Others argue that these models amount to latent variables and should be learned (Storrs & Fleming, 2021), introducing constraint through selected training data. Examples include “faces” (
Sengupta et al., 2018) or “chairs” or “dormitory rooms” (
Kulkarni et al., 2015; review in Breuß et al., 2021). In any case, the DNN architecture is tuned to interpolate the given data, so that the resulting algorithms can be brittle outside of it. Progress in generalizing from, say, chairs to cars is limited (Sitzmann et al., 2019). While particular solutions engineered for particular applications can be useful (
Parhi & Nowak, 2021), especially when the data set is acquired from natural scenes, it can be difficult to understand why the interpolation works (
Hutson, 2018). One might speculate that “faces” work because there is an underlying invariance—a template (
Blanz & Vetter, 1999)—that guides the interpolation. Regardless, the robustness of our visual systems remains elusive.