The speed–accuracy tradeoff is an important feature of human performance that is difficult to explain with current computational models of object recognition. People perform better when given more time but can also respond quickly when needed, with some loss in accuracy. Trained deep convolutional neural networks, like people, can recognize objects from images. These networks, popular in computer vision, also successfully model both object recognition behavior and neural activity in humans and nonhuman primates (
Geirhos et al., 2021;
Yamins et al., 2014) and can predict aspects of biological vision beyond performance, such as adversarial examples (
Guo et al., 2022), and object representation topography (
Doshi & Konkle, 2022). However, these networks still lack a notion of time, which is a critical dimension of human object recognition and decision-making. Models of human reaction time generally include fixed delays (e.g., about 70 ms for the retina and several hundred milliseconds to plan and execute the keyboard response, depending on the number of alternatives, the distance the hand must travel, and the required spatial precision). These delays are independent of visual task difficulty, so a simple proportional mapping could not accommodate them. Our modeling handles this by allowing an unconstrained linear mapping (speed and delay, not just speed) between the network measure (timesteps) and the human reaction time in milliseconds. Standard deep networks are all-or-none: they either respond to a task using all of their parameters or do not respond at all, whereas people adapt to various time constraints and fail gracefully. To bridge the gap between neural networks and biological vision, it is therefore essential that neural networks explain this time-dependent behavior. As a step toward that goal, we use dynamic networks, a special class of deep networks capable of flexibly varying their computational resources. They do so by applying one of several possible strategies on top of a standard convolutional network backbone. In our analysis, we consider a representative sample of these strategies: early exits, recurrence, and parallel processing. Previous work in neuroscience supports recurrence (
Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019) and parallel/distributed processing (
Kugele, Pfeil, Pfeiffer, & Chicca, 2020) as viable representations of time in biological networks.
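The unconstrained linear mapping described above can be sketched as an ordinary least-squares fit with a slope (speed, in ms per timestep) and an intercept (delay, in ms) that absorbs the task-independent fixed delays. The timestep and reaction-time values below are illustrative placeholders, not data from our experiments:

```python
import numpy as np

# Hypothetical data: network exit timesteps and mean human reaction times (ms).
timesteps = np.array([1, 2, 3, 4, 5], dtype=float)
human_rt_ms = np.array([420.0, 510.0, 605.0, 690.0, 790.0])

# Unconstrained linear mapping: RT = speed * timesteps + delay.
# The intercept (delay) absorbs fixed delays such as retinal latency
# and motor execution time; the slope (speed) scales with task demands.
speed, delay = np.polyfit(timesteps, human_rt_ms, deg=1)

print(f"speed = {speed:.1f} ms/timestep, delay = {delay:.1f} ms")
# → speed = 92.0 ms/timestep, delay = 327.0 ms
```

A purely proportional fit (forcing delay = 0) would misattribute these fixed delays to the visual computation itself, which is why both parameters are left free.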
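As a minimal illustration of the early-exit strategy, the sketch below stops at the first exit head whose softmax confidence clears a threshold. The per-exit "logits" are toy stand-ins for the intermediate-layer outputs a real backbone would produce, so the specific functions here are assumptions, not any network's actual API:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_to_exit(x, exit_idx):
    # Toy stand-in for an exit head: later exits are modeled as
    # sharper (more confident) versions of the same prediction.
    return softmax(x * (exit_idx + 1))

def early_exit_predict(x, threshold=0.9, n_exits=3):
    """Return (class, exit used): stop at the first exit whose top
    softmax probability clears the confidence threshold."""
    for k in range(n_exits):
        probs = run_to_exit(x, k)
        if probs.max() >= threshold:
            return int(probs.argmax()), k
    return int(probs.argmax()), n_exits - 1  # fall through to the last exit

logits = np.array([2.0, 0.5, 0.1])
label, exit_used = early_exit_predict(logits)
# → label = 0, exit_used = 1 (the first exit is not confident enough)
```

Lowering the threshold trades accuracy for speed, which is the knob that lets such a network emulate responding under time pressure.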
Rafiei, Shekhar, and Rahnev (2024) performed a different but relevant analysis of some of the networks we tested (CNet-parallel, MSDNet, ConvRNN) as models of the human speed–accuracy tradeoff on an MNIST digit categorization task. Whereas we measure variations in accuracy within fixed reaction-time blocks, they allowed the model to decide when to respond. Across several measures, they found that although CNet accounts well for human accuracy and reaction time data, it is outperformed by RTNet, a stochastic evidence-accumulation neural network model.
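A stochastic evidence-accumulation decision of the kind RTNet implements can be caricatured as a race model: each alternative accrues noisy evidence until one crosses a threshold, and the crossing time serves as a reaction-time proxy. The drift rates and parameters below are illustrative assumptions, not RTNet's actual values:

```python
import numpy as np

def accumulate_evidence(drifts, threshold=5.0, noise_sd=1.0,
                        max_steps=10_000, seed=0):
    """Race-style stochastic accumulation: each choice gathers noisy
    evidence each step; respond when any accumulator crosses threshold.
    Returns (choice, steps taken as a reaction-time proxy)."""
    rng = np.random.default_rng(seed)
    evidence = np.zeros(len(drifts))
    for t in range(1, max_steps + 1):
        evidence += drifts + rng.normal(0.0, noise_sd, size=len(drifts))
        if evidence.max() >= threshold:
            return int(evidence.argmax()), t
    return int(evidence.argmax()), max_steps

# Illustrative drift rates: the correct class (index 0) gathers evidence faster,
# so it usually wins the race, but noise occasionally produces errors.
choice, steps = accumulate_evidence(np.array([0.4, 0.1, 0.1]))
```

Raising the threshold lengthens the decision but reduces errors, which is how such a model trades speed against accuracy when it is free to choose its own response time.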