Abstract
Recent advances in computer vision have enabled machines to achieve high performance in labeling objects in natural scenes. However, object labeling constitutes only a small fraction of human daily activities. To move toward machines that can function in natural environments, these models should be evaluated on a more diverse set of natural tasks. Achieving this requires databases covering a broader set of human behaviors toward natural objects. Here, we collected a database of two different behaviors on a large set of 3D-printed objects: 1) a grasping task and 2) a similarity judgment task. For the grasping task, we recorded participants’ finger positions as they grasped the objects. For the similarity judgment task, we asked participants to perform an odd-one-out task on triplets of objects and derived a similarity matrix from these judgments. Comparing the resulting similarity matrices across the two tasks suggests that distinct object features are used for each. We next explored whether features extracted at different layers of state-of-the-art deep convolutional neural networks (DNNs) could be used to derive both outputs. These networks are pre-trained to perform categorization tasks, yet it has been suggested that they could be adapted to other tasks. Prediction accuracy for similarity judgments increased from lower to higher layers of the networks, whereas accuracy for grasping behavior increased from lower to mid-level layers and then dropped sharply at higher layers. These results suggest that a system performing both tasks may need its processing hierarchy to branch starting at the mid-level layers. Our findings could inform future models that perform a broader set of tasks on natural images.
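To make the analysis pipeline concrete, the sketch below illustrates, in Python, how an object-by-object similarity matrix might be derived from odd-one-out triplet judgments and then compared layer by layer against DNN feature representations. This is a minimal illustration, not the study's actual code: all function and variable names are hypothetical, and the layerwise comparison assumes a representational-similarity-style correlation, which the abstract does not specify as the prediction method.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_from_triplets(triplets, n_objects):
    """Derive an object-by-object similarity matrix from odd-one-out judgments.

    Each triplet is (i, j, k), where k was chosen as the odd one out,
    implying that i and j were the most similar pair in that triplet.
    Similarity is the fraction of co-occurrences in which a pair was
    implicitly judged most similar (a simple toy estimator).
    """
    counts = np.zeros((n_objects, n_objects))  # times a pair was judged most similar
    shown = np.zeros((n_objects, n_objects))   # times a pair appeared in a triplet
    for i, j, k in triplets:
        for a, b in ((i, j), (i, k), (j, k)):
            shown[a, b] += 1
            shown[b, a] += 1
        counts[i, j] += 1
        counts[j, i] += 1
    with np.errstate(invalid="ignore"):
        sim = np.where(shown > 0, counts / shown, np.nan)  # NaN for unseen pairs
    np.fill_diagonal(sim, 1.0)
    return sim

def layerwise_rsa(behavior_sim, layer_features):
    """Correlate behavioral similarity with DNN feature similarity per layer.

    layer_features: dict mapping layer name -> (n_objects, n_features) array
    of activations for each object. Returns the Spearman correlation between
    the off-diagonal entries of the behavioral matrix and each layer's
    feature-similarity matrix.
    """
    iu = np.triu_indices_from(behavior_sim, k=1)
    scores = {}
    for name, feats in layer_features.items():
        model_sim = np.corrcoef(feats)  # object-by-object feature similarity
        rho, _ = spearmanr(behavior_sim[iu], model_sim[iu], nan_policy="omit")
        scores[name] = rho
    return scores
```

Under this formulation, a per-layer correlation profile that rises monotonically toward higher layers for similarity judgments, but peaks at mid-level layers and then falls for grasping, would correspond to the pattern of results reported above.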