Abstract
A visual scene can be recursively parsed into parts and sub-parts. We present a causal model of scene parsing that can synthesize and identify parse trees, predict perceptual complexity, and pass a (limited) Turing test. The model describes the causal process of generating a scene as recursively splitting it with vertical and horizontal cuts, modulated by two parameters: (a) Splitting Factor (SF) – a large SF favors splitting a part into more sub-parts, generating a wide, shallow tree; (b) Part Similarity (PS) – a large PS favors splitting a part into even sub-parts. Given a scene, the model infers the best parse tree by evaluating the number of parts and their similarities at each partition. The model motivates three human experiments that go beyond RT/accuracy measurements. (1) "Just cut it": participants freely and recursively cut a blank scene into 6 rectangles. From these human-generated images, the model eliminated all free parameters by estimating human priors on SF and PS. (2) "Complexity comparison": some scenes are immediately perceived as more complex than others. Scene complexity can be quantified as information content, determined by the probability of generating that scene (a higher-probability scene carries less information). Participants ranked the complexities of 20 scenes via paired comparisons; the model ranked the same images by computing their information content. Human and model rankings were strongly correlated (r² = 0.85). (3) Turing test: each scene can be interpreted by multiple parse trees. Participants viewed a scene together with one parse tree and reported whether the tree was generated by a human or by a machine. Two baseline models were introduced: (a) the causal model with non-informative SF and PS priors; (b) a model that uniformly samples a tree from all valid ones. Only the causal model with human priors passed the Turing test. These results demonstrate how human scene parsing can be formalized with a causal model.
Meeting abstract presented at VSS 2018
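The generative process described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the exact stopping rule, the mapping from SF to the number of sub-parts (here an exponential draw), and the mapping from PS to split evenness (here a symmetric Dirichlet-style draw via gamma variates) are all assumed forms chosen only to convey the roles of the two parameters.

```python
import math
import random

def split_sizes(total, k, ps, rng):
    """Draw k sub-part sizes summing to total. Larger ps -> more even
    sub-parts (symmetric Dirichlet via gamma draws; an assumed form)."""
    weights = [rng.gammavariate(ps, 1.0) for _ in range(k)]
    s = sum(weights)
    return [total * w / s for w in weights]

def generate_tree(width, height, sf, ps, depth=0, max_depth=3, rng=None):
    """Recursively split a rectangle with vertical or horizontal cuts.
    Larger sf -> more sub-parts per split (wider, shallower trees);
    larger ps -> more even splits. Stop rule is illustrative only."""
    rng = rng or random.Random()
    if depth >= max_depth or rng.random() > 0.7:
        return {"rect": (width, height), "children": []}
    # Number of sub-parts biased by sf (assumed exponential form, capped).
    k = 2 + min(int(rng.expovariate(1.0 / sf)), 3)
    horizontal = rng.random() < 0.5
    total = height if horizontal else width
    sizes = split_sizes(total, k, ps, rng)
    children = []
    for s in sizes:
        w, h = (width, s) if horizontal else (s, height)
        children.append(generate_tree(w, h, sf, ps, depth + 1, max_depth, rng))
    return {"rect": (width, height), "children": children}

def count_leaves(tree):
    """Count terminal parts (leaf rectangles) of a parse tree."""
    if not tree["children"]:
        return 1
    return sum(count_leaves(c) for c in tree["children"])

def information_content(prob):
    """Complexity as information content: -log2 of the probability of
    generating the scene (rarer scenes carry more bits)."""
    return -math.log2(prob)
```

Under this sketch, a scene's complexity ranking (experiment 2) would follow from summing the log-probabilities of every choice made along its parse tree and negating, so that low-probability (many-part, uneven) scenes score as more complex.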