Abstract
To acquire or demonstrate a motor skill, we often break it down into a sequence of steps (e.g., a golf swing has "backswing" and "downswing" phases). But do we *see* single, smooth actions as containing discrete events? We compiled 20 animations depicting natural actions, spanning sports (e.g., kicking a ball), exercises (e.g., a jumping jack), and everyday tasks (e.g., picking up an object). In Experiment 1, observers placed a "boundary" dividing each action into two meaningful units. Consensus among observers implied a similar interpretation of the event structure of each action. Next, we explored whether these actions are spontaneously segmented during visual processing. We reasoned that if we visually represent actions as being divided into units by boundaries, then subtle changes occurring at these boundaries, specifically during the transition between units, should be less noticeable than changes at non-boundary moments. Experiments 2 and 3 tested observers' detection of transient slowdowns and frame shifts at boundary, pre-boundary, and post-boundary frames. People were worse at detecting changes at boundaries than at non-boundaries. What kind of information about observed actions drives this effect? Experiments 4 and 5 applied novel distortions to the videos, removing high-level semantic information while preserving lower-level spatio-temporal dependencies. The boundary effect was weakened yet persisted, suggesting that spatio-temporal dynamics play a crucial role in the mental structuring of actions. To quantify these dynamics, we extracted optical flow fields from each pair of consecutive frames of each video and computed 16 motion statistics from the flow maps to capture global and local motion characteristics. We found that the boundary judgments in Experiment 1 could be predicted by changes in the magnitude and direction of motion vectors, especially the smoothness of these variations.
Our results suggest that, when observing natural actions, the visual system automatically imposes boundaries based on image-computable, spatio-temporal motion patterns.
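As an illustration of what "image-computable" motion statistics could look like, the following is a minimal sketch and not the authors' pipeline. The abstract does not name the optical flow algorithm or list the 16 statistics, so everything here is an assumption: the flow fields are taken as already computed (one `(H, W, 2)` array of per-pixel displacements per pair of consecutive frames), only two global statistics are computed (mean magnitude and net direction), and the function names `flow_statistics` and `motion_change_profile` are hypothetical.

```python
import numpy as np


def flow_statistics(flow):
    """Summarize one dense flow field of shape (H, W, 2) holding (dx, dy).

    Returns two illustrative global statistics; the 16 statistics used in
    the study are not specified in the abstract.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)  # per-pixel speed
    # Direction of the summed motion vector, i.e. a magnitude-weighted
    # circular mean of the per-pixel directions.
    net_direction = np.arctan2(dy.sum(), dx.sum())
    return float(magnitude.mean()), float(net_direction)


def motion_change_profile(flows):
    """Frame-to-frame change in mean magnitude and net direction.

    `flows` is a sequence of flow fields, one per consecutive frame pair.
    Returns two arrays of length len(flows) - 1.
    """
    stats = np.array([flow_statistics(f) for f in flows])
    mags, angles = stats[:, 0], stats[:, 1]
    d_mag = np.abs(np.diff(mags))
    # Wrap angular differences into [-pi, pi] before taking the absolute
    # value, so that a jump across the +/-pi seam is not overstated.
    d_ang = np.abs(np.angle(np.exp(1j * np.diff(angles))))
    return d_mag, d_ang


if __name__ == "__main__":
    # Toy sequence: uniform rightward motion, then uniform leftward motion.
    right = np.zeros((4, 4, 2))
    right[..., 0] = 1.0
    left = -right
    d_mag, d_ang = motion_change_profile([right, right, right, left, left])
    print("magnitude change:", d_mag)
    print("direction change:", d_ang)
```

In this toy sequence the speed never changes, but the direction reverses between the third and fourth flow fields, so the direction-change profile peaks there. A peak of this kind is one simple candidate signal for a boundary; the smoothness of such profiles, which the abstract highlights, would need additional statistics not sketched here.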