In this study, binding in visual short-term memory (VSTM) across the visual and manual domains was investigated. Six human observers performed a yes-no recognition task for object appearance in three fully counterbalanced experimental conditions, with unfamiliar, nonverbal 1/f noise discs serving as stimuli. In the memory display, four stimuli (each subtending 2 deg) were presented sequentially (850 ms each) at random spatial positions. Following a 1000 ms blank interval, a test stimulus was presented. In condition 1, observers executed hand movements (spatial tapping) during the memory display, touching a pointer to a graphics tablet at the position corresponding to the screen coordinate of each stimulus as it appeared. The test stimulus was presented at one of the coordinates used in the preceding memory display. Condition 2 was identical to condition 1, except that no spatial tapping was performed. In condition 3, both memory and test stimuli were presented at (different) random coordinates; observers performed spatial tapping during the memory display, as in condition 1, but the positions of the test stimuli did not correspond to preceding hand/screen positions. In all three conditions, the cursor was invisible; observers first completed a training session with a visible cursor to associate graphics tablet coordinates with screen coordinates. Performance, measured as d′, was significantly greater in condition 1 than in condition 2 [F(1,5)=20.35, p<0.01] and condition 3 [F(1,5)=10.14, p=0.02], and did not differ significantly between conditions 2 and 3 [F(1,5)=4.54, p=0.09]. These findings suggest that a spatially correlated manual action facilitates VSTM, providing evidence that perception and action bind across representational domains to associate relevant stimulus properties, consistent with event file theory (Hommel, 1998, 2004).
Furthermore, given that our stimuli were unlikely to accrue visual long-term memory (VLTM) support, these findings indicate that visuo-manual binding occurs even when semantic and associative VLTM cues are minimized.
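The sensitivity measure d′ reported above is computed from hit and false-alarm rates in the yes-no task. As a minimal sketch (not the authors' analysis code), the function below computes d′ from raw response counts; the log-linear correction used here is one common convention for avoiding infinite z-scores at rates of 0 or 1, and is an assumption on our part:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' for a yes-no recognition task.

    A log-linear correction (add 0.5 to each count, 1 to each total)
    keeps hit and false-alarm rates strictly between 0 and 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Example: perfect discrimination yields a large positive d',
# chance-level responding yields d' = 0.
print(d_prime(10, 0, 0, 10))
print(d_prime(5, 5, 5, 5))
```

With equal hit and false-alarm rates the two z-scores cancel, so d′ = 0, matching the interpretation of chance performance.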