When people listen to sentences referring to objects and events in a visual context, their visual attention to those objects is closely time-locked to words in the unfolding utterance. How precisely people deploy attention during situated language understanding, and in verifying (spatial) utterances, is, however, unclear. Visual world research suggests that we look at what is mentioned (Tanenhaus et al., 1995) and anticipate likely referents based on linguistic cues (Altmann & Kamide, 1999). In spatial language research, by contrast, the Attention Vector Sum model (Regier & Carlson, 2001) predicts that to process a sentence such as "The plant is above the clock", attention must shift from the clock to the plant. An eye-tracking study examined whether gaze patterns during the comprehension of spatial descriptions support the visual world or the Attention Vector Sum account. Analyses of eye movements indicate that both accounts are needed to accommodate the findings.