We present an indoor guidance study exploring the interplay between spoken instructions and listeners' eye movements. In the study, a remote speaker verbally guided a listener, and together they solved nine tasks. We collected a multi-modal dataset consisting of video from the listeners' perspective, their gaze data, and the instructors' utterances. We analyse how instructions and listener gaze change when the speaker can see 1) only the video, 2) the video and a gaze cursor, or 3) the video and a manipulated gaze cursor. Our results show that listeners' visual behaviour depends mainly on whether an utterance is present, but also varies significantly before and after instructions. Additionally, more negative feedback occurred in condition 2). Beyond piloting a new experimental setup, our results indicate that gaze is both a symptom of language comprehension and a signal that listeners employ when it appears useful, and which therefore adapts to our manipulation.