Empirical evidence from studies using the visual world paradigm shows that spoken language guides attention in a related visual scene and that scene information can, in turn, influence comprehension. Here we model sentence comprehension in visual context. A recurrent neural network is trained to associate linguistic input with the visual scene and to produce an interpretation of the described event. A feedback mechanism in the form of sigma-pi connections is added to model the explicit utterance-mediated visual attention behavior revealed by the visual world paradigm. The results show that the network successfully learns sentence-final interpretation and also exhibits the hallmark anticipatory behavior of predicting upcoming constituents.
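
As a rough illustration of the sigma-pi feedback idea, the sketch below gates the features of each depicted event by an attention value (the multiplicative "pi" step) before a weighted sum (the "sigma" step). It is only a minimal sketch under stated assumptions: the softmax normalization, the sharing of weights across events, and all names (`softmax`, `sigma_pi_gate`, `gate_scores`) are hypothetical choices made for this example and are not taken from the model described above.

```python
import math


def softmax(scores):
    """Normalize raw scores into attention values that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigma_pi_gate(events, gate_scores, weights):
    """Gate scene events multiplicatively, then sum them with weights.

    events      -- one feature vector per depicted event (assumed layout)
    gate_scores -- one feedback score per event (assumed to come from the
                   network's internal state; hypothetical here)
    weights     -- weights[j][k] links feature k to output unit j and is
                   shared across events (an assumption of this sketch)
    """
    attention = softmax(gate_scores)
    # Pi step: each event's features are multiplied by its attention value.
    gated = [[a * x for x in event] for a, event in zip(attention, events)]
    # Sigma step: weighted sum of the gated features over all events.
    outputs = [
        sum(row[k] * g[k] for g in gated for k in range(len(g)))
        for row in weights
    ]
    return attention, outputs


if __name__ == "__main__":
    events = [[1.0, 0.0], [0.0, 1.0]]   # two toy scene events
    weights = [[1.0, 0.0], [0.0, 1.0]]  # identity weights for readability
    print(sigma_pi_gate(events, [0.0, 0.0], weights))
    print(sigma_pi_gate(events, [10.0, -10.0], weights))
```

With equal gate scores both events contribute equally; when the feedback strongly favors the first event, the second event's contribution is suppressed almost entirely, which is the kind of utterance-driven attention shift the feedback mechanism is meant to capture.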