We introduce a modular recurrent neural architecture that learns distributed, generative temporal models of biological motion. It encodes visual and proprioceptive (angular) biological motion in separate modal autoencoders, each of which structures the respective postures, motion directions, and motion magnitudes into distinct submodal components. The submodal encoders are interdependent: each learns to temporally predict the others' next autoencoder states. As a result, distributed attractor states can develop from self-generated motions. We show that the architecture is able to synchronize its activities across modalities towards overall consistent, action-encoding attractors. Moreover, the developing spatial and temporal structures can complete partially observable actions, e.g., when only visual information is provided. Furthermore, we show that the network is capable of simulating whole-body actions without any sensory stimulation, thus imagining unfolding actions. Finally, we show that the network is able to infer the visual perspective on a biological motion. The neural architecture thus enables embodied perspective taking and action inference.
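To make the cross-modal coupling concrete, the following is a minimal NumPy sketch of the core idea: two autoencoders (one per modality) whose latent codes are trained to predict each other's next latent state, so that one modality can later be completed from the other. All names, dimensions, the synthetic data, and the linear, one-step simplifications are illustrative assumptions, not the paper's recurrent implementation.

```python
# Minimal sketch (assumption: linear autoencoders and one-step linear
# cross-predictors stand in for the recurrent submodal encoders).
import numpy as np

rng = np.random.default_rng(0)
T, d_vis, d_prop, d_z = 200, 12, 8, 4   # time steps, input dims, latent dim

# Toy "biological motion": smooth sinusoidal trajectories per modality.
t = np.linspace(0, 8 * np.pi, T)
vis  = np.sin(np.outer(t, rng.uniform(0.5, 1.5, d_vis)))    # visual stream
prop = np.cos(np.outer(t, rng.uniform(0.5, 1.5, d_prop)))   # proprioceptive stream

def init(d_in, d_z):
    return {"enc": rng.normal(0, 0.1, (d_z, d_in)),
            "dec": rng.normal(0, 0.1, (d_in, d_z))}

ae_v, ae_p = init(d_vis, d_z), init(d_prop, d_z)
P_vp = rng.normal(0, 0.1, (d_z, d_z))  # predicts next proprioceptive latent from visual
P_pv = rng.normal(0, 0.1, (d_z, d_z))  # predicts next visual latent from proprioceptive

lr = 0.01
for epoch in range(200):
    z_v, z_p = vis @ ae_v["enc"].T, prop @ ae_p["enc"].T
    # 1) Reconstruction: each modality autoencodes its own input.
    for ae, x, z in ((ae_v, vis, z_v), (ae_p, prop, z_p)):
        err = x - z @ ae["dec"].T
        ae["dec"] += lr * err.T @ z / T
        ae["enc"] += lr * (err @ ae["dec"]).T @ x / T
    # 2) Cross-prediction: each latent predicts the other's next latent.
    e_vp = z_p[1:] - z_v[:-1] @ P_vp.T
    e_pv = z_v[1:] - z_p[:-1] @ P_pv.T
    P_vp += lr * e_vp.T @ z_v[:-1] / T
    P_pv += lr * e_pv.T @ z_p[:-1] / T

# After training, a missing modality can be filled in from the other, e.g.
# predicting proprioception from vision alone (cf. action completion above).
prop_hat = (vis[:-1] @ ae_v["enc"].T) @ P_vp.T @ ae_p["dec"].T
print("cross-modal completion error:", np.mean((prop_hat - prop[1:]) ** 2))
```

In this simplified form, the final completion step mirrors the abstract's claim that partially observable actions can be completed from visual input alone; the paper's architecture achieves this with recurrent submodal encoders rather than the single linear predictors assumed here.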