The detection and categorization of animate motions is a crucial task underlying social interaction and perceptual decision-making. Neural representations of perceived animate objects are built in the primate cortical region STS which is a region of convergent input from intermediate level form and motion representations. Populations of STS cells exist which are selectively responsive to specific animated motion sequences. It is still unclear how and to which extent form and motion information contribute to the generation of such representations and what kind of mechanisms are involved in the learning processes. We demonstrate how the proposed model automatically selects significant motion patterns as well as meaningful static form prototypes. Sequence selective representations are learned in STS by fusing static form and motion input from the segregated bottom-up driving input streams. Cells in STS, in turn, feed activities recurrently to their input sites along top-down signal pathways, enabling predictions about future input.