How do humans interact with each other using gestures? How do they grasp the semantics of gestures in order to predict or react to them? We explore hierarchical slow-feature models for extracting the high-level semantics of gesture conversations. We adopt the hypernetwork model as a basic component for learning the elementary semantics of gestures, and combine two hypernetworks with an added upper layer of slow features to learn the temporal transitions of semantics. This hierarchical slow-feature model abstracts low-level joint-angle features into slowly changing high-level features that represent the semantics of the gestures. The model also learns, at the semantic level, the probability of the partner’s next gesture given one person’s gesture. We used the Kinect motion capture device to record the gestures of two subjects in gesture conversation scenarios. The recorded human gesture data were used to train the hierarchical slow-feature model to predict gestures in these conversations. The trained model was then transferred to the Darwin humanoid robot. We compare the human behaviors with the robot behaviors learned from the human gestures.
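To make the idea of a semantic-level response probability concrete, the following minimal sketch estimates the probability of the partner's next gesture given one person's gesture by simple frequency counting over labeled pairs. It is only an illustration and is not the hierarchical slow-feature model itself: the function name `estimate_response_probs` and the gesture labels are hypothetical, and the counting estimator stands in for whatever the trained model learns.

```python
from collections import Counter, defaultdict


def estimate_response_probs(pairs):
    """Estimate P(partner's next gesture | observed gesture) by relative frequency.

    `pairs` is an iterable of (gesture, response) semantic labels.
    This is a plain counting estimator, assumed here for illustration only.
    """
    counts = defaultdict(Counter)
    for gesture, response in pairs:
        counts[gesture][response] += 1

    probs = {}
    for gesture, responses in counts.items():
        total = sum(responses.values())
        probs[gesture] = {r: n / total for r, n in responses.items()}
    return probs


# Hypothetical semantic labels, invented for this example (not from the recorded data).
example_pairs = [
    ("wave", "wave"),
    ("wave", "wave"),
    ("wave", "bow"),
    ("point", "nod"),
]
response_probs = estimate_response_probs(example_pairs)
```

In this toy setting, each row of `response_probs` is a distribution over the partner's possible responses to one gesture label, which is the kind of quantity a system could sample from or maximize over when choosing a reply.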