Dividing observed human behavior into individual, meaningful actions is a critical task for both human learners and computer vision systems. An important question is how much action structure and segmentation information is available in the observed surface-level motion and image changes, without any knowledge of human pose or behavior. Here we present a novel approach to jointly segmenting and recognizing videos of human action sequences, using a hierarchical topic model. Video sequences are represented as bags of video words, automatically discovered from local space-time interest points. Our model jointly infers action identity and action segmentation. The resulting segmentations agree well with human segmentation judgments, and the model also achieves relatively accurate action recognition and localization within the videos.
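The pipeline described above can be illustrated with a minimal sketch: temporal windows of a video are represented as bag-of-video-words histograms, a topic model assigns each window a mixture over latent actions, and segment boundaries fall where the dominant action changes. This is a simplified, non-hierarchical stand-in (sklearn's flat LDA rather than the paper's hierarchical model), and the codebook size, window counts, and word distributions are invented toy values, not data from the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy data (assumption): 20 temporal windows over a 6-word video codebook.
# The first 10 windows draw video words from one "action" distribution,
# the last 10 from another, mimicking two actions performed in sequence.
rng = np.random.default_rng(0)
action_a = rng.multinomial(50, [0.40, 0.40, 0.10, 0.05, 0.03, 0.02], size=10)
action_b = rng.multinomial(50, [0.02, 0.03, 0.05, 0.10, 0.40, 0.40], size=10)
X = np.vstack([action_a, action_b])  # shape (20, 6): bag-of-words per window

# Fit a flat two-topic LDA; theta holds each window's mixture over latent actions.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)
labels = theta.argmax(axis=1)  # dominant latent action per window

# Joint segmentation + recognition: a boundary is placed wherever the
# dominant action label changes between consecutive windows.
boundaries = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
print(boundaries)
```

With well-separated word distributions like these, the recovered boundary should sit near the true change point at window 10; real video-word histograms are far noisier, which is part of what motivates modeling segmentation and recognition jointly rather than thresholding labels post hoc.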