Humans are remarkably good at recognizing spoken language, even in very noisy environments. Yet artificial speech recognizers do not reach human-level performance, nor do they typically even attempt to model human speech processing. In this paper, we introduce a biologically plausible neural model of real-time spoken phrase recognition that shows how the time-varying spiking activity of neurons can be integrated into word tokens. We present a proof-of-concept implementation of the model, which shows promise in terms of both recognition accuracy and recognition speed. The model is also pragmatically useful to cognitive modelers who require robust any-time speech recognition, such as those building real-time models of human-robot interaction. We therefore also present an example of embedding our model in a larger cognitive model, along with an offline analysis of its performance on a speech corpus.