Constructing Typing-Time Corpora: A New Way to Answer Old Questions


Many current studies in linguistics and psycholinguistics require phonetically labeled speech data, which are expensive and slow to collect and annotate. An alternative is to use pre-labeled speech corpora, but these are available for very few languages, may not contain the desired linguistic environment, and are themselves expensive and time-consuming to construct. We present a fast and cost-efficient method for constructing a new type of corpus, the typing-time corpus, which retains many of the advantages of phonetically labeled speech. In this paper we show that an English typing-time corpus collected over the web is sufficient to replicate word frequency and neighborhood density effects. We then demonstrate the transferability of this method to less-studied languages and to different orthographies. We show that a smaller Hebrew typing-time corpus, also collected over the web, can be used to find lengthening effects in infrequent Hebrew words.

