The input to a cognitively plausible model of language acquisition must have the same information components and statistical properties as child-directed speech. Collections of child-directed utterances exist (e.g., CHILDES), but a realistic representation of their visual and semantic context is not available. We propose three quantitative measures for analyzing the statistical properties of a manually annotated sample of child-adult interaction videos, and compare these against scene representations automatically generated from the same child-directed utterances, showing that the two datasets differ significantly. To address this problem, we propose an interaction-based framework for generating utterances and scenes from the co-occurrence frequencies collected from the annotated videos, and show that the resulting interaction-based dataset is comparable to naturalistic data. Using an existing model of cross-situational word learning as a case study for comparing the datasets, we show that only the interaction-based data preserve the complexity of the learning task.