Identifying the visual referent of a spoken word – recognizing that a particular insect is referred to by the word “bee” – requires both the ability to process and integrate multi-modal input and the ability to reason under uncertainty. How do these tasks interact with one another? We introduce a task that allows us to examine how adults identify words under joint uncertainty in the auditory and visual modalities. We propose an ideal observer model of the task that provides an optimal baseline. Model predictions are tested in two experiments in which word recognition is performed under two kinds of uncertainty: category ambiguity and distorting noise. In both cases, the ideal observer model explains much of the variance in human judgments. But when noise was added to one modality, human perceivers systematically preferred the unperturbed modality to a greater extent than the ideal observer model did.
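A minimal sketch of the kind of computation an ideal observer performs when combining the two modalities – not the paper's actual implementation. The lexicon, likelihood values, and conditional-independence assumption below are illustrative assumptions only:

```python
# Sketch of a Bayesian ideal observer that identifies a word from joint
# auditory and visual evidence. All numbers and words are hypothetical.

def ideal_observer(p_audio, p_visual, prior):
    """Posterior over candidate words given per-modality likelihoods.

    p_audio[w]  = P(auditory signal | word w)
    p_visual[w] = P(visual referent | word w)
    prior[w]    = P(word w)
    Assumes the modalities are conditionally independent given the word.
    """
    unnorm = {w: p_audio[w] * p_visual[w] * prior[w] for w in prior}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

# Hypothetical case: the audio is ambiguous between "bee" and "pea",
# but the picture clearly shows an insect.
posterior = ideal_observer(
    p_audio={"bee": 0.5, "pea": 0.5},   # audio alone is ambiguous
    p_visual={"bee": 0.9, "pea": 0.1},  # picture strongly favors "bee"
    prior={"bee": 0.5, "pea": 0.5},
)
print(max(posterior, key=posterior.get))  # -> "bee"
```

Under this formulation, an optimal perceiver weights each modality exactly in proportion to its reliability; the experiments test whether human listeners deviate from that weighting when one modality is degraded.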