Learning about objects typically involves the association of multisensory attributes. Here, we present three experiments supporting the existence of a specialized form of associative learning that depends on unitization. When the members of a multisensory pair (e.g., a face and a voice) were likely to belong to a single object, learning was better than when they were not likely to belong to the same object. Experiment 1 found that learning of face-voice pairs was superior when the members of each pair were of the same gender rather than opposite genders. Experiment 2 found a similar result when the paired associates were pictures and vocalizations of the same species vs. different species (dogs and birds). In Experiment 3, gender-incongruent video and audio stimuli were dubbed, producing an artificially unitized stimulus, which reduced the congruency advantage. Overall, these results suggest that unitizing multisensory attributes into a single object or identity is a specialized form of associative learning.