The nature of audio-visual interactions for meaningful objects is poorly understood. According to amodal accounts of knowledge, these interactions are indirect and mediated by semantic memory; according to modal accounts, they are direct. This question, central to both memory and multisensory frameworks, was assessed using a cross-modal priming paradigm, from the auditory to the visual modality, with familiar objects. For half of the sound primes, an abstract visual mask was presented simultaneously. Without the mask, the results showed a cross-modal priming effect for semantically congruent objects compared to semantically incongruent objects. The mask interfered in the semantically congruent condition but had no effect in the semantically incongruent condition. The semantic specificity of the mask effect shows that the effect is memory-related. The results suggest that audio-visual interactions are direct. The data support the modal approach to knowledge and grounded cognition theory.