Human beings have an excellent ability to form and recognise object categories. In this paper, a novel system for multimodal object recognition and categorisation through interactive behaviours is introduced. Video clips are filmed as the raw input to the system. A dataset of 100 objects in 18 categories, with 5 different interactions, is used to evaluate the performance. A convolutional neural network is used to train the classifier and learn the categories. The results report the highest, lowest and average recognition accuracies for every object in every category, and the receiver operating characteristic for every category. The connection between the presented system and the human cognitive system is discussed in the conclusion and future work.