In the past few years, deep convolutional neural networks (CNNs) trained on large image data sets have shown impressive visual object recognition performance. Consequently, these models have attracted the attention of the cognitive science community. Recent studies comparing CNNs with neural data from the inferotemporal (IT) cortex suggest that CNNs may, in addition to providing good engineering solutions, provide good models of biological visual systems. Here, we report evidence that CNNs are, in fact, not good models of human visual perception. We show that a 3D shape inference model explains human performance on an object shape similarity task better than CNNs do. We argue that deep neural networks trained on large amounts of image data to maximize object recognition performance do not provide adequate models of human vision.