Do Humans Look Where Deep Convolutional Neural Networks “Attend”?

AbstractConvolutional Neural Networks (CNNs) have recently begun to exhibit human level performance on some visual perception tasks. Performance remains relatively poor on vision tasks like object detection. We hypothesized that this gap is largely due to the fact that humans exhibit selective attention, while most object detection CNNs have no corresponding mechanism. We investigated some well-known attention mechanisms in the deep learning literature, identifying their weaknesses and leading us to propose a novel CNN approach to object detection: the Densely Connected Attention Model. We then measured human spatial attention, in the form of eye tracking data, during the performance of an analogous object detection task. By comparing the learned representations produced by various CNNs with that exhibited by human viewers, we identified some relative strengths and weaknesses of the examined attention mechanisms. The resulting comparisons provide insights into the relationship between CNN object detection systems and the human visual system.

Return to previous page