How people achieve long-term goals in an imperfectly known environment, via repeated tries and noisy outcomes, is an important problem in cognitive science. There are two interrelated questions: how humans represent information, both what has been learned and what can still be learned, and how they choose actions, in particular how they negotiate the tension between exploration and exploitation. In this work, we examine human behavioral data in a multi-armed bandit setting, in which the subject choose one of four “arms” to pull on each trial and receives a binary outcome (win/lose). We implement both the Bayes optimal policy, which maximizes the expected cumulative reward in this ﬁnite horizon bandit environment, as well as a variety of heuristic policies that vary in their complexity of information representation and decision policy. We ﬁnd that the knowledge gradient algorithm, which combines exact Bayesian learning with a decision policy that maximizes a combination of immediate reward gain and long-term knowledge gain, captures subjects’ trial-by-trial choice best among all the models considered; it also provides the best approximation to the computationally intense optimal policy among all the heuristic policies.