We present a comprehensive strategy for evaluating image retrieval algorithms. Because automated image retrieval is only meaningful in its service to people, performance characterization must be grounded in human evaluation. Thus we have collected a large data set of human evaluations of retrieval results, both for query by image example and query by text. The data is independent of any particular image retrieval algorithm and can be used to evaluate and compare many such algorithms without further data collection. The data and calibration software are available on-line (http://kobus.ca/research/data). We develop and validate methods for generating sensible evaluation data, calibrating for disparate evaluators, mapping image retrieval system scores to the human evaluation results, and comparing retrieval systems. We demonstrate the process by providing grounded comparison results for several algorithms.