CleverBirds: A Multiple-Choice Benchmark for
Fine-grained Human Knowledge Tracing
Fine-grained visual recognition skills are vital to many expert domains, yet understanding how humans acquire such expertise remains an open challenge. We introduce CleverBirds, a large-scale benchmark for knowledge tracing in fine-grained visual recognition. The dataset contains 17.9 million multiple-choice questions from 40,144 participants across 10,779 bird species, with an average of 444 questions per participant. This dataset was introduced in CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing, to appear at NeurIPS 2025 (Datasets and Benchmarks track).
CleverBirds enables us to study how individuals learn to recognize fine-grained visual distinctions over time. We evaluate state-of-the-art knowledge tracing methods on this benchmark and find that tracking learner knowledge across participant subgroups and question types is challenging, with different forms of contextual information providing varying degrees of predictive benefit.
We collected CleverBirds from the Photo and Sound Quiz of the eBird citizen science platform. In this quiz, participants are shown a bird image and asked to identify the species from a list of options. They receive immediate feedback on the correct answer after each response. Quiz responses were collected from March 2018 to October 2024.
Figure 1: Three examples of quiz questions from CleverBirds. Each question has four species options plus a "None of the above" option. The correct answer is indicated in green. All five options are valid answers, and the candidate species differ for each question.
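For concreteness, each quiz interaction can be represented roughly as the following record (a minimal sketch in Python; the field names are illustrative assumptions, not the dataset's actual schema):

from dataclasses import dataclass
from typing import List

@dataclass
class QuizInteraction:
    """One multiple-choice quiz question and the participant's response.

    Field names are illustrative assumptions; consult the released
    dataset for the actual schema.
    """
    participant_id: int      # anonymized learner identifier
    timestamp: str           # when the question was answered
    image_id: str            # the bird photo shown
    options: List[str]       # four candidate species plus "None of the above"
    correct_answer: str      # revealed as feedback after each response
    participant_guess: str   # the option the participant selected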
CleverBirds is one of the largest benchmarks for visual knowledge tracing, with substantially more learnable concepts than existing datasets. The dataset captures learning patterns across a diverse participant population, enabling us to study how expertise develops over time.
Dataset statistics: 40,144 participants; 17,859,392 total interactions; 10,779 bird species; 444 questions per participant on average.
Participants can choose which location they want to be quizzed on, and quiz locations are distributed globally.
Figure 2: World map of quiz locations, aggregated into hexagonal (H3) bins, where color intensity encodes the number of interactions per bin.
The knowledge tracing task is to predict a participant's response given the current question and its correct answer. Models are also provided with additional context, such as the participant's interaction history and species information, and must infer the learner's evolving knowledge state to predict their guess.
Figure 3: (Left) Human Learning. Participants learn from CleverBirds quiz questions through repeated interactions. For each question, participants see a bird image and a list of possible species names, which may include the correct answer. After making a guess, they receive feedback with the correct answer. (Right) Knowledge Tracing. The prediction task: given a participant's interaction history, the current question's image, options, and correct answer, predict the participant's guess.
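In code, the prediction problem has roughly this shape (a sketch assuming the hypothetical QuizInteraction record above; the "Always Correct" heuristic from Figure 4 is shown as a trivial reference predictor):

from typing import List, Sequence

def predict_guess(history: Sequence[QuizInteraction],
                  options: List[str],
                  correct_answer: str) -> str:
    """Knowledge tracing interface: given a participant's past
    interactions and the current question, predict which of the
    offered options the participant will choose.

    This sketch implements the trivial "Always Correct" heuristic
    from Figure 4: it ignores the history and assumes the learner
    answers correctly. Real models condition on the history (and,
    for image-based models, the photo itself) to track the
    evolving knowledge state.
    """
    # The correct option (possibly "None of the above") is always listed.
    assert correct_answer in options
    return correct_answer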
We evaluated a range of machine learning models and state-of-the-art knowledge tracing methods on CleverBirds. We found that tracking learner knowledge is challenging, especially when predicting incorrect choices, with different forms of contextual information providing varying degrees of predictive benefit.
Figure 4: Performance on the multiple-choice and binary tasks. Top-left: accuracy on the full multiple-choice dataset. Top-right: accuracy on the subset of questions answered incorrectly. Bottom-left: macro-averaged accuracy on the binary task. Bottom-right: average precision (AP) for predicting user errors. Models are grouped by color into simple classifiers (RF U, RF S, RF U+S), MLPs (MLP U+S+Img, MLP Img), KT models (LM MCC, LM Seq2seq, AKT, ATKT, and simpleKT), and simple heuristics (Always Correct, Random binary, Random multiple-choice, Conf Prior, Conf Prior Inc).
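The binary metrics in the bottom row of Figure 4 can be computed along these lines (a sketch using scikit-learn; the arrays are placeholder data, and macro-averaged accuracy is computed here as balanced accuracy, i.e. the mean of the two per-class accuracies):

import numpy as np
from sklearn.metrics import average_precision_score, balanced_accuracy_score

# y_true: 1 if the participant answered incorrectly, 0 otherwise.
# p_error: a model's predicted probability that the participant errs.
y_true = np.array([0, 1, 0, 0, 1])             # placeholder labels
p_error = np.array([0.1, 0.7, 0.3, 0.2, 0.6])  # placeholder scores

# Macro-averaged accuracy on the binary task: the mean of per-class
# accuracies, so the rarer error class is not swamped by correct answers.
macro_acc = balanced_accuracy_score(y_true, p_error >= 0.5)

# Average precision (AP) for predicting user errors.
ap = average_precision_score(y_true, p_error)

print(f"macro accuracy: {macro_acc:.3f}, AP: {ap:.3f}")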
We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. CleverBirds is among the largest benchmarks of its kind, offering substantially more learnable concepts than existing alternatives. With it, we hope to open new avenues for studying how visual expertise develops over time and across individuals.
If you found CleverBirds useful, please consider citing our work:
@inproceedings{bossemeyercleverbirds,
  title={CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing},
  author={Bossemeyer, Leonie and Heinrich, Samuel and Van Horn, Grant and Mac Aodha, Oisin},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025}
}