PUBLICATIONS

Medal Matters: Probing LLMs’ Failure Cases Through Olympic Rankings

COLM 2025 ORLGen Workshop

Juhwan Choi, Seunguk Yu, JungMin Yun, YoungBin Kim


Large language models (LLMs) have achieved remarkable success in natural language processing tasks, yet their internal knowledge structures remain poorly understood. This study examines these structures through the lens of historical Olympic medal tallies, evaluating LLMs on two tasks: (1) retrieving the medal counts of specific teams and (2) identifying each team's ranking. While state-of-the-art LLMs excel at recalling medal counts, they struggle to provide rankings, highlighting a key difference between their knowledge organization and human reasoning. These findings shed light on the limitations of LLMs' internal knowledge integration and suggest directions for improvement. To facilitate further research, we release our code, dataset, and model outputs.

View on arXiv