The Physical Property Becomes the Affordance Test
Affordance reasoning asks what an object can do because of what it is physically like. Affordance20Q hides the object's name and asks whether a model can infer the action from material, shape, size, and surface evidence rather than from a memorized object-function pair.
The Paper
The paper is AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties, arXiv:2606.14240 [cs.AI], by Yifan Jiang, Meige Yang, Zitong Li, and Jay Pujara. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14240. The arXiv HTML shows two affiliations: the Information Sciences Institute at the University of Southern California, and the University of Southern California itself.
The core problem is benchmark leakage through object identity. If a model is told the object is a knife, answering that it affords cutting may be recall rather than physical reasoning. Affordance20Q removes the object name and makes the model ask questions about observable physical properties.
The Game
Each game has a hidden target object and a candidate set of eight affordances. One affordance is correct and seven are distractors. The Questioner sees only the candidate affordances and asks one yes/no question per turn about a static physical property of the hidden object. A Checker validates that the question is well-formed and grounded in material, shape, size, or surface properties. An Oracle sees the hidden object and property set and answers the validated question.
The game succeeds if the Questioner identifies the target affordance within the 20-turn budget. It fails if the model guesses incorrectly or runs out of turns. This turns affordance reasoning into an active evidence-gathering problem rather than a one-shot classification problem.
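The turn structure described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the names play_game, Move, and GameResult, the toy Questioner, Checker, and Oracle, and the example property wording are all hypothetical. It also assumes that a rejected question and the final guess each consume one turn, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Optional

TURN_BUDGET = 20  # turn budget stated in the text


@dataclass
class Move:
    """One Questioner turn: either a yes/no question or a final guess."""
    question: Optional[str] = None
    guess: Optional[str] = None


@dataclass
class GameResult:
    success: bool
    turns_used: int
    transcript: list = field(default_factory=list)


def play_game(questioner, checker, oracle, candidates, target, budget=TURN_BUDGET):
    """Run one game; the Questioner never sees the hidden object's name."""
    transcript = []
    for turn in range(1, budget + 1):
        move = questioner(candidates, transcript)
        if move.guess is not None:
            # A guess ends the game: success only if it is the target affordance.
            return GameResult(move.guess == target, turn, transcript)
        if not checker(move.question):
            # Assumption: an invalid question still costs a turn.
            transcript.append((move.question, "invalid"))
            continue
        transcript.append((move.question, oracle(move.question)))
    # Budget exhausted without a guess counts as a failure.
    return GameResult(False, budget, transcript)


# --- Toy stand-ins (hypothetical; real roles would be LLM calls) ---
def toy_checker(question):
    keywords = ("made of", "shape", "size", "surface")
    return question.endswith("?") and any(k in question.lower() for k in keywords)


def toy_oracle(question):
    hidden_properties = {"material": "metal"}  # invented property set
    return "yes" if hidden_properties["material"] in question.lower() else "no"


def scripted_questioner(candidates, transcript):
    if not transcript:
        return Move(question="Is it made of metal?")
    return Move(guess="conduct_heat" if transcript[-1][1] == "yes" else candidates[0])


# Seven names come from the text; "cut" is an illustrative eighth candidate.
candidates = ["float_on_water", "conduct_heat", "transmit_light", "sink_in_water",
              "hang_from_above", "hold_between", "ignite", "cut"]
result = play_game(scripted_questioner, toy_checker, toy_oracle,
                   candidates, target="conduct_heat")
```

Keeping the three roles as plain callables makes it easy to swap a scripted Questioner for a model-backed one without touching the loop.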
Dataset Construction
The released benchmark contains 1,009 games over 454 objects and 59 affordances. The construction pipeline starts from human-made physical objects and Commonsense Knowledge Graph affordance relations such as CapableOf and UsedFor, expands affordance candidates with GPT-4.1, and then filters out cases that depend on context, hidden mechanisms, or overly general functions.
The authors exclude object-affordance pairs whose affordance cannot be deduced from physical-property dimensions alone, such as a microwave's heating affordance depending on an internal magnetron. Six human annotators worked on manual annotation and refinement in stages 2 and 3. Three additional annotators re-annotated 1,298 sampled pairs, with 85.2 percent majority agreement against the released labels.
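A majority-agreement figure like the one above can be computed as follows. This is a sketch under assumptions: the paper's exact procedure is not reproduced here, the function names are hypothetical, and the four sample rows are invented for illustration rather than taken from the dataset.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def majority_agreement(samples):
    """Fraction of pairs whose annotator majority matches the released label."""
    agree = sum(1 for votes, released in samples if majority_label(votes) == released)
    return agree / len(samples)


# Each row: (three re-annotator votes, released label). Invented example data.
samples = [
    (["valid", "valid", "invalid"], "valid"),
    (["valid", "valid", "valid"], "valid"),
    (["invalid", "invalid", "valid"], "valid"),  # majority disagrees with release
    (["invalid", "invalid", "invalid"], "invalid"),
]
agreement = majority_agreement(samples)
```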
The paper also validates the question-answering machinery. A single Qwen3-14B instance serves as Oracle and Checker at temperature 0, using the object's property set as context. Appendix D reports five volunteers for the human baseline on a 30 percent subset, three annotators for Oracle-answer verification on a 300-question sample, and six annotators plus three label verifiers for the construction process.
Results
The evaluation covers 15 LLMs as Questioners: ten open-source models and five closed-source models. The open-source group includes Qwen3-8B, Qwen3-14B, Qwen3.5-9B, Phi-4-14B, Llama-3.1-8B, Ministral-8B, Nemotron-9B, Gemma-3-12B, DeepSeek-V4-Pro, and DeepSeek-V4-Flash. The closed-source group includes GPT-5, GPT-5-mini, Gemini-2.5-Pro, Gemini-2.5-Flash, and MiniMax-M2.5.
Humans reach 64.2 percent success in an average of 10.7 turns on the sampled subset. The best reported model is Gemini-2.5-Pro at 45.9 percent success, and the strongest open-source baseline is DeepSeek-V4-Flash at 41.3 percent. The fixed-question baseline reaches only 24.8 percent despite using the full 20-question budget, showing that asking many generic questions is not enough.
The information-gain analysis is the diagnostic core. Models tend to start with material questions and shift toward shape questions, but the KL-based information-gain score collapses after turn 5 and approaches zero even while models keep asking questions. The failure is not merely that they ask too few questions. It is that later questions stop narrowing the candidate affordance set.
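One way to make a KL-based information-gain score concrete is to track a belief distribution over the eight candidate affordances and measure how far each answer moves it. The sketch below assumes the score is KL(posterior || prior) in bits per turn. The paper's exact formula and the way beliefs are estimated are not reproduced here, so the function names and the example belief vectors are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits; terms where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def information_gain_trace(beliefs):
    """Per-turn gain: divergence of each belief from the one before it."""
    return [kl_divergence(after, before) for before, after in zip(beliefs, beliefs[1:])]


uniform8 = [1 / 8] * 8              # before any question: eight equally likely candidates
half = [1 / 4] * 4 + [0.0] * 4      # an answer that rules out four candidates
beliefs = [uniform8, half, half]    # the second question changes nothing
trace = information_gain_trace(beliefs)
```

In this toy trace the first question yields one bit and the second yields zero, which is the pattern the analysis describes for late turns: a question is asked, but the candidate set does not narrow.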
Affordance difficulty varies sharply. Single-property affordances such as conduct_heat and transmit_light are easier, while multi-property or alternative-rule affordances such as sink_in_water, hang_from_above, float_on_water, hold_between, and ignite are much harder.
KARI
The paper's improvement method is KB-Anchored Rule Induction, or KARI. KARI uses LLMs plus knowledge bases to generate affordance rules that are grounded in physical commonsense evidence. Its components are a Rule Proposer, a Validator, and an Auditor. In the paper's run, Qwen3-14B is used for all components and produces 2,223 rules.
At inference time, candidate affordances are matched to generated rules by sentence similarity, with a threshold of 0.7. Matching rules are verbalized into the Questioner's system prompt. KARI improves open-source LLMs by up to 15.2 points, but the gains are limited by knowledge-base coverage and do not eliminate the human-model gap.
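The matching step above can be sketched as a threshold filter over rule similarities. The bag-of-words cosine used here is only a stand-in for whatever sentence-similarity model the paper uses, the comparison is assumed to be "greater than or equal to 0.7", and the rule texts and function names are invented for illustration.

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.7  # threshold stated in the text


def bow_cosine(a, b):
    """Stand-in similarity: cosine over word counts, not a sentence embedding."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def match_rules(candidates, rules, similarity=bow_cosine, threshold=SIM_THRESHOLD):
    """Keep (candidate, rule_text) pairs whose similarity clears the threshold."""
    matched = []
    for cand in candidates:
        for rule_affordance, rule_text in rules:
            if similarity(cand, rule_affordance) >= threshold:
                matched.append((cand, rule_text))
    return matched


def verbalize(matched):
    """Render matched rules as lines to append to the Questioner's system prompt."""
    return "\n".join(f"- For '{cand}': {text}" for cand, text in matched)


# Hypothetical rules; not taken from the paper's 2,223 generated rules.
rules = [
    ("conduct heat", "likely if the material is metal"),
    ("conduct electricity", "likely if the material is metal"),
    ("float on water surface", "likely if the object is hollow or low density"),
]
matched = match_rules(["conduct heat", "float on water"], rules)
prompt_addition = verbalize(matched)
```

A candidate with no rule above the threshold simply contributes nothing to the prompt, which is one way knowledge-base coverage can cap the gains.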
Governance Standard
An affordance benchmark should ship with an affordance receipt. The receipt should name:
- Game setup: hidden object, candidate affordance set, target affordance, distractor source, property dimensions, property set, and filtered object-identity fields.
- Checker and Oracle configuration: Checker prompt, Oracle prompt, Oracle model, Checker model, temperature, and turn budget.
- Play record: question transcript, invalid-question decisions, final guess, success rate, turn count, and information-gain trace.
- Human validation: human-baseline protocol, annotator roles, and inter-annotator agreement.
- Rule and filtering provenance: KARI rule match, knowledge-base evidence, and cases removed for context dependence or hidden mechanisms.
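A receipt of this kind could be carried as a small serializable record. The sketch below covers only a subset of the fields listed above. The class name, field names, consistency checks, and example values (including the hidden object and its properties) are hypothetical, while the Oracle and Checker model, temperature, and turn budget follow the figures given earlier in the text.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AffordanceReceipt:
    """Per-game record; a subset of the receipt fields, for illustration."""
    hidden_object: str
    candidate_affordances: list
    target_affordance: str
    property_set: dict
    oracle_model: str
    checker_model: str
    temperature: float
    turn_budget: int
    transcript: list = field(default_factory=list)
    final_guess: str = ""
    information_gain_trace: list = field(default_factory=list)

    def problems(self):
        """Return basic consistency issues; an empty list means none found."""
        issues = []
        if self.target_affordance not in self.candidate_affordances:
            issues.append("target not in candidate set")
        if len(self.transcript) > self.turn_budget:
            issues.append("transcript exceeds turn budget")
        if not self.final_guess:
            issues.append("no final guess recorded")
        return issues


receipt = AffordanceReceipt(
    hidden_object="copper pot",  # invented example object
    candidate_affordances=["conduct_heat", "float_on_water", "ignite"],
    target_affordance="conduct_heat",
    property_set={"material": "metal", "shape": "concave"},
    oracle_model="Qwen3-14B",
    checker_model="Qwen3-14B",
    temperature=0.0,
    turn_budget=20,
    transcript=[["Is it made of metal?", "yes"]],
    final_guess="conduct_heat",
    information_gain_trace=[1.0],
)
receipt_json = json.dumps(asdict(receipt), indent=2)
```

Serializing the receipt next to each game makes the hidden object, the Oracle configuration, and the transcript auditable together instead of being reconstructed after the fact.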
The governance point is physical deployment. A robot or embodied agent does not need to know only that a named object usually performs a named function. It needs to infer what can be pushed, cut, pierced, held, poured, floated, folded, swept, or heated from properties visible in the current situation. If a model is passing by object-name recall, the benchmark has measured a vocabulary association rather than a physical safety capability.
This connects directly to Embodied AI and Robotics, World Models and Spatial Intelligence, Vision-Language-Action Models, Reasoning Models, AI Evaluations, The Object Slot Becomes the Planning State, The Embodied Agent Becomes the Test-Time Scaling Problem, The Committed Plan Becomes the Action Gate, The Evaluation Bench Becomes the Test Rig, and The Benchmark Becomes the Curriculum. Physical reasoning evidence needs the same separation between recognition, inference, and action that safety cases need between score, system, and deployment.
Limits
The paper is careful about scope. Affordance20Q is text-only, so it does not test vision, touch, contact dynamics, force, friction, pose, or real-time interaction. It asks about static physical properties rather than complete embodied control.
The QA setting is also narrower than real-world planning. Real tasks may require affordance cues embedded in narratives, messy scenes, incomplete observations, or open-ended action plans rather than explicit yes/no property questions. The benchmark also depends on the coverage and quality of property descriptions, commonsense knowledge bases, and manually filtered labels.
The official GitHub repository linked by the paper is public at 1171-jpg/Affordance20Q, but at review time the browser listing showed only Readme.md, two commits, and no releases. That should be checked again before treating the artifact as a complete reproducibility package.
The Spiralist reading is that the object name is a shortcut. The world does not hand a robot a label before every consequence. It offers shape, material, size, surface, and contact. The test is whether the model can reason from those properties before the action becomes physical.
Sources
- Yifan Jiang, Meige Yang, Zitong Li, and Jay Pujara, AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties, arXiv:2606.14240 [cs.AI], submitted June 12, 2026.
- arXiv HTML: Affordance20Q: Evaluating Affordance Reasoning from Physical Properties, reviewed for affiliations, abstract, game formulation, construction pipeline, models, metrics, results, KARI, limitations, ethics note, prompts, oracle validation, and human annotation details.
- arXiv PDF: Affordance20Q: Evaluating Affordance Reasoning from Physical Properties.
- Official artifact link: 1171-jpg/Affordance20Q, reviewed for public repository status, Readme-only listing, commit count, and release status.
- Related pages: Embodied AI and Robotics, World Models and Spatial Intelligence, Vision-Language-Action Models, Reasoning Models, AI Evaluations, The Object Slot Becomes the Planning State, The Embodied Agent Becomes the Test-Time Scaling Problem, The Committed Plan Becomes the Action Gate, The Evaluation Bench Becomes the Test Rig, and The Benchmark Becomes the Curriculum.