Human Capital Becomes the Forecasting Benchmark
A model benchmark can tell an institution how a tool performed alone. It cannot tell the institution who will use the tool as a collaborator, a crutch, or a confirmation device.
The Paper
The paper is Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting, arXiv:2607.02467 [cs.CY, cs.AI]. The arXiv record lists Vivienne Ming as the author, version 1 as submitted on July 2, 2026, and the comment as "4 pages, 1 figure, PNAS brief style." The PDF lists the affiliations as The Human Trust, Possibility Science, and UCL Global Business School for Health, and labels the work a preprint pilot study.
The site already has pages on prediction markets, model benchmark monoculture, and human-agent collaboration. Ming's paper asks a narrower question: when people and models forecast together, is the useful predictor the model's standalone score, the person's raw cognitive score, or the person's collaboration style?
The Setup
The pilot recruited 108 adults by flyer in Berkeley, California, paying 20 dollars per session for three sessions. The main study used 78 participants, 42 UC Berkeley students and 36 community adults, organized into 26 three-person teams. Twelve teams worked in a human-only condition. Fourteen teams had access to one of four large language models. After one participant was excluded for an out-of-range Brier score, the main analysis used 77 participants.
The forecasting task used 30 live Polymarket contracts that resolved between November 2025 and January 2026, covering economics, international relations, and business. Each team forecast 10 randomly drawn questions. The AI-only baseline had Llama 3.1 8B, Qwen3 8B, GPT-4o, and Gemini 3 Pro forecast all 30 questions independently. Accuracy was measured with scaled Brier score, where lower is better and an uninformative 0.5 forecast scores 25.
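The scoring rule can be sketched in a few lines. This is a minimal sketch, assuming the scaled Brier score is 100 times the squared error between the forecast probability and the binary outcome, which is the scaling consistent with a 0.5 forecast scoring 25. The function names are illustrative and do not come from the paper.

```python
def scaled_brier(forecast: float, outcome: int) -> float:
    """Scaled Brier score for one binary question: 100 * (p - o)^2.

    Assumption: the factor of 100 is inferred from the statement that an
    uninformative 0.5 forecast scores 25; the paper's exact formula is
    not reproduced here.
    """
    if not 0.0 <= forecast <= 1.0:
        raise ValueError("forecast must be a probability in [0, 1]")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 (did not occur) or 1 (occurred)")
    return 100.0 * (forecast - outcome) ** 2


def mean_scaled_brier(forecasts, outcomes) -> float:
    """Average scaled Brier score over a set of resolved questions."""
    forecasts, outcomes = list(forecasts), list(outcomes)
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    scores = [scaled_brier(p, o) for p, o in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A 0.5 forecast scores 25 whichever way the contract resolves.
    print(scaled_brier(0.5, 1), scaled_brier(0.5, 0))
```

Under this scaling a confident wrong forecast is punished heavily: a 0.9 forecast on a contract that resolves "no" scores about 81, more than three times the uninformative score.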
Three Modes
The paper reports a trimodal pattern among AI-assisted forecasters rather than a single "AI helps" average. Human-only forecasters scored 14.7, and the four-model AI-only baseline averaged 5.5. Automators, who largely adopted the model's answer, scored 10.4: better than the human-only group but worse than the AI-only baseline. Validators, who used the model to check a prior guess, scored 31.7, worse than the human-only group and worse than the 25 an uninformative 0.5 forecast would earn. Cyborgs, the paper's label for iterative complementary reasoners, scored 3.8, better than the AI-only mean and near the market benchmark reported as 3.5.
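The ordering is easier to see with the reported figures side by side. The sketch below only rearranges numbers quoted in this article; the dictionary keys and helper names are illustrative labels, not identifiers from the paper.

```python
# Scaled Brier scores as reported in the article (lower is better).
REPORTED_SCORES = {
    "market": 3.5,
    "cyborg": 3.8,
    "ai_only": 5.5,
    "automator": 10.4,
    "human_only": 14.7,
    "validator": 31.7,
}

# Score of an uninformative 0.5 forecast on the same scale.
UNINFORMATIVE = 25.0


def rank_groups(scores: dict) -> list:
    """Return group names ordered from best (lowest) to worst score."""
    return sorted(scores, key=scores.get)


def worse_than(scores: dict, threshold: float) -> list:
    """Return the groups whose score is strictly worse than a threshold."""
    return [name for name, score in scores.items() if score > threshold]


if __name__ == "__main__":
    print(rank_groups(REPORTED_SCORES))
    print(worse_than(REPORTED_SCORES, UNINFORMATIVE))
```

Only the Validator group lands above the uninformative line, and both Automators and Validators land above the AI-only baseline, which is why a single hybrid average would mislead.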
This is the governance point. The same nominal tool access produced three different operating modes. One mode outsourced judgment. One mode laundered a prior belief through the model. One mode used the model as part of an active reasoning loop. A deployment plan that records only "users have AI assistance" hides the variable that mattered.
The Human Side
The paper found that among hybrid forecasters, raw cognitive ability did not predict accuracy: the reported correlations for a general cognitive proxy and fluid reasoning were small and not significant. Collaborative human capital did better. Perspective-taking predicted lower error, with the paper reporting r = -0.32 and p = .04. Curiosity and intellectual humility trended in the same direction.
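The reported correlation can be sanity-checked with the standard t-test for a Pearson r. This is a rough sketch, not the paper's analysis: it assumes a sample of 42 hybrid forecasters (14 teams of three), a figure this article does not state for the correlation, and the function name is illustrative.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for testing a Pearson correlation against zero.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
    """
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    # ASSUMPTION: n = 42 (14 hybrid teams x 3 people); not confirmed by the source.
    print(round(t_from_r(-0.32, 42), 2))
```

With 40 degrees of freedom the two-sided 5 percent critical value is about 2.02, so a t statistic near -2.14 sits just past it. That is in line with a p-value around .04, and it shows how close to the threshold the result is at this sample size.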
The group differences were large in the pilot. Cyborg forecasters exceeded other hybrid forecasters on intellectual humility, curiosity, and perspective-taking. The paper reports raw group means of 6.5 versus 4.4 for intellectual humility, 6.7 versus 5.1 for curiosity, and 6.5 versus 4.6 for perspective-taking, with all three differences reported at p < .01. None of these is a model feature. They are traits the person brings to the human-machine relation.
Collaboration Receipt
A human-AI collaboration receipt should therefore record more than model name, benchmark rank, and task score. It should include the human role, the question type, forecast timestamp, market state, model used, prompt surface, whether the user copied, challenged, revised, or decomposed the model output, the individual forecast before and after model access, the Brier score, and any measured collaboration training or screening variables.
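One way to make the receipt concrete is a small record type. The sketch below is illustrative only: the class name, field names, and enum values are assumptions chosen to mirror the list above, and the scaled Brier calculation assumes the 100 times squared error scaling implied by a 0.5 forecast scoring 25.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, Optional, Tuple


class OutputHandling(Enum):
    """What the user did with the model output (hypothetical labels)."""
    COPIED = "copied"
    CHALLENGED = "challenged"
    REVISED = "revised"
    DECOMPOSED = "decomposed"


@dataclass(frozen=True)
class CollaborationReceipt:
    human_role: str
    question_type: str
    forecast_timestamp: datetime
    market_probability: float          # market state when the forecast was made
    model_name: str
    prompt_surface: str                # e.g. chat window or API wrapper
    handling: Tuple[OutputHandling, ...]
    forecast_before: float             # individual forecast before model access
    forecast_after: float              # individual forecast after model access
    outcome: Optional[int] = None      # 1 or 0 once resolved, else None
    collaboration_measures: Dict[str, float] = field(default_factory=dict)

    def shift(self) -> float:
        """How far the forecast moved after model access."""
        return self.forecast_after - self.forecast_before

    def scaled_brier(self) -> Optional[float]:
        """Scaled Brier score of the final forecast, or None if unresolved."""
        if self.outcome is None:
            return None
        return 100.0 * (self.forecast_after - self.outcome) ** 2


if __name__ == "__main__":
    receipt = CollaborationReceipt(
        human_role="analyst",
        question_type="economics",
        forecast_timestamp=datetime(2025, 12, 1, tzinfo=timezone.utc),
        market_probability=0.70,
        model_name="example-model",    # placeholder, not a model from the study
        prompt_surface="chat",
        handling=(OutputHandling.CHALLENGED, OutputHandling.REVISED),
        forecast_before=0.5,
        forecast_after=0.8,
        outcome=1,
    )
    print(receipt.shift(), receipt.scaled_brier())
```

Keeping the before and after forecasts on the same record is the design choice that matters: a shift of zero paired with a copied or unchallenged output is the kind of trace that would separate a Validator or Automator pattern from an iterative one.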
This matters for labor and governance. If a workplace buys a forecasting assistant, the relevant control may be training people to interrogate it, not swapping in a higher-ranked model. If a team is full of Validators, adding a better model may harden bad priors. If the work rewards Automators, the institution has not built hybrid intelligence. It has built delegation with a human signature.
Limits
The paper is careful about scope. It is a pilot with small Cyborg and Validator cells, nine participants each. Interaction style was emergent and collinear with human capital, so style comparisons are descriptive rather than causal. The paper also notes low within-condition variance, a four-model baseline, no independent re-audit of per-question market calibration, and a planned pre-registered replication.
Those limits should travel with any use of the result. The claim is not that three traits certify who should use AI. The stronger claim is institutional: if the value of AI assistance depends on how people reason with it, model benchmarks are not enough evidence for deployment. The user population becomes part of the evaluation object.
Sources
- Vivienne Ming, Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting, arXiv:2607.02467 [cs.CY, cs.AI].
- arXiv PDF for Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting, checked for affiliations, study design, participant counts, forecasting task, model baselines, Brier scores, trait measures, results, supplementary notes, and limitations.
- arXiv listing pages for Computers and Society and Artificial Intelligence, checked for current subject listing and comment metadata.