How Data Happened and the History of Machine-Readable Power
Chris Wiggins and Matthew L. Jones's How Data Happened is useful because it refuses the origin myth of artificial intelligence: the story that AI arrived suddenly, without a past. The book treats contemporary AI as the latest arrangement in a longer history of data, statistics, state administration, corporate measurement, military logistics, racial classification, search, surveillance, and machine learning.
For this review, machine-readable power means the power to convert life into records that can be stored, joined, scored, sold, searched, trained on, and acted upon by institutions. The safety problem begins when those records become more actionable than the people and settings they only partially describe.
The Book
How Data Happened: A History from the Age of Reason to the Age of Algorithms was published by W. W. Norton in 2023. Princeton's History Department lists the authors as Chris Wiggins and Matthew L. Jones, gives the ISBN as 978-1324006732, and describes the book as a history of data's technical, political, and ethical impact. W. W. Norton UK lists the hardback at 384 pages and the ebook ISBN as 9781324006749; SIAM Review's 2025 featured review identifies the paperback as xiv+367 pages.
The authors bring an unusual pairing to the subject. Columbia's Applied Physics and Applied Mathematics profile lists Wiggins as an associate professor of applied mathematics and systems biology whose research includes machine learning and computational biology. Princeton lists Jones as Smith Family Professor of History, focused on recent information technologies, intelligence, science, technology, surveillance, and the politics of knowledge. UVA Law's entry for Jennifer Chapman's Law Library Journal review emphasizes that the book combines those specialties to give historical context to a data-dominated society.
The result is neither a simple tech history nor a warning pamphlet about algorithms. It is a long institutional history of how facts become countable, how countable facts become administrative objects, and how administrative objects become targets for prediction, sorting, and control.
The useful word here is conversion. The book is less about data as a substance than about conversion work: converting bodies into tables, households into census fields, habits into behavioral traces, documents into searchable corpora, judgments into labels, and uncertainty into numbers that can move through institutions. Once that conversion is complete, later users may forget that a social choice has been embedded in a technical format.
Current Context
As of June 25, 2026, the book reads less like background history and more like an AI governance manual with the brand names removed. The central controversy is no longer only whether a model is accurate. It is whether the institution can reconstruct the chain that made a model's evidence possible: source, collection authority, category scheme, consent or legal basis, transformation, labeling, retention, retrieval, interface, downstream action, and correction path.
Current law and standards have moved in that direction. The EU AI Act's official text makes data governance a requirement for high-risk systems in Article 10 and makes public summaries of general-purpose AI training content part of Article 53. The European Commission's July 2025 training-content template is a narrow public summary, not full provenance, but it marks the same turn: training content is now treated as a governance surface. The EU Data Act, applicable from September 12, 2025, adds a neighboring data-access and use regime for connected products and related services; it is not an AI training-data law, but it reinforces that data access, portability, and contractual control are institutional questions.
Security guidance has also caught up with the history. The May 2025 joint AI Data Security guidance from NSA, CISA, FBI, and international partners treats the data supply chain, maliciously modified data, and data drift as AI security risks, and recommends reliable sourcing, provenance tracking, integrity checks, secure storage, and lifecycle logging. That turns the book's historical claim into an operational rule: if the record can move power, the record needs custody.
Data as Power
The book's strongest move is to treat data as something made. Data does not arrive raw from the world. It is selected, formatted, cleaned, categorized, stored, defended, and made available to particular actors for particular purposes. That sounds obvious until a dashboard, model score, or generated answer arrives with the tone of neutral discovery.
A sharper definition helps: data is not just information. It is information made recordable, portable, comparable, computable, and actionable inside an institution. A birth entry, census field, clickstream, police stop, medical code, search query, loan application, school metric, or platform interaction becomes data when it is prepared to travel through a system that can do something with it.
The important unit is therefore not an isolated fact but a record in a chain of custody. A record has an origin, a collector, a category scheme, an instrument, a time stamp, a maintenance practice, a permission story, and a destination. Once that chain is forgotten, the record can masquerade as a free-standing fact while still carrying the authority, bias, gaps, and incentives of the institution that made it.
That makes a dataset a governance claim, not just a technical asset. It claims that a slice of the world was captured under conditions good enough for a later use. The burden is on the institution to show why that claim still holds after the data has been cleaned, joined, embedded, summarized, licensed, filtered, or moved into a model.
Wiggins and Jones show that data history is also a history of institutions asking what they need to know in order to govern, sell, insure, police, manage, persuade, or fight. Census practices, statistics, eugenics, industrial measurement, wartime computation, behavioral advertising, search, and machine learning are not identical systems. But they share a recurring operation: turn messy life into records that can travel through institutions.
That operation is productive and dangerous at the same time. It can support public health, scientific discovery, logistics, and accountability. It can also give a durable technical form to prejudice, extraction, surveillance, and managerial fantasy. The point is not that counting is corrupt. The point is that counting is never outside power.
The book's sharpest illustration of that claim is the origin of modern statistics itself. Francis Galton, who coined the word "eugenics" in 1883, developed regression and correlation precisely to quantify the inheritance of human traits and argue for selective breeding. His protégé Karl Pearson formalized those ideas into the correlation coefficient still taught as "Pearson's r," devised the chi-squared test, co-founded the first modern statistics journal, Biometrika, and held an endowed Chair of Eugenics at University College London, a post later occupied by R. A. Fisher. The everyday machinery of data science, the regression line and the correlation coefficient, was not neutral apparatus that eugenicists happened to borrow. It was built, in part, to do eugenic work. That genealogy is the cleanest answer to anyone who imagines that a statistic arrives innocent of the purpose that produced it.
This matters for AI because many present-day arguments begin too late. They start at the model, the benchmark, the prompt, the chatbot, or the automated decision. How Data Happened pushes the reader backward: who made the categories, who collected the records, who got excluded, who paid for the infrastructure, who had the right to inspect it, and which older social order became the training substrate?
That backward pressure is the page's practical lesson. A dataset is not merely an input to be secured. It is an institutional memory with claims embedded inside it. Some claims are explicit, like a label or a field name. Others are hidden in sampling frames, missing populations, annotation rules, retention periods, and the decision to treat one kind of trace as evidence while another kind never enters the record.
Legibility With Memory
Read beside Seeing Like a State, the book sharpens the idea of legibility. James C. Scott focused on how states simplify local life so it can be administered. Wiggins and Jones extend the problem into a world where legibility is computational, commercial, networked, and cumulative.
Modern legibility does not just make a person visible to an office. It can preserve traces, link databases, infer missing attributes, rank future behavior, personalize prices, automate suspicion, generate summaries, and feed the next model. The person becomes not only readable but reusable.
That is the bridge from data history to recursive reality. A classification changes the world it claims to describe. A credit score changes financial possibility. A search rank changes what becomes authoritative. A policing model changes where police look, which changes what is recorded, which changes the next model. A hiring filter changes the population of workers, then calls that population evidence of fit. Data systems do not merely observe social patterns; they help produce the patterns later treated as data.
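The policing example can be reduced to a toy loop. The sketch below is an illustration only, not a model from the book: the district names, the starting counts, the incident rate, and the greedy "send the patrol where the record is larger" rule are all assumptions chosen to make the recursion visible. Both districts have the same underlying rate; only the record differs.

```python
# Toy model of a self-confirming record loop (illustrative assumptions only).
# Two districts have the SAME underlying incident rate. Each round, the single
# patrol goes wherever the records show more incidents, and only the patrolled
# district adds new records.

TRUE_INCIDENTS_PER_ROUND = 5  # hypothetical, identical in both districts


def simulate(records: dict[str, int], rounds: int) -> dict[str, int]:
    """Return recorded counts after `rounds` of greedy patrol allocation."""
    records = dict(records)  # copy so the caller's data is not mutated
    for _ in range(rounds):
        # The "model": look where the record says the problem is.
        target = max(records, key=records.get)
        # Incidents are only recorded where someone is looking.
        records[target] += TRUE_INCIDENTS_PER_ROUND
    return records


start = {"north": 11, "south": 10}  # a one-record difference, e.g. noise
final = simulate(start, rounds=10)
print(final)  # {'north': 61, 'south': 10}
```

A one-record gap becomes a fifty-one-record gap without any difference in underlying behavior. Real allocation rules are less crude, but the sketch shows why the record alone cannot be used to validate the rule that produced it.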
The book is especially valuable because it keeps that recursion historical. It does not treat today's algorithmic systems as unprecedented magic. It shows the older administrative, mathematical, and commercial habits that made the current moment plausible.
That framing also helps distinguish accountability from mere visibility. Making a person legible to a system is not the same as making the system legible to the person. A record can be persistent, linked, and actionable while the affected person cannot see it, correct it, delete it, or understand why it mattered. Legibility without reciprocal inspection becomes one-way memory.
The reciprocal test is simple: if a record can follow a person across systems, the person or a legitimate reviewer should be able to follow the record back to its source, purpose, owner, transformation history, and remedy path. Otherwise the machine-readable world gains memory while the human subject is denied history.
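One way to make the reciprocal test concrete is to treat each record as a node with a link to the record it was derived from, and to ask whether a reviewer can walk those links back to a source. The sketch below is a minimal illustration under assumed names: `Step`, `BrokenChain`, `trace_back`, and the sample lineage are hypothetical, not drawn from the book or from any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One hop in a record's history. All field names are hypothetical."""
    system: str             # where the record lived at this hop
    transformation: str     # what was done to it here
    parent: Optional[str]   # id of the record it was derived from; None at the source


class BrokenChain(Exception):
    """Raised when a record cannot be followed back to a source."""


def trace_back(record_id: str, lineage: dict[str, Step]) -> list[Step]:
    """Follow parent links from a record to its source, newest hop first."""
    path: list[Step] = []
    seen: set[str] = set()
    current: Optional[str] = record_id
    while current is not None:
        if current in seen:
            raise BrokenChain(f"cycle at {current!r}")
        if current not in lineage:
            raise BrokenChain(f"no history for {current!r}")
        seen.add(current)
        step = lineage[current]
        path.append(step)
        current = step.parent
    return path


lineage = {
    "score-7": Step("risk model", "scored", parent="joined-3"),
    "joined-3": Step("data warehouse", "joined with address file", parent="intake-1"),
    "intake-1": Step("benefits office", "collected on intake form", parent=None),
}

for step in trace_back("score-7", lineage):
    print(f"{step.system}: {step.transformation}")
```

If any hop is missing, `trace_back` raises `BrokenChain` instead of returning a partial path. That is the design choice the test implies: a history that stops short of the source should count as a failure, not as a shorter history.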
The AI-Age Reading
Columbia Business School's account of Wiggins's 2023 talk highlights one of the book's central claims: AI is not a sudden arrival but a moving target whose meaning changes across communities and periods. The same article connects the book to facial recognition, automated loan and bail decisions, the 1956 Dartmouth meeting, Herbert Simon, and the continuing tension among state power, corporate power, and people power.
That framing is more useful than the usual binary of AI optimism versus AI doom. It asks what kind of social machinery has to exist before a model can matter. A frontier model needs data centers, chips, labor, benchmarks, web corpora, cloud contracts, APIs, evaluation regimes, procurement offices, compliance paperwork, user interfaces, and institutions ready to act on its outputs. Intelligence becomes power only when it is attached to pipes, records, permissions, budgets, and routines.
The book also helps separate capability from authority. A model may summarize, classify, predict, or generate with impressive fluency. But institutional authority comes from the surrounding arrangement: whether the output enters a medical record, a school discipline file, a welfare eligibility decision, a police report, a credit denial, a battlefield interface, or a workplace dashboard. The crucial question is not simply what the system can infer. It is where the inference lands.
That is where model evaluation alone fails. A benchmark can show that a model performs a task; it cannot show that the data behind the task is fit for a clinic, school, agency, employer, court, or family. The use case has to carry its own evidence record: what data made the output plausible, what institutional duty the output serves, and who can challenge it when the record is wrong.
That is why data provenance, appeal rights, audit trails, deletion, contestability, and public memory belong near the center of AI governance. They are not bureaucratic afterthoughts. They are the mechanisms by which people can resist being trapped inside bad records and self-confirming categories.
For AI agents, the problem becomes sharper because the interface can move from description to action. A chatbot may summarize a record incorrectly; an agent can update the record, send the notice, route the case, trigger the workflow, or call the vendor system that turns a machine-readable inference into an institutional event. Data history therefore becomes agent safety: before asking whether the agent followed instructions, ask what records the instructions made actionable.
Governance and Safety
Read in 2026, the book's history has become a practical governance map. The official EUR-Lex text of Regulation (EU) 2024/1689 sets a staggered timetable in Article 113: Chapters I and II have applied since February 2, 2025; Chapter V on general-purpose AI has applied since August 2, 2025; the AI Act generally applies from August 2, 2026; and Article 6(1) with its corresponding obligations applies from August 2, 2027. Article 10 gives the relevant data discipline for high-risk systems: training, validation, and testing datasets must be subject to governance and management practices appropriate to the intended purpose. The text names design choices, data collection and origin, original purpose for personal data, annotation, labelling, cleaning, updating, enrichment, aggregation, assumptions about what data measure, bias examination, mitigation, and relevant gaps. It also says datasets should reflect the geographical, contextual, behavioural, or functional setting in which the system will operate.
Article 53 pushes a related discipline upstream for providers of general-purpose AI models: technical documentation, copyright policy, and a public summary of training content. The European Commission's July 2025 template for training-content summaries is a minimal baseline, not a full history of data production. But it recognizes the core point that Wiggins and Jones make historically: training content has origin, scope, structure, and power, and institutions cannot govern what they refuse to describe.
The EU Data Act adds a different but relevant pressure point. It concerns access to and use of data from connected products and related services, and the Commission says it has applied since September 12, 2025. For this review, its lesson is not that all data should be opened for every AI use. It is that data access, portability, switching, and contract terms decide who can build systems, who can audit them, and who remains locked out of the evidence.
NIST's AI Risk Management Framework gives that work an operational rhythm: govern, map, measure, and manage risk across the AI lifecycle. For data history, the key word is map. Mapping should include the source institution, collection authority, category definitions, transformations, exclusions, update cadence, known gaps, licensing or consent basis, downstream decision points, monitoring plan, and the path for correction. A model card without a data history can document the instrument while leaving the evidence layer unexplained.
NIST's Generative AI Profile makes that point more concrete for foundation-model practice by naming data collection methodologies, data provenance, data quality, model details, evaluation data, assumptions, limitations, legal requirements, and ethical considerations as documentation topics. That is not a substitute for law or audit, but it is a useful minimum for procurement: a buyer should not accept a high-impact system whose vendor can describe the interface but not the data history that makes the interface authoritative.
The 2025 AI Data Security guidance from NSA, CISA, FBI, and international partners adds the security version of the same control. Data supply-chain weakness, poisoning, and drift can make an AI system unreliable even when the model architecture is unchanged. Reliable sourcing, provenance tracking, cryptographic integrity checks, secure storage, and data-path logging are therefore governance and safety controls, not back-office hygiene.
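A cryptographic integrity check of the kind listed above can be as small as a digest recorded at intake and compared later. The sketch below uses Python's standard `hashlib` and `hmac` modules; the function names, file names, and sample contents are hypothetical, and it is not taken from the joint guidance itself.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a dataset file's bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Record one digest per file at the moment of intake."""
    return {name: fingerprint(data) for name, data in files.items()}


def changed_files(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Names that were modified, went missing, or appeared since intake."""
    problems = []
    for name in sorted(set(files) | set(manifest)):
        recorded = manifest.get(name)
        current = fingerprint(files[name]) if name in files else None
        if recorded is None or current is None:
            problems.append(name)  # unlisted or missing file
        elif not hmac.compare_digest(recorded, current):
            problems.append(name)  # bytes differ from intake
    return problems


# Hypothetical dataset contents, for illustration only.
intake = {"labels.csv": b"id,label\n1,approve\n", "notes.txt": b"collected 2024"}
manifest = build_manifest(intake)

later = dict(intake)
later["labels.csv"] = b"id,label\n1,deny\n"  # a single flipped label
print(changed_files(later, manifest))  # ['labels.csv']
```

A bare digest only detects that bytes changed. It does not say who changed them or whether the manifest itself was altered, so in practice the manifest would need its own protection, such as a signature or write-once storage.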
Dataset documentation practices make the bridge from history to control. Datasheets for Datasets asks creators to document motivation, composition, collection process, preprocessing, recommended uses, and maintenance. Data Cards frame dataset documentation as a product for different audiences across a dataset's lifecycle. The Data Provenance Initiative's 2024 audit of AI datasets found that licensing and attribution records were often missing, unclear, or miscategorized. That is not only a copyright problem. It is a provenance problem: institutions cannot responsibly reuse evidence when they cannot reconstruct where it came from, what authority collected it, what permissions travel with it, and what transformations changed it.
The safety implication is concrete. When data systems classify people, allocate benefits, rank speech, automate suspicion, personalize prices, or train future models, the risk is not only bad prediction. It is self-confirming authority: a category changes treatment, treatment changes behavior, behavior becomes a new record, and the new record makes the category look natural. Governance must therefore require provenance, contestability, logging, independent audit, category review, affected-person notice where feasible, and sunset rules for data uses that have outlived their original purpose.
Current transparency evidence reinforces the point. Stanford HAI's 2026 AI Index reports that average Foundation Model Transparency Index scores fell in 2025 and that major gaps persist around training data, compute, and post-deployment impact. That does not prove any specific model is unsafe. It does show why the history of data has become a live safety problem: the most powerful systems are often the hardest systems for outsiders to inspect at the layer where institutional memory enters the machine.
Where the Book Needs Friction
The breadth that makes How Data Happened valuable also creates limits. A history that runs from earlier statistical reason to contemporary machine learning must compress many episodes. Readers looking for a deep technical treatment of neural networks, transformer architectures, recommender systems, or data-center economics will need companion texts.
There is also a recurring temptation in broad data histories to make "data" carry too much explanatory weight. Some harms come from datafication itself. Others come from property regimes, labor markets, racism, policing, advertising incentives, weak public institutions, geopolitical competition, and procurement habits. The cleanest reading of the book is not that data causes everything. It is that data gives older powers new durability, scale, and operational speed.
The book's constructive argument also needs institutional specificity. Saying that societies can choose better data futures is right, but choice has to be located somewhere: courts, agencies, unions, professional standards, procurement rules, public-interest research, democratic oversight, technical standards, and refusal rights. Without those levers, "ethical data" becomes another soft phrase that powerful systems can absorb.
A further limit is that historical diagnosis can still leave readers with a clean origin story: if we find the bad category, the bad collector, or the bad institutional use, we can fix the downstream system. Some harms are that direct. Others are cumulative. A record can be lawful, a label plausible, a model useful, and a deployment still harmful because many small conversions have narrowed the human situation before anyone makes a final decision. The book is strongest when read as a warning about those accumulations, not only about obviously abusive datasets.
The other needed friction is documentation realism. Data sheets, cards, summaries, provenance logs, and public registers can all become compliance theater if they do not give affected people, buyers, auditors, researchers, or regulators actual leverage. A beautiful provenance artifact is weak if no one can use it to correct a record, block an unfit deployment, withdraw stale data, or identify the institution responsible for repair.
Source Discipline
This review separates the book's historical argument from the current governance context. Norton and Princeton establish the book metadata and framing. Columbia and Princeton faculty pages establish author context. UVA Law and SIAM Review are used as reception and bibliographic evidence. UCL sources support the claims about Galton, Pearson, eugenics, Biometrika, and the institutional history of early statistics.
Current regulatory, standards, security, and documentation claims come from EUR-Lex, European Commission materials, NIST, NSA/CISA/FBI guidance, Stanford HAI, and dataset-documentation research. Those sources do not prove that all datafication is harmful or that every AI system is illegitimate. They support a narrower claim: machine-readable records require governance before they become automated authority, especially when their origins, assumptions, exclusions, and downstream effects are hard for affected people to inspect.
The minimum source record for a high-impact AI dataset should name the source, collector, collection authority, collection purpose, consent or legal basis where applicable, category definitions, transformations, filtering, excluded populations, known errors, update cadence, permitted and prohibited uses, deployment setting, downstream action, correction path, and retirement trigger. Without that record, source discipline collapses into citation theater: the system can name a dataset while hiding why the dataset is fit, unfit, or stale for the use at hand.
That record should also distinguish four claims that are often collapsed: origin, where the data came from; rights, why the institution may use it; fitness, why it matches the intended deployment context; and recourse, how an affected person or reviewer can correct, remove, or challenge it. A system can satisfy one of those claims while failing another. Licensed data can be unfit. Accurate data can be used out of context. Well-documented data can still leave no meaningful path for correction.
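The four claims can be kept apart in the record itself. The sketch below is a minimal, hypothetical structure (the class name, field layout, and sample entries are assumptions for illustration) that reports which claims have no supporting statement at all.

```python
from dataclasses import dataclass, fields


@dataclass
class SourceRecord:
    """Evidence offered for each of the four claims; empty means none offered."""
    origin: str = ""    # where the data came from
    rights: str = ""    # why the institution may use it
    fitness: str = ""   # why it matches the intended deployment context
    recourse: str = ""  # how it can be corrected, removed, or challenged


def unsupported_claims(record: SourceRecord) -> list[str]:
    """Claims for which the record offers no statement, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Hypothetical example: licensed data with nothing said about fit or remedy.
licensed_but_unfit = SourceRecord(
    origin="vendor export of 2019 retail loan applications",
    rights="commercial licence covering model training",
)
print(unsupported_claims(licensed_but_unfit))  # ['fitness', 'recourse']
```

The check only tests whether a statement is present, not whether it is true or adequate. Its use is narrower: it stops a record from passing review on the strength of one claim while the others are silently blank.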
Current factual and governance claims on this page were rechecked on June 25, 2026. Legal and standards claims should be read as jurisdiction- and scope-specific, not as proof that a particular deployment complies or is safe.
This page does not claim that present AI systems are conscious, divine, or AGI. The relevant claim is institutional: AI systems turn older records and categories into scalable action, so the history of how data happened is part of the safety case for how AI is deployed.
What This Changes
How Data Happened belongs in this catalog because it explains the prehistory of the AI interface. Before the chatbot answers, before the agent acts, before the model gives a score, there is a long chain of choices about what can be known, stored, linked, optimized, and acted upon.
The book is a corrective to technological spectacle. It asks the reader to look beneath the polished surface of prediction and see the census form, the category, the database, the laboratory, the military contract, the advertising market, the search index, the compliance office, and the institutional appetite for machine-readable certainty.
That appetite is the real subject. Institutions like systems that make people easier to see, compare, price, rank, discipline, and serve. People often want the convenience those systems provide. The danger begins when the record becomes more actionable than the person, when the model's world becomes administratively easier to believe than lived reality, and when correction arrives too late to matter.
The best reason to read Wiggins and Jones now is that they make AI feel older without making it feel harmless. Today's systems inherit centuries of measurement politics, but they also add scale, speed, personalization, and automation. The lesson is not nostalgia for a pre-data world. It is discipline: every machine-readable reality has authors, funders, categories, blind spots, and beneficiaries. Governance starts by making those arrangements visible before they harden into common sense.
Related Pages
- All Data Are Local, Data Feminism, and Sorting Things Out extend the book's argument into data settings, power, absence, and category maintenance.
- Seeing Like a State, The Seductions of Quantification, The Tyranny of Metrics, and Weapons of Math Destruction show how legibility, indicators, metrics, and scores become administrative authority.
- Automating Inequality, The Black Box Society, and Algorithms of Oppression track how machine-readable records become welfare, credit, policing, search, and platform power.
- The Data Sheet Becomes the Supply Chain, Machine Learners, Training Data, AI Data Provenance, and Model Cards and System Cards turn data history into documentation and procurement controls.
- The Vector Database Becomes Institutional Memory, The Enterprise Connector Becomes the Permission Map, and The Agent Log Becomes the Receipt show what happens after records become live infrastructure for retrieval and delegated action.
- AI Governance, AI System Inventory, AI Bill of Materials, Algorithmic Transparency, Algorithmic Impact Assessments, Algorithmic Recourse, Contextual Integrity, and Data Minimization give the operational vocabulary for provenance, purpose, and contestability.
- Privacy and Data, Research Integrity, Vendor and Platform Governance, Transparency and Public Registers, and Claim Hygiene Protocol connect the same discipline to site-level practice: keep the record inspectable, dated, and answerable.
Sources
- W. W. Norton UK, How Data Happened, publisher description, hardback ISBN, ebook ISBN, and page count, reviewed June 25, 2026.
- Google Books, How Data Happened: A History from the Age of Reason to the Age of Algorithms, publication date, publisher, subject area, and page count, reviewed June 25, 2026.
- Princeton University Department of History, How Data Happened: A History from the Age of Reason to the Age of Algorithms, publication details, ISBN, publisher, subject areas, and publisher description, reviewed June 25, 2026.
- Princeton University Department of History, Matthew L. Jones faculty profile, academic biography, research interests, and publication details, reviewed June 25, 2026.
- Columbia University Applied Physics and Applied Mathematics, Chris H. Wiggins faculty profile, academic biography, affiliations, and research areas, reviewed June 25, 2026.
- Columbia Business School, "How Data Happened: A History from the Age of Reason to the Age of Algorithms", August 2, 2023, account of Wiggins's talk and book themes, reviewed June 25, 2026.
- Jennifer Chapman, "Book Review: How Data Happened: A History from the Age of Reason to the Age of Algorithms", Law Library Journal 116, no. 3, 2024, via University of Virginia School of Law, reviewed June 25, 2026.
- Rachel Roca, "Featured Review: How Data Happened: A History from the Age of Reason to the Age of Algorithms", SIAM Review 67, no. 2, 2025, doi:10.1137/24M1635521, reviewed June 25, 2026.
- University College London, "UCL and its eugenics legacy", Galton and the 1883 coinage of eugenics, reviewed June 25, 2026.
- University College London, "Our Early History", Pearson, Biometrika, the Chair in Eugenics, and the founding of UCL's Department of Applied Statistics, reviewed June 25, 2026.
- European Union, Regulation (EU) 2024/1689, Artificial Intelligence Act, official EUR-Lex text, especially Articles 10, 53, and 113 on data governance, general-purpose AI model obligations, and application dates, reviewed June 25, 2026.
- European Commission, Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models, publication July 24, 2025, reviewed June 25, 2026.
- European Commission, Data Act implementation overview, applicability from September 12, 2025 and official implementation context, reviewed June 25, 2026.
- NSA Artificial Intelligence Security Center, AI Data Security guidance announcement, May 22, 2025 joint guidance on data supply-chain risks, data poisoning, data drift, provenance tracking, integrity controls, and lifecycle data security, reviewed June 25, 2026.
- National Institute of Standards and Technology, AI Risk Management Framework and AI RMF Core, Govern, Map, Measure, and Manage functions, reviewed June 25, 2026.
- National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, documentation practices for data provenance, data quality, assumptions, limitations, legal requirements, and evaluation data, reviewed June 25, 2026.
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, "Datasheets for Datasets", arXiv record and CACM publication note, reviewed June 25, 2026.
- Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson, "Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI", ACM FAccT 2022, reviewed June 25, 2026.
- Shayne Longpre et al., "A large-scale audit of dataset licensing and attribution in AI", Nature Machine Intelligence, August 2024, reviewed June 25, 2026.
- Stanford HAI, 2026 AI Index Report, top-level report and responsible-AI coverage, reviewed June 25, 2026.
- Stanford HAI, "Transparency in AI is on the Decline", December 2025 account of the Foundation Model Transparency Index score drop and disclosure gaps around training data, compute, use, and societal impact, reviewed June 25, 2026.