"Raw Data" Is an Oxymoron and the Dataset Myth
Lisa Gitelman's edited volume "Raw Data" Is an Oxymoron gives AI criticism a basic discipline: before debating what a model knows, inspect how the data became data.
The Book
"Raw Data" Is an Oxymoron was published by The MIT Press in 2013 as part of the Infrastructures series. MIT Press lists Lisa Gitelman as editor, the paperback ISBN as 9780262518284, the ebook ISBN as 9780262312325, and the book at 208 pages. Amazon lists the paperback product at ISBN-10 0262518287 and ISBN-13 978-0262518284. NYU Steinhardt's profile for Gitelman identifies her as editor of the collection and as a media historian whose work includes Always Already New.
The volume brings together essays on the history of data: the meaning of "data" before modern databases, financial modeling, astronomical records, newspaper clippings, scientific curation, online tracking, and other cases where supposedly given facts turn out to have been generated, formatted, cleaned, interpreted, and stabilized. It is not a book about neural networks. It is more useful than that: it is a book about the conditions that make datasets look natural.
Raw Is a Claim
The title is the argument. "Raw" data is never simply found. It is collected by someone, for a purpose, with instruments, categories, omissions, thresholds, and expectations already in place. Even when a record feels mechanical, the machine was situated inside a culture of measurement and use. A datum is not a little piece of reality that escaped mediation. It is reality after a capture practice has given it a form.
That point matters because the word "raw" often launders authority. It makes a dataset appear prior to interpretation, prior to politics, and prior to accountability. Once data has been described as raw, later analysis can look like the first human intervention rather than one more step in a longer chain of decisions.
The AI Reading
Read in June 2026, the book is an AI-governance text. Training data, evaluation sets, benchmark questions, red-team transcripts, user logs, preference rankings, embeddings, and retrieval corpora are all made things. They carry collection choices, licensing assumptions, language distributions, annotation rules, platform affordances, moderation histories, and gaps. A model trained on them inherits those arrangements without necessarily exposing them.
This is why dataset critique is not a side issue. A model's failure may begin before architecture, before fine-tuning, before deployment, before the prompt. It may begin when a world was made machine-readable in a particular way. The archive becomes the ontology. The category becomes the possible answer. The missing record becomes the missing person.
The Dataveillance Problem
Publisher and bibliographic descriptions place contemporary "dataveillance" beside older histories of records and curation. That connection is essential. Surveillance data is not just observed behavior. It is behavior captured through designed systems: clicks, likes, movements, forms, purchases, scores, images, texts, timestamps, and device traces. Those records are then repurposed into markets, security decisions, recommender systems, fraud models, and AI training pipelines.
The danger is not only that data can be wrong. It is that data can be socially successful while remaining partial. A platform may know enough to target, sort, and predict without knowing enough to understand context, consent, coercion, irony, vulnerability, or refusal. That asymmetry is the moral center of the dataset myth: institutions can act on data long before the data deserves the authority it receives.
Governance of Data Making
NIST's AI Risk Management Framework describes AI risk management across design, development, use, and evaluation. NIST's Generative AI Profile specifically points organizations toward documenting proposed uses, assumptions, limitations, data collection methodologies, data provenance, data quality, model details, training approaches, evaluation data, legal requirements, and ethical considerations. Gitelman's volume gives the historical reason for that documentation burden: without a record of data making, there is no serious account of what the system is doing.
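One way to make that documentation burden concrete is a small record type that fails loudly when parts of the data-making story are missing. The sketch below is illustrative only: the class name, field names, and example dataset are assumptions chosen to mirror the items listed above, not a schema published by NIST or proposed in Gitelman's volume.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetRecord:
    """Hypothetical documentation record; field names are illustrative, not a NIST schema."""
    name: str
    proposed_uses: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    limitations: list[str] = field(default_factory=list)
    collection_methodology: str = ""
    provenance: str = ""
    quality_notes: str = ""
    legal_requirements: list[str] = field(default_factory=list)
    ethical_considerations: list[str] = field(default_factory=list)


def missing_fields(record: DatasetRecord) -> list[str]:
    """Return the names of documentation fields that were left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example: a dataset documented only in part.
example = DatasetRecord(
    name="support-tickets-2024",
    proposed_uses=["intent classification"],
    collection_methodology="exported from a help-desk tool; English tickets only",
    limitations=["no tickets from phone support"],
)
print(missing_fields(example))
```

The point of the design is modest: an empty provenance field becomes a visible gap rather than a silent default, which is the kind of record the paragraph above asks for.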
For AI agents, this becomes more urgent. An agent that retrieves files, updates records, drafts decisions, or calls tools can turn a questionable dataset into action. Governance therefore has to ask not only whether the model output is plausible, but what records it relied on, how those records were made, what they omit, and who can contest them.
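Those four questions can be attached to an agent's actions as an audit entry. The following is a minimal sketch under stated assumptions: SourceRecord, review_action, the record IDs, and the contact address are all hypothetical names invented for illustration, and a real system would need far richer provenance than three text fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical description of one record an agent relied on."""
    record_id: str
    how_made: str          # how the record was produced; empty if unknown
    known_omissions: str   # what the record is known to leave out
    contest_contact: str   # who can dispute the record; empty if nobody


def review_action(action: str, sources: list[SourceRecord]) -> dict:
    """Build an audit entry; flag sources with no account of their making or no contest route."""
    problems = []
    for s in sources:
        if not s.how_made:
            problems.append(f"{s.record_id}: no account of how the record was made")
        if not s.contest_contact:
            problems.append(f"{s.record_id}: nobody listed who can contest it")
    return {
        "action": action,
        "relied_on": [s.record_id for s in sources],
        "omissions": {s.record_id: s.known_omissions for s in sources},
        "problems": problems,
        "allowed": not problems,
    }


# Made-up example: one documented record, one of unknown origin.
entry = review_action(
    "update customer risk score",
    [
        SourceRecord("crm-114", "entered by support staff", "no call notes", "records@example.org"),
        SourceRecord("scrape-7", "", "unknown", ""),
    ],
)
print(entry["allowed"])
for problem in entry["problems"]:
    print(problem)
```

Blocking the action outright is only one possible policy. The same entry could instead route the action to human review, which keeps the record of what was relied on even when the agent is allowed to proceed.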
Where the Book Needs Care
The book is an edited academic collection, not a policy manual. It does not give a single operational checklist for AI builders, procurement officers, or auditors. It also predates the current generative AI platform economy, so readers must do the translation work themselves.
That limitation is also its strength. The book slows the reader down before the usual AI debate begins. It asks for a history of the dataset, not just a metric. It asks for provenance, not just performance. It asks who produced the record, who cleaned it, who named it, who paid for it, who was captured by it, who was excluded from it, and who is governed by the system that now calls it evidence.
"Raw Data" Is an Oxymoron belongs in this archive because it makes machine intelligence less mystical and more inspectable. AI does not learn from the world in general. It learns from worlds already made into data. The politics begins there.
Sources
- The MIT Press, "Raw Data" Is an Oxymoron, publisher listing for exact title, editor Lisa Gitelman, Infrastructures series, paperback ISBN 9780262518284, ebook ISBN 9780262312325, page count, and description, reviewed June 16, 2026.
- Amazon, Raw Data Is an Oxymoron, retail listing at product path /dp/0262518287 with ISBN-10 0262518287 and ISBN-13 978-0262518284, reviewed June 16, 2026.
- New York University Steinhardt, Lisa Gitelman profile, official profile listing the edited collection and author background, reviewed June 16, 2026.
- Google Books, Raw Data Is an Oxymoron, bibliographic listing and author/editor context, reviewed June 16, 2026.
- University of Surrey Open Research, "Raw Data" Is an Oxymoron record, repository metadata for title, editor, subject, and series, reviewed June 16, 2026.
- National Institute of Standards and Technology, AI Risk Management Framework, official NIST page for AI risk management across AI design, development, use, and evaluation, reviewed June 16, 2026.
- National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, official NIST profile covering documentation of data collection methodologies, provenance, data quality, assumptions, limitations, and evaluation data, reviewed June 16, 2026.