Wiki · Concept · Last reviewed June 25, 2026

MLflow

MLflow is an open source platform for managing AI and machine-learning lifecycle evidence: experiments, runs, parameters, metrics, artifacts, model packages, registry records, evaluations, and traces.

Definition

MLflow is an open source AI engineering and machine-learning lifecycle platform. Its documentation separates two overlapping use cases: LLM and agent work, which covers tracing, evaluation, prompt management, and observability; and classical machine learning, which covers experiment tracking, model packaging, registry management, and deployment. The upstream repository presents it as infrastructure for agents, LLMs, and ML models rather than as a model or agent itself.

In governance terms, MLflow is evidence infrastructure. It records what ran, with which code and settings, what artifacts were produced, how a model was packaged, what version entered a registry, and how outputs or agent steps were evaluated. That record can support audits, incident review, release decisions, and reproducibility when teams define what must be logged and who may see it.

How It Works

MLflow Tracking provides APIs and a UI for logging parameters, code versions, metrics, and output files. The tracking model is organized around runs. Each run records metadata (parameters, metrics, start and end times) together with artifacts such as model weights or output files. Experiments group runs and models for a task, and a tracking server can back team workflows with shared metadata and artifact access.

MLflow Models defines a standard packaging format for machine-learning models. A model directory includes an MLmodel file and can include serialized objects, dependency files, examples, and one or more flavors. The flavor convention lets downstream tools load or serve a model without needing direct support for each training library.

The MLflow Model Registry is a centralized store with APIs and a UI for registered models and model versions. Its documentation describes lineage back to the experiment and run that produced a model, versioning, aliases, tags, annotations, and descriptions. These fields matter because model release is a chain of evidence about which artifact is current, why it changed, and what validation state it carries.

Evaluation and tracing add runtime evidence. The classic evaluation documentation covers mlflow.models.evaluate() for classification and regression tasks, while GenAI evaluation uses mlflow.genai.evaluate() and scorer objects. MLflow Tracing records inputs, outputs, and metadata for intermediate steps in LLM and agent requests. Its trace concepts describe traces as composed of metadata and span payloads, with OpenTelemetry compatibility for export and interoperability.

Agent Context

MLflow matters for agents because agentic systems multiply small decisions. A single task may involve a prompt version, retrieved documents, tool calls, model routing, evaluator output, human feedback, cost and latency data, and a final side effect. A plain chat transcript cannot explain that workflow. A structured tracking and tracing layer can keep the model call, retrieval step, tool action, evaluator, and registered artifact separate enough to inspect.

The same capability creates risk. Agent traces can contain private prompts, proprietary documents, credentials, customer records, and sensitive tool outputs. MLflow tracing documentation includes redaction, sampling, async logging, and OpenTelemetry export features, but governance has to decide retention and access. A debugging trace can become a surveillance archive if it is broadly searchable by default.
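
To make the redaction concern concrete, here is a small sketch of masking a payload before it is written to any trace store. It is not MLflow's redaction API: the redact function, the SENSITIVE_KEYS set, and the email pattern are hypothetical, and a real policy would need a reviewed list of fields and patterns.

```python
import re

# Hypothetical policy: key names whose values must never be stored.
SENSITIVE_KEYS = {"password", "api_key", "authorization", "ssn"}
# Deliberately simple email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(payload):
    """Return a redacted copy of a nested dict/list/str payload."""
    if isinstance(payload, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    if isinstance(payload, str):
        return EMAIL_PATTERN.sub("[EMAIL]", payload)
    # Numbers, booleans, and None pass through unchanged.
    return payload


span_inputs = {
    "query": "Refund status for jane.doe@example.com",
    "api_key": "sk-test-123",
    "tool_args": [{"customer": "jane.doe@example.com", "amount": 42}],
}
redacted = redact(span_inputs)
print(redacted)
```

Redaction of this kind reduces what a trace exposes, but it does not settle who may search the trace store or how long records are kept.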

Governance Use

A useful MLflow governance file should record experiment IDs, run IDs, code commits, dependency files, dataset identifiers, input snapshots, prompt versions, model URIs, model versions, aliases, tags, evaluation configurations, scores, trace sampling rules, redaction settings, approvers, incident links, retention class, and access-control owner.
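
One way to keep such a file honest is to treat it as a typed record and check it for gaps before release. The sketch below uses only the Python standard library. GovernanceRecord, its subset of fields, and missing_fields are hypothetical names chosen for illustration, not part of MLflow.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class GovernanceRecord:
    # Illustrative subset of the fields listed above; all values are strings.
    experiment_id: Optional[str] = None
    run_id: Optional[str] = None
    code_commit: Optional[str] = None
    dataset_id: Optional[str] = None
    prompt_version: Optional[str] = None
    model_uri: Optional[str] = None
    evaluation_config: Optional[str] = None
    approver: Optional[str] = None
    retention_class: Optional[str] = None
    access_owner: Optional[str] = None


def missing_fields(record):
    """List field names that are unset or empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = GovernanceRecord(
    experiment_id="12",
    run_id="a1b2c3",
    code_commit="abc1234",
    model_uri="models:/churn-classifier/3",  # placeholder registry URI
    approver="release-board",
)
gaps = missing_fields(record)
print(gaps)
```

A release gate could refuse promotion while the list of gaps is non-empty; which fields are mandatory remains a policy decision.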

MLflow should sit beside an AI system inventory, AI audit trails, AI Bill of Materials, model cards, data provenance records, and incident procedures. The platform can preserve run and model evidence; the institution still has to decide what evidence is required before release, what failures block deployment, and how affected people can challenge a consequential output.

Limits

MLflow is not a safety certification, legal compliance guarantee, data-provenance system, or human oversight program by itself. A model registry can show versions and aliases without proving the training data was lawful. An evaluation run can log metrics without proving that the metric represents the real deployment population. A trace can show tool calls without proving that the tool was authorized to act.

Its evidence can also be incomplete. If teams omit dataset identifiers, fail to log prompt versions, overwrite tags casually, or redact fields without preserving references, the dashboard may look orderly while the audit trail is weak. MLflow makes lifecycle memory possible; it does not make memory honest.

Source Discipline

Use MLflow documentation and the upstream repository for claims about tracking, models, registries, evaluation, tracing, dependencies, and supported APIs. Use managed-service documentation only for that managed environment. For audits, cite exact MLflow versions, run or trace IDs, model registry entries, and artifact locations rather than saying a model was "tracked in MLflow."
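
As a sketch of what an exact citation could look like, the helper below assembles one line from the identifiers named above. The function name, argument names, and sample values are hypothetical; the models:/ and runs:/ prefixes follow MLflow's URI conventions for registry entries and run artifacts.

```python
def audit_citation(*, mlflow_version, tracking_uri, run_id,
                   model_name, model_version, artifact_path):
    """Build a one-line evidence citation from exact identifiers."""
    return (
        f"MLflow {mlflow_version}; "
        f"tracking server {tracking_uri}; "
        f"run {run_id}; "
        f"registry entry models:/{model_name}/{model_version}; "
        f"artifact runs:/{run_id}/{artifact_path}"
    )


# All values below are placeholders, not real identifiers.
citation = audit_citation(
    mlflow_version="3.1.0",
    tracking_uri="https://mlflow.example.org",
    run_id="a1b2c3",
    model_name="churn-classifier",
    model_version=3,
    artifact_path="model",
)
print(citation)
```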

Spiralist Reading

Spiralism reads MLflow as a ledger of experiments becoming institutions.

The notebook run, the artifact, the registry alias, the evaluator score, and the trace are all attempts to keep a machine act from vanishing into process. The danger is mistaking recorded motion for accountable motion. The discipline is to make the record specific enough that someone can contest it.
