Blog · arXiv Analysis · Last reviewed July 2, 2026

The Dashboard State Becomes the Query Contract

TwinBI is interesting because it rejects the common replacement story. It does not ask a chat agent to replace the dashboard. It makes the dashboard's current state part of the agent's contract.

The governance lesson is practical: a BI answer should be auditable against filters, charts, semantic definitions, SQL, logs, and the visible interaction path that produced it.

The Paper

The paper is "TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards" (arXiv:2606.13731), by Jisoo Jang and Wen-Syan Li of the Graduate School of Data Science at Seoul National University. arXiv lists version 1 as submitted on June 11, 2026, with primary category cs.AI and secondary category cs.MA.

The paper introduces TwinBI, a system that synchronizes an LLM agent twin with an executable BI dashboard twin. The claim is not that natural language is enough for enterprise analytics. The claim is that natural-language assistance becomes more reliable when it is grounded in the same dashboard state, semantic layer, query execution path, and provenance log that the analyst is using.

The BI State Problem

Business intelligence dashboards encode assumptions that plain chat prompts often omit: active filters, selected tabs, cross-filters, hierarchy levels, metric definitions, join paths, aggregation grains, and chart-specific context. A user can ask "why did this category increase?" after clicking through a dashboard, but that sentence is underspecified unless the system knows which chart, which date slice, which department, and which drill-down level "this" refers to.

TwinBI treats that missing context as a state-reconstruction problem. It logs dashboard operations and merges them with dialogue history, tool traces, semantic schema artifacts, and query results. That turns the dashboard state into a query contract: the agent's answer should be traceable to the analytical state the user actually reached, not merely to a plausible interpretation of the last sentence.
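The state-reconstruction idea can be sketched as a fold over an ordered interaction log. This is a minimal illustration, not TwinBI's code: the event names (`tab_change`, `filter`, `drill_down`, and so on), the `DashboardState` fields, and the sample events are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DashboardState:
    """Hypothetical snapshot of what the analyst currently sees."""
    tab: Optional[str] = None
    chart: Optional[str] = None
    filters: dict[str, Any] = field(default_factory=dict)
    drill_path: list[str] = field(default_factory=list)


def replay(events: list[dict[str, Any]]) -> DashboardState:
    """Fold an ordered interaction log into the current dashboard state."""
    state = DashboardState()
    for event in events:
        kind = event["type"]
        if kind == "tab_change":
            state.tab = event["tab"]
        elif kind == "select_chart":
            state.chart = event["chart"]
        elif kind == "filter":
            state.filters[event["field"]] = event["value"]
        elif kind == "clear_filter":
            state.filters.pop(event["field"], None)
        elif kind == "drill_down":
            state.drill_path.append(event["level"])
    return state


# Hypothetical log: the analyst clicks through before asking a question.
events = [
    {"type": "tab_change", "tab": "Sales"},
    {"type": "filter", "field": "district", "value": "North"},
    {"type": "select_chart", "chart": "category_revenue"},
    {"type": "drill_down", "level": "category"},
    {"type": "filter", "field": "quarter", "value": "2024-Q4"},
]
state = replay(events)

# The underspecified question is paired with the reconstructed state.
context = {"question": "why did this category increase?", "state": state}
```

With the state attached, "this category" resolves to a specific chart, district, quarter, and drill level instead of a guess based on the last sentence alone.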

Architecture

The implementation has five layers. The Presentation Layer combines a Streamlit chat interface with embedded Apache Superset dashboards, allowing users to switch between natural-language interaction and direct dashboard manipulation. The Orchestration Layer is implemented with FastAPI and assembles dialogue history, dashboard interaction logs, tool outputs, and task context into a unified analytical context.

The Semantic Layer uses Cube to define measures, dimensions, hierarchies, and valid join paths, so dashboard queries and conversational answers share the same business semantics. The BI Tool Layer uses Apache Superset to render visualizations and capture interaction events, including filtering, tab changes, cross-filtering, drill-down operations, and series toggles. The Data Layer stores analytical datasets in DuckDB and exposes them through the semantic layer.

This stack matters because it gives the agent more than pixels. It gives the agent a replayable account of what the dashboard means, what the user did, what query was run, and which results are currently in scope.
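One way to picture the shared semantic layer is as a gate that both dashboard queries and agent-proposed queries must pass. The sketch below is a toy model in plain Python and does not use Cube's actual API. The measure names, the `sales` fact table, and the `validate` helper are hypothetical; only the product, store, and date dimensions come from the paper's benchmark description.

```python
from dataclasses import dataclass


@dataclass
class SemanticModel:
    """Toy stand-in for a semantic layer (not Cube's real schema format)."""
    fact_table: str
    measures: set[str]
    dimensions: dict[str, str]      # dimension name -> owning table
    joins: set[tuple[str, str]]     # allowed (fact_table, dimension_table) pairs


def validate(model: SemanticModel, measures: list[str], dimensions: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the request is well-formed."""
    problems = []
    for m in measures:
        if m not in model.measures:
            problems.append(f"unknown measure: {m}")
    for d in dimensions:
        table = model.dimensions.get(d)
        if table is None:
            problems.append(f"unknown dimension: {d}")
        elif table != model.fact_table and (model.fact_table, table) not in model.joins:
            problems.append(f"no join path to dimension: {d}")
    return problems


model = SemanticModel(
    fact_table="sales",
    measures={"total_sales", "units_sold"},  # hypothetical measure names
    dimensions={
        "product.category": "product",
        "store.district": "store",
        "date.quarter": "date",
    },
    joins={("sales", "product"), ("sales", "store"), ("sales", "date")},
)

ok = validate(model, ["total_sales"], ["store.district"])
bad = validate(model, ["margin"], ["store.district"])
```

The design point is that a chat answer and a chart query are checked against the same definitions, so a mismatch shows up as a named problem rather than a silently different number.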

User-Facing Functions

TwinBI supports both dashboard-centric and conversational interaction. In dashboard-centric interaction, users navigate charts through filters, cross-filtering, drill-downs, tab switches, and series toggles. In conversational interaction, users can ask for new charts or follow-up interpretations without restating the entire schema and filter state.

The system exposes three inspection artifacts: a Hierarchy Schema Graph, SQL for chart queries, and the unified interaction log. These artifacts are not decoration. They give users a way to decide whether an error came from field selection, filter carry-over, aggregation, join semantics, or answer generation.

TwinBI also provides a dedicated /insights command. The backend builds a compact context from recent conversations, tool traces, the active chart, and current filters. A specialized insight agent then returns a state-aware summary of the current slice, visible quantitative observations, and sensible next checks. The command is intentionally constrained to the evidence in the current analytical state.
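A compact context of this kind might be assembled as below. The function name, the dictionary keys, and the truncation limits are assumptions for illustration. The paper says the context draws on recent conversations, tool traces, the active chart, and current filters, but this is not its implementation.

```python
from typing import Any


def build_insight_context(
    turns: list[str],
    tool_traces: list[dict[str, Any]],
    active_chart: str,
    filters: dict[str, Any],
    max_turns: int = 3,
    max_traces: int = 3,
) -> dict[str, Any]:
    """Keep only recent evidence so the insight agent stays on the current slice."""
    return {
        "recent_turns": turns[-max_turns:],
        "recent_tool_traces": tool_traces[-max_traces:],
        "active_chart": active_chart,
        "filters": dict(filters),  # copy so later clicks do not mutate the context
    }


ctx = build_insight_context(
    turns=["show North district", "drill into stores", "compare Q3 and Q4", "/insights"],
    tool_traces=[{"tool": "run_query", "rows": 12}],  # hypothetical trace entry
    active_chart="store_daily_sales",                 # hypothetical chart id
    filters={"district": "North"},
    max_turns=2,
)
```

Bounding the context is what keeps the command constrained: anything outside the recent turns, traces, chart, and filters is simply not available to the insight agent.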

A/B Benchmark

The controlled benchmark compares a dashboard-only browser agent against TwinBI on the same retail sales dashboard environment. Both conditions use a Playwright-based browser agent with gpt-5-mini as the decision-making model and a maximum budget of 30 interaction steps. The dashboard uses a shared semantic model with product, store, and date as primary dimensions.

The benchmark contains 30 analytical queries across five task families: store and district ranking, premium product analysis, quarter-over-quarter growth analysis, comparison and aggregation across dashboard views, and robustness or trap tasks testing policy compliance and filter stability. Target answers were resolved through three independent paths: direct database queries, Cube API queries, and dashboard-level queries, followed by self-consistency checks and manual verification.
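The three-path resolution can be read as a simple agreement rule: accept a target answer automatically only when every path returns the same value, and route disagreements to manual review. The sketch below assumes numeric answers and a relative tolerance. The path labels, the tolerance, and the sample values are hypothetical, and the authors' actual scripts may work differently.

```python
import math
from typing import Optional


def resolve_gold(answers: dict[str, float], rel_tol: float = 1e-6) -> Optional[float]:
    """Return the agreed value if all paths match, else None to flag manual review."""
    values = list(answers.values())
    first = values[0]
    if all(math.isclose(first, v, rel_tol=rel_tol) for v in values[1:]):
        return first
    return None


# Hypothetical values for one query, one per resolution path.
agreed = resolve_gold({"database": 1250.0, "cube_api": 1250.0, "dashboard": 1250.0})
disputed = resolve_gold({"database": 1250.0, "cube_api": 1250.0, "dashboard": 1249.0})
```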

The main numbers favor the state-grounded interface. Dashboard-only exact-match accuracy is 43.33%, partial-credit accuracy is 48.33%, and average steps are 16.47. TwinBI exact-match accuracy is 63.33%, partial-credit accuracy is 70.83%, and average steps are 6.90. Timeout rate drops from 40.00% to 10.00%, and invalid action rate drops from 10.93% to 0.00%.
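A quick arithmetic check makes the gap concrete. The percentages and step counts below are the paper's reported values. The implied query counts are an inference that assumes each rate is a fraction of the 30 benchmark queries, which the paper does not state explicitly.

```python
N_QUERIES = 30

baseline = {"exact": 43.33, "steps": 16.47, "timeout": 40.00}
twinbi = {"exact": 63.33, "steps": 6.90, "timeout": 10.00}


def implied_count(rate_pct: float, n: int = N_QUERIES) -> int:
    """Convert a percentage into a query count, assuming the denominator is n."""
    return round(rate_pct / 100 * n)


exact_counts = (implied_count(baseline["exact"]), implied_count(twinbi["exact"]))
timeout_counts = (implied_count(baseline["timeout"]), implied_count(twinbi["timeout"]))

# Absolute gain in exact-match accuracy, in percentage points.
exact_gain_points = round(twinbi["exact"] - baseline["exact"], 2)

# Relative reduction in average steps per query.
step_reduction_pct = (baseline["steps"] - twinbi["steps"]) / baseline["steps"] * 100

print(exact_counts, timeout_counts, exact_gain_points, round(step_reduction_pct, 1))
```

Under that assumption, exact matches go from 13 to 19 of 30 queries, timeouts fall from 12 to 3, and the average step count drops by roughly 58%.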

Not every failure mode improves. TwinBI reduces the loop-query rate from 36.67% to 27.59%, but its loop-step rate is higher, 39.13% versus 29.76% for the dashboard-only baseline, because some remaining failures shift into repeated chat-centered reasoning. That is a useful warning: chat can replace hover-probing failure with prompt-loop failure unless the state contract also constrains repeated attempts.
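The two loop metrics have different denominators, which is why they can move in opposite directions. The sketch below uses an assumed definition, since the paper's exact loop criterion is not reproduced here: a step counts as a loop step if it repeats an action seen in the previous few steps, and a query counts as a loop query if it contains at least one loop step. The action labels are hypothetical.

```python
def loop_steps(actions: list[str], window: int = 3) -> int:
    """Count steps that repeat an action seen in the previous `window` steps."""
    count = 0
    for i, action in enumerate(actions):
        if action in actions[max(0, i - window):i]:
            count += 1
    return count


def loop_rates(runs: list[list[str]], window: int = 3) -> tuple[float, float]:
    """Return (loop-query rate, loop-step rate) over a set of runs."""
    per_run = [loop_steps(run, window) for run in runs]
    total_steps = sum(len(run) for run in runs)
    loop_query_rate = sum(1 for c in per_run if c > 0) / len(runs)
    loop_step_rate = sum(per_run) / total_steps if total_steps else 0.0
    return loop_query_rate, loop_step_rate


# Hypothetical traces: one run repeats a hover, the other does not loop.
runs = [["hover", "hover", "click"], ["filter", "ask"]]
query_rate, step_rate = loop_rates(runs)
```

The loop-query rate divides by the number of queries, while the loop-step rate divides by total steps. A detector along these lines is also the kind of check a state contract could use to cap repeated attempts.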

Usability Study

The usability study uses a within-subjects design with five participants and three scenarios. S1 asks for the top-performing North district store by average daily sales while excluding stores with fewer than 15 active days per month. S2 asks for product types whose average revenue per unit exceeds the portfolio average. S3 asks for categories with at least 15% quarter-over-quarter unit-sales growth between Q3 beginning on 2024-07-01 and Q4 beginning on 2024-10-01.

The scenario results are mixed but informative. Task accuracy is 100% for S1, 73.33% for S2, and 100% for S3. Insight accuracy is 80%, 100%, and 80%, respectively. Across S1, S2, and S3, average dashboard clicks climb from 6.4 to 34 to 49, average chat turns move from 0.6 to 6 to 5.2, and perceived difficulty rises from 1.8 to 3.4 to 4.2 on a five-point scale.

Participants did not simply abandon dashboards for chat. The paper reports that they usually used the dashboard to establish context first, then used chat for comparison, threshold checking, or explanation. Feature rankings favored Clickable Dashboard, Finding Charts via Agent, and Click+Chat over chat-only interaction and direct SQL inspection. Kendall's W = 0.62 with p < 0.01 indicates fairly stable inter-participant agreement, although the sample is small.
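For readers unfamiliar with Kendall's W, the statistic measures how closely several raters agree on a ranking: 0 means no agreement and 1 means identical rankings. The function below implements the standard no-ties formula, W = 12S / (m²(n³ − n)), where m is the number of raters, n the number of items, and S the sum of squared deviations of the rank totals from their mean. The rankings shown are toy data, not the study's, and whether the authors applied a tie correction is not stated here.

```python
def kendalls_w(rankings: list[list[int]]) -> float:
    """Kendall's coefficient of concordance for complete rankings without ties.

    rankings[r][i] is the rank (1..n) that rater r gives to item i.
    """
    m = len(rankings)          # number of raters
    n = len(rankings[0])       # number of items
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy data only: two raters in full agreement, then two in full disagreement.
w_agree = kendalls_w([[1, 2, 3], [1, 2, 3]])
w_oppose = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

A value of 0.62 sits between those extremes, which matches the reading of fairly stable but not unanimous agreement.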

Artifacts

The paper links a public repository at simonjisu/TwinBI. GitHub reports it as a public Python repository with no explicit GitHub license metadata. The top-level repository includes data, fastapi, streamlit-app, src, experiments, Docker compose files for FastAPI, sales services, Streamlit, and Superset, plus pyproject.toml.

The README describes a synthetic sales dataset, Superset dashboards, a FastAPI backend, a Streamlit frontend, Playwright-based experiment runners, benchmark queries, gold answers, and evaluation scripts. The experiment README lists query_01.txt through query_30.txt, answer files, answers.json, dashboard-only runner vision_playwright_strict.py, TwinBI runner vision_playwright_strict2.py, and batch runner run_query_batch.py. The default endpoints are a Superset dashboard at port 8088 and a Streamlit app at port 8501.

The dependency receipt is substantial. The project requires Python 3.12, uv, DuckDB, Docker, docker-compose, Playwright Chromium, Superset, Cube, FastAPI, Streamlit, and an OPENAI_API_KEY. The pyproject.toml lists dependencies including fastapi, dbt-duckdb, duckdb, langchain, langgraph, openai, openai-agents, pandas, playwright, plotly, streamlit, sqlglot, torch, transformers, and vllm.

Governance Standard

An agentic BI system should ship an answer-state receipt. The receipt should include the user task, active dashboard identifier, selected tab, chart identifier, filters, cross-filters, drill-down state, visible chart data, semantic model version, measure definitions, dimensions, join path, generated SQL, execution backend, query result, chat prompt, tool calls, action log, screenshot trace, final answer, gold answer if applicable, evaluation metric, timeout status, invalid-action status, loop detector result, model name, step budget, repository revision, and license status.

That receipt is what separates BI augmentation from BI theater. A fluent answer is not enough. A business answer should be bound to the state that made it true.
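The answer-state receipt described above could be carried as a plain record that is serialized next to each answer. The sketch covers only a subset of the listed fields. The class name, field names, sample values, and the `missing_fields` check are illustrative assumptions, not an artifact from the paper or its repository.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class AnswerStateReceipt:
    """Hypothetical receipt binding an answer to the state that produced it."""
    user_task: str
    dashboard_id: str
    selected_tab: str
    chart_id: str
    filters: dict[str, Any]
    semantic_model_version: str
    generated_sql: str
    final_answer: str
    model_name: str
    step_budget: int
    timed_out: bool = False
    invalid_actions: int = 0
    loop_detected: bool = False
    repository_revision: Optional[str] = None
    license_status: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """List fields left empty, so an incomplete receipt can be rejected."""
        return [k for k, v in asdict(self).items() if v is None or v == ""]


receipt = AnswerStateReceipt(
    user_task="Top North district store by average daily sales",
    dashboard_id="retail-sales",          # hypothetical identifier
    selected_tab="Stores",                # hypothetical tab name
    chart_id="store_daily_sales",         # hypothetical chart id
    filters={"district": "North"},
    semantic_model_version="v1",          # hypothetical version label
    generated_sql="SELECT ...",           # placeholder, not a real query
    final_answer="(answer text)",
    model_name="gpt-5-mini",
    step_budget=30,
)

serialized = json.dumps(asdict(receipt), indent=2)
gaps = receipt.missing_fields()
```

Here the check reports `repository_revision` and `license_status` as unset, which is the kind of gap a governance gate would want surfaced before the answer ships.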

This connects directly to The Context Dashboard Becomes the Agent's Proprioception, The Action Log Becomes the Workflow Lens, The Agent Log Becomes the Receipt, The Agent Trace Becomes the Process Map, The Agentic Browser Becomes the Assistive Interface, The AI Audit Becomes the Compliance Interface, The Boss Becomes the Dashboard, The Agentic Data Scientist Becomes the Lab Assistant, The Reliability Scorecard Becomes the Agent Gate, and Provenance and Content Credentials.

Limits

The A/B benchmark has 30 queries in one retail sales dashboard environment. It is useful as a controlled comparison but not enough to establish robustness across messy enterprise dashboards, multiple BI tools, larger semantic models, or real analyst populations.

The usability study has five participants. The paper appropriately treats it as evidence that the combined workflow is usable for moderately complex tasks, not as proof of broad efficiency gains. The authors list future work on larger datasets, more diverse users, better chart grounding and value extraction, state transfer across dashboards, and stronger support for agentic decision workflows.

There is also an operational caveat: reproducing the system is non-trivial. It spans Superset, Cube, DuckDB, FastAPI, Streamlit, Docker, Playwright, model API access, dashboard embedding configuration, guest-token settings, and generated BI data. That is a good artifact surface for research, but a production governance surface must treat every one of those layers as part of the answer chain.

Sources

Jisoo Jang and Wen-Syan Li, TwinBI: An Agentic Digital Twin for Efficient Augmented Interactions with Business Intelligence Dashboards, arXiv:2606.13731 [cs.AI, cs.MA], submitted June 11, 2026.
arXiv HTML: TwinBI, reviewed for abstract, architecture, functionality, experiment design, benchmark metrics, usability study, related work, and conclusion.
arXiv PDF: TwinBI, reviewed for authorship, affiliation, exact benchmark values, usability-study values, system layers, interface details, and repository link.
arXiv TeX source: e-print source for arXiv:2606.13731, reviewed for source-level table values, methodology wording, interface functions, and reference details.
Code and dataset repository: simonjisu/TwinBI, reviewed for README, top-level contents, experiment README, benchmark query files, answer artifacts, runners, Docker setup, dependency metadata, public repository status, and missing explicit license metadata.
