Last reviewed July 2, 2026

The Similarity Threshold Becomes the Query Contract

This paper turns tabular embedding similarity from a vague nearest-neighbor score into a thresholded query contract. The important move is not only better row retrieval; it is knowing when the right answer is no row at all.

The Paper

The paper is "Hyperdimensional computing for structured querying on tabular data embeddings," arXiv:2606.13871 [cs.AI, cs.DB], by Sebastián Bugedo and Stijn Vansummeren of UHasselt, DSI, Diepenbeek, Belgium. arXiv lists version 1 as submitted on June 11, 2026, with DOI 10.48550/arXiv.2606.13871. The arXiv page lists 15 pages with appendices, 8 figures, and "Under review."

The arXiv HTML includes a reproducibility note linking to UHasselt-DSI-Data-Systems-Lab/code-hdc-for-tabular-data. The repository is public, GPL-3.0 licensed, mostly Jupyter Notebook, and contains data folders, query sets, scripts, notebooks, outputs, requirements.txt, and a README. The README recommends Python 3.9+, and the requirements include faiss-cpu, gensim, scikit-learn, numpy, pandas, torch, and torch-hd. The README states that EmbDI embedding generation is not included as part of the code.

The Score Problem

Tabular embeddings are useful because they let systems retrieve nearby rows, columns, cells, or whole tables. They show up in data profiling, entity annotation and resolution, schema matching, column type detection, table search, and RAG-like data workflows.

The practical weakness is calibration. A nearest-neighbor score such as 0.72 does not tell an operator whether the row is a true match or merely the closest wrong row in a corpus with no valid answer. Top-k retrieval always returns something when k is at least 1, so it cannot represent a zero-match query. That is a governance problem: the system cannot distinguish "found a match" from "found the least bad candidate."
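The contrast between the two answer rules can be sketched in a few lines. This is a minimal illustration, not code from the paper: the scores, the 0.9 cutoff, and the function names `top_k` and `above_threshold` are all made up for the example.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest scores; never empty when k >= 1."""
    order = np.argsort(scores)[::-1]
    return order[:k].tolist()

def above_threshold(scores, tau):
    """Indices of every score at or above tau; may be empty."""
    return np.flatnonzero(scores >= tau).tolist()

# Hypothetical similarities for a query that has no valid match in the table.
scores = np.array([0.72, 0.55, 0.31, 0.64])

print(top_k(scores, 1))              # still returns the closest wrong row
print(above_threshold(scores, 0.9))  # can return nothing at all
```

Only the threshold rule has a way to return an empty result, which is the property the rest of the analysis depends on.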

The paper narrows the task to row embeddings and structured select-project queries. Row retrieval corresponds to selection: find rows satisfying equality or non-equality predicates. Attribute projection corresponds to recovering the value for an attribute from an embedded row.

HDC as Query Algebra

The method uses HyperDimensional Computing, specifically Holographic Reduced Representations, or HRR. HDC encodes structured information with high-dimensional vectors and simple operations: binding, bundling, and unbinding. In this paper, a row is encoded as a bundle of bound attribute-value pairs, and a predicate is encoded in the same vector algebra.
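A minimal NumPy sketch of the three HRR operations follows. It assumes the textbook HRR choices (atomic vectors with i.i.d. N(0, 1/d) entries, binding by circular convolution, bundling by addition, unbinding by circular correlation); the paper's repository depends on torch-hd and may differ in detail. The attribute names and values are illustrative only.

```python
import numpy as np

def random_vector(d, rng):
    """Atomic vector with i.i.d. N(0, 1/d) entries, so its norm is close to 1."""
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

def bind(a, b):
    """Binding as circular convolution, computed through the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.size)

def unbind(c, a):
    """Approximate unbinding as circular correlation with a."""
    return np.fft.irfft(np.fft.rfft(c) * np.conj(np.fft.rfft(a)), n=a.size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 1024
# Hypothetical three-attribute row.
attrs = {k: random_vector(d, rng) for k in ("title", "year", "rating")}
values = {k: random_vector(d, rng) for k in ("Alien", "1979", "R")}

# Bundling is element-wise addition of the bound attribute-value pairs.
row = (bind(attrs["title"], values["Alien"])
       + bind(attrs["year"], values["1979"])
       + bind(attrs["rating"], values["R"]))

# A one-attribute equality predicate is encoded in the same algebra.
predicate = bind(attrs["year"], values["1979"])
row_vs_predicate = cosine(row, predicate)

# Unbinding the row with an attribute yields a noisy copy of its value.
noisy_year = unbind(row, attrs["year"])
sim_true = cosine(noisy_year, values["1979"])
sim_other = cosine(noisy_year, values["R"])
print(row_vs_predicate, sim_true, sim_other)
```

The unbound vector is only a noisy approximation of the stored value, which is why projection later compares it against candidate value vectors rather than reading it off directly.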

For equality row retrieval, the paper derives the expected similarity between an HRR row encoding and a predicate encoding. For a row with m attributes and a matching predicate with n attributes, the expected similarity converges to sqrt(n / m) as dimension grows. For non-equality predicates, the matching expected similarity converges to 0. The variance terms shrink with dimension, which is why larger HRR vectors make the thresholds sharper.
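The sqrt(n / m) limit can be checked numerically. The sketch below assumes cosine similarity, circular-convolution binding, and Gaussian atomic vectors; these are standard HRR choices rather than details confirmed by this summary, and `matching_similarity` is a name invented here. The intuition is that the n shared bound pairs contribute about n to the dot product, while the norms of the row and predicate are about sqrt(m) and sqrt(n).

```python
import numpy as np

def random_vector(d, rng):
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

def bind(a, b):
    # Circular convolution through the FFT.
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matching_similarity(m, n, d, rng):
    """Cosine between an m-attribute row and a predicate over n of its pairs."""
    pairs = [bind(random_vector(d, rng), random_vector(d, rng)) for _ in range(m)]
    row = np.sum(pairs, axis=0)
    predicate = np.sum(pairs[:n], axis=0)
    return cosine(row, predicate)

rng = np.random.default_rng(0)
m, n, d = 8, 2, 1024
sims = np.array([matching_similarity(m, n, d, rng) for _ in range(100)])

# The empirical mean should sit close to sqrt(n / m) = 0.5 for these settings.
print(sims.mean(), np.sqrt(n / m))
```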

The advised thresholds are called tau_eq(n, m, d) and tau_neq(d). They use one standard deviation from the expected value to tolerate cross-talk noise between bound pairs in bundles. The paper leaves an equivalent probability analysis for attribute projection to future work because projection is performed by unbinding and selecting the most similar value vector, not by thresholding row retrieval.
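The paper's closed-form tau_eq(n, m, d) is not reproduced here. As a stand-in, the sketch below estimates the mean and standard deviation of the matching similarity by simulation and places the cutoff one standard deviation below the mean; that direction, the cosine measure, and the name `empirical_tau` are assumptions of this example. It then scores a near-miss predicate that agrees with the row on all but one attribute value.

```python
import numpy as np

def random_vector(d, rng):
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

def bind(a, b):
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_row(m, d, rng):
    """Random attribute and value vectors for one m-attribute row."""
    attrs = [random_vector(d, rng) for _ in range(m)]
    vals = [random_vector(d, rng) for _ in range(m)]
    return attrs, vals

def encode(attrs, vals):
    """Bundle of bound attribute-value pairs."""
    return np.sum([bind(a, v) for a, v in zip(attrs, vals)], axis=0)

def empirical_tau(m, n, d, rng, trials=200):
    """Simulated mean minus one standard deviation (not the paper's formula)."""
    sims = []
    for _ in range(trials):
        attrs, vals = sample_row(m, d, rng)
        sims.append(cosine(encode(attrs, vals), encode(attrs[:n], vals[:n])))
    sims = np.array(sims)
    return float(sims.mean() - sims.std())

rng = np.random.default_rng(0)
m, n, d = 8, 2, 2048
tau = empirical_tau(m, n, d, rng)

# Near miss: same n attributes, but the last predicate value is wrong.
attrs, vals = sample_row(m, d, rng)
wrong_vals = vals[: n - 1] + [random_vector(d, rng)]
near_miss = cosine(encode(attrs, vals), encode(attrs[:n], wrong_vals))
print(tau, near_miss, near_miss >= tau)
```

With these settings the near miss is expected to land well below the cutoff, which is the mechanism behind zero-match detection.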

The baseline is EmbDI, a graph-based data-integration embedding method. EmbDI builds a tripartite graph over rows, column names, and values, samples random walks, and trains Word2Vec-style embeddings. That makes it a strong non-LLM baseline for structured tabular embeddings, but it does not provide a single theoretically derived threshold for zero-match detection.

Experiments

The authors use two real-world tabular datasets from prior EmbDI work. Movie has m_max = 15 columns and 49,875 rows. DBLP has m_max = 4 columns and 66,876 rows. They create 15 Movie tables and 4 DBLP tables by taking the first m columns for each table size, keeping duplicate rows when present.

For each table, they generate 10 equality predicates and 10 non-equality predicates for each predicate length n. They also generate 10 zero-match equality predicates for each n and table by sampling a row and modifying one attribute to a value from a different row or larger table.
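The zero-match construction can be sketched as follows. This is one plausible reading of the description, not the authors' script: the toy table, the function names, and the rejection loop that confirms no row satisfies the predicate are all assumptions of this example.

```python
import random

def matches(row, predicate):
    """True when the row satisfies every equality in the predicate."""
    return all(row[attr] == value for attr, value in predicate.items())

def zero_match_predicate(table, n, rng, max_tries=1000):
    """Sample a row, keep n attributes, and swap one value in from another row."""
    columns = list(table[0])
    for _ in range(max_tries):
        base = rng.choice(table)
        donor = rng.choice(table)
        attrs = rng.sample(columns, n)
        predicate = {attr: base[attr] for attr in attrs}
        swapped = rng.choice(attrs)
        predicate[swapped] = donor[swapped]
        # Keep the predicate only if no row in the table satisfies it.
        if not any(matches(row, predicate) for row in table):
            return predicate
    raise ValueError("no zero-match predicate found")

# Hypothetical toy table; the real experiments use the Movie and DBLP data.
table = [
    {"title": "Alien", "year": 1979, "rating": "R"},
    {"title": "Up", "year": 2009, "rating": "PG"},
    {"title": "Heat", "year": 1995, "rating": "R"},
]
predicate = zero_match_predicate(table, n=2, rng=random.Random(0))
print(predicate)
```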

EmbDI is evaluated at dimensions 300 and 512, with 500,000 random walks for Movie and 1,000,000 for DBLP, using Word2Vec with window size 3. HDC uses the HRR model at dimensions 300, 512, and 1024, with 3 runs per dimension to average over random atomic-vector generation. Row retrieval is evaluated with top-k for k in 1, 2, 5, 10, and 20, and also with threshold sweeps: 0.1 to 1.0 for equality and -0.3 to 0.2 for non-equality, step 0.05.
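The two sweep grids and a per-threshold F1 computation can be written down directly. The grids follow the ranges above; the scores, relevance labels, and the "retrieve if score >= tau" rule are illustrative assumptions for the equality case, and the retrieval direction for non-equality predicates is not specified in this summary.

```python
import numpy as np

def f1_at_threshold(scores, relevant, tau):
    """F1 when every row with score >= tau is retrieved."""
    retrieved = scores >= tau
    true_pos = int(np.sum(retrieved & relevant))
    if true_pos == 0:
        return 0.0
    precision = true_pos / int(retrieved.sum())
    recall = true_pos / int(relevant.sum())
    return 2 * precision * recall / (precision + recall)

# linspace avoids the floating-point endpoint drift of arange with step 0.05.
eq_grid = np.linspace(0.1, 1.0, 19)     # 0.10, 0.15, ..., 1.00
neq_grid = np.linspace(-0.3, 0.2, 11)   # -0.30, -0.25, ..., 0.20

# Hypothetical similarities for five rows, two of which truly match.
scores = np.array([0.91, 0.88, 0.47, 0.35, 0.12])
relevant = np.array([True, True, False, False, False])

f1_by_tau = {round(float(t), 2): f1_at_threshold(scores, relevant, t) for t in eq_grid}
best_tau = max(f1_by_tau, key=f1_by_tau.get)
print(best_tau, f1_by_tau[best_tau])
```

A best threshold found this way is tuned after the fact on labeled data, which is the role the sweep plays for the baseline; an advised threshold is fixed in advance from n, m, and d.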

For attribute projection, the authors sample 50 rows from each table, recover candidate values by unbinding the row vector with the attribute vector, and measure exact-match accuracy.
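Projection by unbinding can be sketched on synthetic data. The example assumes circular-correlation unbinding and a cosine argmax over a per-attribute codebook of value vectors; the table here is random, not the Movie or DBLP data, and `project` is a name invented for this sketch.

```python
import numpy as np

def random_vector(d, rng):
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

def bind(a, b):
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.size)

def unbind(c, a):
    return np.fft.irfft(np.fft.rfft(c) * np.conj(np.fft.rfft(a)), n=a.size)

def project(row_vec, attr_vec, codebook):
    """Index of the codebook value most similar to the unbound vector."""
    noisy = unbind(row_vec, attr_vec)
    norms = np.linalg.norm(codebook, axis=1) * np.linalg.norm(noisy)
    return int(np.argmax(codebook @ noisy / norms))

rng = np.random.default_rng(0)
d, m, n_values, n_rows = 1024, 3, 20, 50
attr_vecs = [random_vector(d, rng) for _ in range(m)]
codebooks = [np.stack([random_vector(d, rng) for _ in range(n_values)])
             for _ in range(m)]

correct = 0
for _ in range(n_rows):
    # Each synthetic row picks one value index per attribute.
    truth = rng.integers(0, n_values, size=m)
    row_vec = np.sum([bind(attr_vecs[j], codebooks[j][truth[j]])
                      for j in range(m)], axis=0)
    guesses = [project(row_vec, attr_vecs[j], codebooks[j]) for j in range(m)]
    correct += sum(int(g == t) for g, t in zip(guesses, truth))

accuracy = correct / (n_rows * m)
print(accuracy)
```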

Results

For equality row retrieval, HDC and EmbDI behave similarly in top-k retrieval for the closest rows, especially at small k. The problem is that top-k is not a general query answer rule. When the true result set has a different size from k, precision and recall move mechanically; when the predicate has zero matches, top-k still returns rows.
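The mechanical effect is easy to see with a perfect ranking. In the hypothetical example below, three rows truly match and are ranked first; the ids and the helper name are illustrative, and the 0.0 returned for an empty relevant set is a convention chosen here.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall when exactly the first k ranked rows are returned."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for row_id in retrieved if row_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked = list(range(20))   # ideal ranking: rows 0, 1, 2 come first
relevant = {0, 1, 2}       # true result set of size 3

for k in (1, 2, 5, 10, 20):
    print(k, precision_recall_at_k(ranked, relevant, k))

# A zero-match predicate still gets k rows back, all of them wrong.
print(precision_recall_at_k(ranked, set(), 5))
```

Even with a flawless ranking, recall is capped at 1/3 for k = 1 and precision falls to 0.15 for k = 20, because k rather than the predicate decides how many rows come back.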

With threshold retrieval, HRR pulls ahead of the baseline. On the Movie dataset, Table 1 reports average F1 over predicate lengths by table size. At m = 15, HRR 300 scores 0.82, HRR 512 scores 0.92, HRR 1024 scores 0.99, EmbDI 300 scores 0.76, and EmbDI 512 scores 0.82. At m = 10, HRR 300 scores 0.85, HRR 512 scores 0.97, HRR 1024 scores 0.99, EmbDI 300 scores 0.71, and EmbDI 512 scores 0.69. The authors summarize that advised HDC thresholds outperform EmbDI's average best threshold in almost every case.

Zero-match behavior is the cleanest demonstration. Table 2 reports average retrieved rows for zero-match equality predicates on the Movie dataset using tau_eq. HRR 1024 retrieves 0.00 rows from m = 1 through m = 12, then 0.01 at m = 13, 0.06 at m = 14, and 0.40 at m = 15. The weaker HRR 300 setting retrieves 101.99 rows at m = 15. That is small relative to 49,875 rows but still far from zero, so the higher-dimensional result is the one that supports reliable zero-match decisions.

For non-equality predicates, the paper reports that EmbDI has very low performance except for short predicates on Movie, while HDC remains consistent across table sizes and predicate lengths, with results comparable to equality retrieval. The proposed non-equality thresholds align with observed F1 peaks, and the mean threshold tau_mean = 0 marks where recall begins to decrease.

For attribute projection, Table 3 shows that HRR 512 is already enough to reach 1.00 overall accuracy on DBLP and 0.93 overall accuracy on Movie, while HRR 1024 reaches 1.00 overall on both datasets. Movie is harder at low dimension: HRR 300 scores 0.50 overall, below EmbDI 300 at 0.78. But EmbDI is inconsistent across columns, especially low-cardinality fields such as rating, where EmbDI 300 scores 0.30 and EmbDI 512 scores 0.08, while HRR 512 reaches 0.98 and HRR 1024 reaches 1.00.

Governance Standard

An HDC tabular retrieval system should ship a threshold receipt. The receipt should include the source table, row count, column count, schema subset, duplicate-row policy, row encoder, atomic-vector initialization, HRR dimension, random seed, binding operation, bundling rule, normalization rule, predicate type, predicate length, selected threshold formula, concrete threshold value, nearest-neighbor index, top-k fallback if used, zero-match rule, retrieved-row count, precision/recall/F1 evaluation, attribute projection rule, code version, data version, and whether EmbDI or another baseline was regenerated or imported.
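One way to make such a receipt concrete is a small serializable record. The sketch below covers only a subset of the fields listed above; the class name, field names, and every value other than the Movie row and column counts are hypothetical placeholders, not artifacts from the paper or its repository.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ThresholdReceipt:
    source_table: str
    row_count: int
    column_count: int
    hrr_dimension: int
    random_seed: int
    predicate_type: str        # "equality" or "non-equality"
    predicate_length: int
    threshold_formula: str
    threshold_value: float
    retrieved_row_count: int
    zero_match: bool
    code_version: str
    baseline_embeddings: str   # "regenerated", "reused", or "imported"

receipt = ThresholdReceipt(
    source_table="movie",
    row_count=49875,
    column_count=15,
    hrr_dimension=1024,
    random_seed=0,                       # hypothetical
    predicate_type="equality",
    predicate_length=3,                  # hypothetical
    threshold_formula="tau_eq(n, m, d)",
    threshold_value=0.41,                # hypothetical placeholder value
    retrieved_row_count=0,
    zero_match=True,
    code_version="unversioned-example",  # hypothetical
    baseline_embeddings="imported",      # hypothetical
)

receipt_json = json.dumps(asdict(receipt), indent=2)
print(receipt_json)
```

Freezing the dataclass keeps a receipt from being edited after the query runs, and the JSON form can be stored next to the result set it describes.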

This matters because a similarity score is not an answer. A thresholded query can say "return these rows" or "return no rows" under a documented rule. A nearest neighbor alone can only say "this was closest." For public records, enterprise data integration, or RAG over structured tables, that difference is the boundary between evidence and suggestion.

This connects directly to Embeddings and Vector Representations, Vector Databases, Retrieval-Augmented Generation, AI Evaluations, Training Data, AI Audit Trails, JSON Schema, The World Becomes an Embedding, The Vector Database Becomes Institutional Memory, The Table Reference Becomes the Reasoning Error, The Evaluation Bench Becomes the Test Rig, The Factory Manual Becomes the RAG Playground, and The Knowledge Conflict Becomes the Resolution Trace.

Limits

The paper is intentionally scoped. It studies row embeddings, not column embeddings or whole-table embeddings. It studies exact matching through structured select-project queries, not semantic matching, approximate entity linking, or natural-language table QA.

Dimension is a real cost knob. The best projection results require HRR 1024, and the authors note that vectors that are too large may not be desirable in practice. The threshold story is strongest when the vector dimension is high enough for cross-talk noise to be controlled.

The repository improves reproducibility, but the README states that EmbDI embedding generation is not included. A full replication receipt should therefore name whether baseline embeddings were regenerated, reused, or taken from prior artifacts. The broader open question is whether HDC's exact-match guarantees can be combined with pretrained semantic value vectors such as Word2Vec, and whether the same style of thresholded interpretability extends to column or table embeddings.
