Databasin One

Documents and chat

How the agent searches inside the files you upload.

Last updated June 29, 2026

Reading time: 4 min

When you drop a PDF, Word doc, or spreadsheet into Databasin One, the agent can read inside it — not just reference it. This is the under-the-hood walkthrough: what gets stored when you upload, and how the agent finds the right passages when you ask a question.

What an embedding is (and how we actually use it)

An embedding is a vector — a list of numbers, 384 of them in Databasin's case — that captures the meaning of a chunk of text. Two passages that mean similar things produce similar vectors. The classic use is "semantic search": embed your question, then rank chunks by how close their vectors are.
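To make "rank chunks by how close their vectors are" concrete, here is a minimal Python sketch of cosine-similarity ranking, the classic approach described above. The toy 3-number vectors, the sample texts, and the function names are assumptions for illustration; real embeddings here would have 384 numbers, and this is not Databasin's search code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunks, top_k=3):
    """chunks is a list of (text, vector); returns the top_k (score, text), best first."""
    scored = [(cosine_similarity(question_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors stand in for real 384-dimensional embeddings.
chunks = [
    ("Employees accrue paid time off monthly.", [0.9, 0.1, 0.0]),
    ("The server room is on the third floor.", [0.0, 0.2, 0.9]),
]
question_vec = [1.0, 0.0, 0.0]  # pretend embedding of a vacation question
best = rank_chunks(question_vec, chunks, top_k=1)
```

The chunk whose vector points in nearly the same direction as the question's vector comes out on top, regardless of whether the two share any words.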

Databasin generates an embedding for every chunk you upload and stores it next to the text. Here's the honest part, though: on Trino, those vectors aren't used to rank results today. Trino's similarity function expects a sparse map, while Databasin stores a dense 384-number array — the shapes don't match, so the function is disabled for these tables. The embeddings are still written (they keep your uploads compatible with vectors produced by the ingestion pipeline), but the live document search runs a different way, described below.
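The difference between the two shapes can be shown in a few lines of Python. This is only an illustration of what "dense array" and "sparse map" mean; the helper name and the string keys are assumptions, and nothing here is a conversion that Databasin performs.

```python
def dense_to_sparse_map(vector, tolerance=0.0):
    """Hypothetical helper: keep only the non-zero entries, keyed by position."""
    return {str(i): v for i, v in enumerate(vector) if abs(v) > tolerance}

# Dense: one slot per dimension, zeros included (a real vector would have 384 slots).
dense = [0.0, 0.5, 0.0, -0.25]

# Sparse: a map holding only the positions that carry a value.
sparse = dense_to_sparse_map(dense)
```

A function written for the second shape cannot be handed the first one directly, which is the mismatch described above.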

What happens when you upload

When you add a file to Databasin One's data context:

Parsing. PDFs are extracted to text, page by page. CSVs and Excel are read as structured rows.
Chunking. Long text is split into chunks of roughly 80 words with an overlap of roughly 20 words, so a sentence that straddles a boundary usually still appears whole in one of the pieces. Short pages stay as a single chunk.
Embedding. Each chunk is run through a small model in your browser (Transformers.js with the 384-dim bge-small model) or, if you've configured Ollama as a fallback, through Ollama (nomic-embed-text, 768-dim). The resulting vector is attached to the chunk.
Landing. Chunks land in an uploaded_documents table with columns content, source_file, page_number, chunk_number, and databasin_embeddings.
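The chunking and landing steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions: splitting on whitespace, exact sizes of 80 and 20 words, chunk numbers starting at 1, and the function names are all guesses for the example, and the embedding column is left out.

```python
def chunk_words(text, chunk_size=80, overlap=20):
    """Split text into word windows that each share `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if len(words) <= chunk_size:
        # Short pages stay as a single chunk.
        return [" ".join(words)] if words else []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def to_rows(page_text, source_file, page_number):
    """Shape chunks like rows of the uploaded_documents table (embedding omitted)."""
    return [
        {
            "content": chunk,
            "source_file": source_file,
            "page_number": page_number,
            "chunk_number": i,  # assumption: numbering starts at 1
        }
        for i, chunk in enumerate(chunk_words(page_text), start=1)
    ]
```

With these sizes each new window starts 60 words after the previous one, so the last 20 words of one chunk are repeated as the first 20 words of the next.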

CSV and Excel files are handled differently: each one becomes its own structured table, so you can query columns and rows directly.


File size limit

Individual files are capped at 20 MB. For larger documents, split them first. For a workbook with many sheets, save each sheet as its own file.

What happens when you ask a question

When your question hits a data context that has uploaded documents, the agent's semantic-search skill activates and writes SQL against the document table. How it searches depends on the engine:

On Trino, it searches the content column with LIKE keyword matching, made case-insensitive by lowercasing the text and expanded with synonyms, with several terms OR'd together to widen recall. There's no vector-ranking step; the agent itself supplies the "semantics" by expanding your question into related words.
On Databricks, it uses ai_similarity(), a built-in function that scores how close two pieces of text are in meaning on a 0-to-1 scale. The agent orders by that score and keeps the closest matches.
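For the Databricks path, the query the agent writes might look roughly like the one this Python sketch builds. The exact SQL the skill generates is not shown in this article, so the column list, the `score` alias, the default limit, and the use of a named parameter marker (`:question`) are assumptions; only `ai_similarity()` and the table and column names come from the text above.

```python
def build_similarity_query(question, limit=5):
    """Return (sql, params) for a hedged sketch of the Databricks search."""
    limit = int(limit)
    if limit < 1:
        raise ValueError("limit must be at least 1")
    sql = (
        "SELECT content, source_file, page_number, chunk_number,\n"
        "       ai_similarity(content, :question) AS score\n"
        "FROM uploaded_documents\n"
        "ORDER BY score DESC\n"   # closest meaning first
        f"LIMIT {limit}"
    )
    # The question travels as a bound parameter rather than being pasted into the SQL.
    return sql, {"question": question}

sql, params = build_similarity_query("What is the PTO policy?", limit=3)
```

Passing the question as a parameter avoids having to escape quotes in user text; how parameters are bound depends on the client library you use.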

Either way, the agent pulls the top matches, keeps the citation columns (source_file, page_number, chunk_number), and feeds them to the model — which answers citing the specific pages and chunks it used. Every claim points back to a source.
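A minimal Python sketch of that last step, assuming matched rows arrive as dictionaries with the four column names above. The citation tag format and the prompt wording are invented for illustration and are not Databasin's actual prompt.

```python
def format_context(rows):
    """Render each matched chunk with a citation tag the model can repeat in its answer."""
    lines = []
    for row in rows:
        tag = f"[{row['source_file']}, p. {row['page_number']}, chunk {row['chunk_number']}]"
        lines.append(f"{tag} {row['content']}")
    return "\n".join(lines)

def build_prompt(question, rows):
    """Combine the question and the matched chunks into one prompt string."""
    return (
        "Answer using only the excerpts below and cite the tag of each excerpt you use.\n\n"
        f"{format_context(rows)}\n\n"
        f"Question: {question}"
    )

rows = [
    {"content": "Unused paid time off rolls over.", "source_file": "handbook.pdf",
     "page_number": 13, "chunk_number": 1},
]
prompt = build_prompt("Does PTO roll over?", rows)
```

Because the citation columns ride along with each chunk, the model has what it needs to point every claim back to a file, page, and chunk.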

Why multi-term LIKE is a real strategy

Keyword search sounds blunt, but with good term expansion it's a strong tool — and on Trino it is the tool. The skill teaches the model to turn one question into many search terms: abbreviations, synonyms, and word stems.

Example search for "PTO":

SELECT content, source_file, page_number, chunk_number
FROM uploaded_documents
WHERE LOWER(content) LIKE '%pto%'
   OR LOWER(content) LIKE '%paid time off%'
   OR LOWER(content) LIKE '%vacation%'
   OR LOWER(content) LIKE '%personal leave%'

The richer the OR expansion, the better the recall. A typical question expands to four to eight terms.
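Here is a small runnable Python sketch of the expansion idea. It uses an in-memory SQLite database as a stand-in for Trino, and a hard-coded `EXPANSIONS` table as a stand-in for the terms the model would produce; both are assumptions for the example, as are the sample rows and function names.

```python
import sqlite3

def build_like_where(terms):
    """OR together one case-insensitive LIKE test per term; returns (clause, params)."""
    if not terms:
        raise ValueError("at least one search term is required")
    clause = " OR ".join("LOWER(content) LIKE ?" for _ in terms)
    params = [f"%{term.lower()}%" for term in terms]
    return clause, params

# Hypothetical expansion table; in practice the model generates these terms.
EXPANSIONS = {"pto": ["pto", "paid time off", "vacation", "personal leave"]}

def search(conn, keyword):
    """Expand the keyword, then return citation columns for every matching chunk."""
    terms = EXPANSIONS.get(keyword.lower(), [keyword])
    clause, params = build_like_where(terms)
    sql = (
        "SELECT source_file, page_number, chunk_number "
        f"FROM uploaded_documents WHERE {clause} "
        "ORDER BY source_file, page_number, chunk_number"
    )
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploaded_documents "
    "(content TEXT, source_file TEXT, page_number INTEGER, chunk_number INTEGER)"
)
conn.executemany("INSERT INTO uploaded_documents VALUES (?, ?, ?, ?)", [
    ("Employees receive 15 days of Vacation per year.", "handbook.pdf", 12, 1),
    ("Badge readers are installed at every entrance.", "handbook.pdf", 30, 2),
    ("Unused Paid Time Off rolls over.", "handbook.pdf", 13, 1),
])
hits = search(conn, "PTO")
```

Neither matching chunk contains the literal string "PTO"; they are found only because the expansion added "vacation" and "paid time off". Note that short terms can also match inside longer words, so a richer expansion trades some precision for recall.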

Combining documents with structured data

The real payoff is JOINing a document table with a structured one — for example, "find every order whose customer contract mentions auto-renewal" is a join between your orders table and the matching chunks in uploaded_documents.

You don't have to ask for this specially. If your question requires it, the agent writes the join.
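The auto-renewal example can be sketched end to end in Python with an in-memory SQLite database standing in for the warehouse. The `orders` schema, and in particular the `contract_file` column that links an order to a document's `source_file`, is a hypothetical assumption; your own tables will need some comparable key for the join to work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer TEXT, contract_file TEXT);
CREATE TABLE uploaded_documents (
    content TEXT, source_file TEXT, page_number INTEGER, chunk_number INTEGER
);
INSERT INTO orders VALUES
    (1, 'Acme', 'acme_contract.pdf'),
    (2, 'Globex', 'globex_contract.pdf');
INSERT INTO uploaded_documents VALUES
    ('This agreement is subject to auto-renewal each year.', 'acme_contract.pdf', 3, 1),
    ('Either party may terminate with 30 days notice.', 'globex_contract.pdf', 2, 1);
""")

# Join structured rows to the document chunks that mention the concept.
JOIN_SQL = """
SELECT DISTINCT o.order_id, o.customer, d.source_file, d.page_number
FROM orders AS o
JOIN uploaded_documents AS d ON d.source_file = o.contract_file
WHERE LOWER(d.content) LIKE '%auto-renewal%'
   OR LOWER(d.content) LIKE '%automatically renew%'
ORDER BY o.order_id
"""
matches = conn.execute(JOIN_SQL).fetchall()
```

Only the Acme order comes back, along with the contract page that supports it, so the structured answer still carries a citation.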

Privacy and where your data lives

Document content and embeddings stay inside your lakehouse, the same as any other table. The in-browser embedding model runs at upload time, so generating vectors doesn't ship your file anywhere; if you've configured the Ollama fallback, chunk text goes to your Ollama server for embedding instead. At query time the search is plain SQL inside your warehouse. The only data that moves at that point is:

The question itself (sent to the model).
The handful of matching chunks (sent to the model).
The agent's answer (returned to you).

Limits worth knowing

Keyword search has limits. On Trino, "find the passages that mention X" works well; "summarize the themes across 50 documents" is weaker, because keyword matching can't read everything. The agent will tell you when it's searching broadly, and answer from what the keywords surfaced rather than pretend it read the whole corpus.
Spreadsheet content is queried structurally — columns and rows — not as prose. The document-search path is for text.
Cross-document reasoning works only when the documents are in the same data context.

Where to go next

Skills and what they do. The semantic-search skill is one of seven.
Meet the agent. How retrieval plugs into the tool loop.
Multi-engine SQL. Trino vs. Databricks, and why the search differs.
Databasin One overview. The big picture.