Documents and chat
How the agent searches inside the files you upload.
When you drop a PDF, Word doc, or spreadsheet into Databasin One, the agent can read inside it — not just reference it. This is the under-the-hood walkthrough: what gets stored when you upload, and how the agent finds the right passages when you ask a question.
What an embedding is (and how we actually use it)
An embedding is a vector — a list of numbers, 384 of them in Databasin's case — that captures the meaning of a chunk of text. Two passages that mean similar things produce similar vectors. The classic use is "semantic search": embed your question, then rank chunks by how close their vectors are.
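To make "how close their vectors are" concrete, here is a minimal sketch of the classic ranking step. It assumes cosine similarity as the closeness measure (the text above does not name one), and the function names `cosine_similarity` and `rank_chunks` and the three-number toy vectors are illustrative stand-ins for real 384-number embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunk_vecs):
    """Return chunk indices ordered from most to least similar to the question."""
    scores = [cosine_similarity(question_vec, v) for v in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 3-number vectors standing in for real embeddings (assumption for illustration).
question = [1.0, 0.0, 1.0]
chunks = [
    [0.0, 1.0, 0.0],  # points in an unrelated direction
    [2.0, 0.0, 2.0],  # same direction as the question
    [1.0, 1.0, 0.0],  # partly related
]
ranking = rank_chunks(question, chunks)
```

Here `ranking` places the second chunk first, because it points in the same direction as the question even though its numbers are larger; cosine similarity compares direction and ignores length.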
Databasin generates an embedding for every chunk you upload and stores it next to the text. Here's the honest part, though: on Trino, those vectors aren't used to rank results today. Trino's similarity function expects a sparse map, while Databasin stores a dense 384-number array — the shapes don't match, so the function is disabled for these tables. The embeddings are still written (they keep your uploads compatible with vectors produced by the ingestion pipeline), but the live document search runs a different way, described below.
What happens when you upload
When you add a file to Databasin One's data context:
- Parsing. PDFs are extracted to text, page by page. CSV and Excel files are read as structured rows.
- Chunking. Long text is split into ~80-word chunks with a ~20-word overlap, so a sentence that straddles a boundary still appears whole in one of the pieces. Short pages stay as a single chunk.
- Embedding. Each chunk is run through a small model in your browser, Transformers.js with the 384-dim bge-small model, or through Ollama if you've configured it as a fallback (nomic-embed-text, 768-dim). The vector is attached to the chunk.
- Landing. Chunks land in an uploaded_documents table with columns content, source_file, page_number, chunk_number, and databasin_embeddings.
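The chunking step can be sketched as a sliding window over words. This is a rough illustration, not Databasin's actual splitter: the function name `chunk_words`, whitespace tokenization, and exact sizes of 80 and 20 (the text says "~80" and "~20") are all assumptions.

```python
def chunk_words(text, size=80, overlap=20):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    if len(words) <= size:
        # Short pages stay as a single chunk.
        return [" ".join(words)]
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# A 200-word page made of numbered placeholder words.
page = " ".join(f"w{i}" for i in range(200))
pieces = chunk_words(page)
```

With these settings the window advances 60 words at a time, so the last 20 words of one chunk are repeated as the first 20 words of the next.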
CSV and Excel files are handled differently: each one becomes its own structured table, so you can query columns and rows directly.
Individual files are capped at 20 MB. For larger documents, split them first. For a workbook with many sheets, save each sheet as its own file.
What happens when you ask a question
When your question hits a data context that has uploaded documents, the agent's semantic-search skill activates and writes SQL against the document table. How it searches depends on the engine:
- On Trino, it searches the content column with LIKE/ILIKE keyword matching, expanded with synonyms: case-insensitive, several terms OR'd together to widen recall. There's no vector-ranking step; the agent itself supplies the "semantics" by expanding your question into related words.
- On Databricks, it uses ai_similarity(), a built-in function that scores how close two pieces of text are in meaning on a 0-to-1 scale. The agent orders by that score and keeps the closest matches.
Either way, the agent pulls the top matches, keeps the citation columns (source_file, page_number, chunk_number), and feeds them to the model — which answers citing the specific pages and chunks it used. Every claim points back to a source.
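As a rough picture of how the citation columns travel with each match, the sketch below turns result rows into labeled passages. The function name `format_context` and the bracketed label layout are hypothetical; the actual prompt format the agent uses is not described here.

```python
def format_context(rows):
    """Render matched chunks as labeled passages so each one carries its citation."""
    lines = []
    for row in rows:
        label = (
            f"[{row['source_file']}, page {row['page_number']}, "
            f"chunk {row['chunk_number']}]"
        )
        lines.append(f"{label} {row['content']}")
    return "\n".join(lines)

# Two made-up result rows using the column names from uploaded_documents.
matches = [
    {"content": "Employees accrue paid time off monthly.",
     "source_file": "handbook.pdf", "page_number": 4, "chunk_number": 1},
    {"content": "Unused vacation carries over for one year.",
     "source_file": "handbook.pdf", "page_number": 5, "chunk_number": 2},
]
context = format_context(matches)
```

Because every passage arrives with its file, page, and chunk attached, an answer built from this context can point back to the exact source of each claim.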
Why multi-term LIKE is a real strategy
Keyword search sounds blunt, but with good term expansion it's a strong tool — and on Trino it is the tool. The skill teaches the model to turn one question into many search terms: abbreviations, synonyms, and word stems.
Example search for "PTO":
SELECT content, source_file, page_number, chunk_number
FROM uploaded_documents
WHERE LOWER(content) LIKE '%pto%'
   OR LOWER(content) LIKE '%paid time off%'
   OR LOWER(content) LIKE '%vacation%'
   OR LOWER(content) LIKE '%personal leave%'
The richer the OR expansion, the better the recall. A typical question expands to four to eight terms.
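The expansion can be sketched as a small helper that turns a list of terms into one OR'd filter. This is an illustration only: `build_keyword_filter` is a hypothetical name, the term list is supplied by hand rather than by the agent, and an in-memory SQLite table stands in for the warehouse so the example can run anywhere.

```python
import sqlite3

def build_keyword_filter(terms, column="content"):
    """Return (sql_fragment, params): one case-insensitive LIKE per term, OR'd together."""
    if not terms:
        raise ValueError("need at least one search term")
    clauses = [f"LOWER({column}) LIKE ?" for _ in terms]
    # Note: % and _ inside a term are not escaped in this sketch.
    params = [f"%{term.lower()}%" for term in terms]
    return "(" + " OR ".join(clauses) + ")", params

# SQLite stand-in for the document table (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploaded_documents "
    "(content TEXT, source_file TEXT, page_number INTEGER, chunk_number INTEGER)"
)
conn.executemany(
    "INSERT INTO uploaded_documents VALUES (?, ?, ?, ?)",
    [
        ("Employees accrue Paid Time Off monthly.", "handbook.pdf", 4, 1),
        ("Vacation requests need manager approval.", "handbook.pdf", 5, 1),
        ("The office closes at 6 pm.", "handbook.pdf", 9, 1),
    ],
)

where, params = build_keyword_filter(
    ["PTO", "paid time off", "vacation", "personal leave"]
)
rows = conn.execute(
    "SELECT source_file, page_number, chunk_number FROM uploaded_documents "
    f"WHERE {where} ORDER BY page_number, chunk_number",
    params,
).fetchall()
```

Neither matching row contains the literal string "PTO"; they surface only because the expansion added "paid time off" and "vacation", which is the recall gain the OR list buys.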
Combining documents with structured data
The real payoff is JOINing a document table with a structured one — for example, "find every order whose customer contract mentions auto-renewal" is a join between your orders table and the matching chunks in uploaded_documents.
You don't have to ask for this specially. If your question requires it, the agent writes the join.
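A join of this kind might look like the sketch below. It assumes a hypothetical `contract_file` column on the orders table that names each customer's contract file; how your own tables link orders to documents will differ, and the agent writes the join against whatever keys actually exist. SQLite and the sample rows are stand-ins so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical orders table; contract_file is an assumed linking column.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, contract_file TEXT)")
conn.execute(
    "CREATE TABLE uploaded_documents "
    "(content TEXT, source_file TEXT, page_number INTEGER, chunk_number INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Acme", "acme_contract.pdf"), (2, "Globex", "globex_contract.pdf")],
)
conn.executemany(
    "INSERT INTO uploaded_documents VALUES (?, ?, ?, ?)",
    [
        ("This agreement will automatically renew each year.", "acme_contract.pdf", 7, 2),
        ("Either party may terminate with 30 days notice.", "globex_contract.pdf", 3, 1),
    ],
)

# Match orders to the contract chunks that mention auto-renewal,
# keeping the citation columns alongside the order.
query = """
SELECT DISTINCT o.order_id, o.customer, d.source_file, d.page_number
FROM orders AS o
JOIN uploaded_documents AS d ON d.source_file = o.contract_file
WHERE LOWER(d.content) LIKE '%auto-renew%'
   OR LOWER(d.content) LIKE '%automatically renew%'
ORDER BY o.order_id
"""
renewing_orders = conn.execute(query).fetchall()
```

Only the Acme order comes back, along with the contract file and page where the renewal language appears.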
Privacy and where your data lives
Document content and embeddings stay inside your lakehouse, the same as any other table. The embedding model runs in your browser at upload time, so generating vectors doesn't ship your file anywhere. At query time the search is plain SQL inside your warehouse — nothing about a document leaves it except:
- The question itself (sent to the model).
- The handful of matching chunks (sent to the model).
- The agent's answer (returned to you).
Limits worth knowing
- Keyword search has limits. On Trino, "find the passages that mention X" works well; "summarize the themes across 50 documents" is weaker, because keyword matching can't read everything. The agent will tell you when it's searching broadly, and answer from what the keywords surfaced rather than pretend it read the whole corpus.
- Spreadsheet content is queried structurally — columns and rows — not as prose. The document-search path is for text.
- No cross-document reasoning unless the documents are in the same data context.