
Multi-Layer Search: Combining Full-Text, Semantic HNSW, and OCR in One System (2026)

How Taskade combines OpenSearch full-text, 1536-dim HNSW semantic vectors, and file content OCR into a single permission-aware search system. 70% of queries are still keyword lookups.

April 17, 2026 · 24 min read · Stan Chang · AI · #engineering #search #semantic-search

Here is a number that surprised us: 70% of search queries in our production system are simple keyword lookups. Not semantic. Not conceptual. Just "Q1 budget" or "standup notes."

We had spent months building a semantic search layer with 1536-dimensional HNSW vectors, and most users were still typing exact phrases. That did not mean semantic search was useless — the 30% of conceptual queries it handled were disproportionately valuable. But it forced us to rethink our architecture. Instead of replacing keyword search with semantic search, we needed both. And then we needed a third layer for content trapped inside uploaded files.

This post is about what happens when you run three fundamentally different search systems behind one query box and try to merge their results into a single ranked list.


TL;DR: Taskade search combines three layers: OpenSearch BM25 full-text for exact keyword matches, 1536-dimensional HNSW vectors for semantic similarity, and file content OCR for searching inside PDFs and images. Every query is permission-filtered through a BoolQueryBuilder to respect workspace boundaries and 7-tier RBAC. Try it free.


🔍 Why One Search Layer Is Never Enough

When a user types "quarterly revenue" into a search box, they might want three very different things:

  1. A document titled "Quarterly Revenue Report" — an exact keyword match.
  2. A project about "income analysis" — conceptually related but lexically different.
  3. A PDF attachment containing revenue figures — content locked inside a file.

Traditional search handles the first case. Most "AI-powered search" products handle the first two. Almost nobody handles all three with unified ranking and permission awareness.

We built our search system reactively. Each layer was added because the previous ones could not handle emerging query patterns. Here is the timeline:

| Year | Milestone | What We Learned |
|---|---|---|
| 2017-2022 | Basic text search (workspace-level) | Full-text was table stakes |
| 2023 | AI-generated content in search results | AI-generated content needed to be findable too |
| 2023 | "Ask AI" from the search bar | Users expected search to understand their query |
| 2024 | Similarity search (early semantic layer) | Keyword search missed 30% of what users wanted |
| 2025 | Search across AI actions and knowledge projects | Every new feature generated more searchable content |
| 2026 | Search-first data browsing | Search became the primary navigation model |

The lesson is clear: search is not a single algorithm. It is a system. And the system needs to evolve as the product evolves.


⚡ The Three-Layer Architecture

Here is the architecture we run in production. Every query flows through a permission filter first, then fans out to three search layers, and the results are merged into a single ranked list.

[Diagram] User query ("quarterly revenue analysis") → permission filter (BoolQueryBuilder: workspace + RBAC) → fan-out to Layer 1 (full-text), Layer 2 (semantic: HNSW 1536-dim vectors, conceptual similarity), and Layer 3 (file content: OCR pipeline, PDF/image text extraction) → merge & re-rank (normalize scores across layers, deduplicate results) → ranked result list.

Each layer has a fundamentally different approach to finding content. That diversity is the strength of the system — and also its biggest engineering challenge.

| Layer | Technology | What It Finds | When It Wins | Score Range |
|---|---|---|---|---|
| Full-Text (BM25) | OpenSearch | Exact keyword matches | User types specific terms: "Q1 budget" | 0 to infinity (relative TF-IDF) |
| Semantic (HNSW) | OpenSearch k-NN, 1536-dim | Conceptually similar content | User types "revenue analysis," finds "income report" | 0 to 1 (distance-based) |
| File Content (OCR) | OCR pipeline + OpenSearch | Text inside PDFs, images, docs | User searches for text in an uploaded contract | BM25 on extracted text |

Let me walk through each layer.


🏗️ Layer 1: Full-Text Search with OpenSearch BM25

This is the workhorse. Despite all the excitement around vector search and embeddings, BM25 full-text search handles 70% of our production queries. Users with specific intent type specific words, and BM25 serves them better than any embedding model.

How It Works

OpenSearch (AWS-managed) indexes all workspace content: project titles, task content, document text, and AI agent descriptions. BM25 scoring ranks results by term frequency (how often the word appears in this document) and inverse document frequency (how rare the word is across all documents).

This means a document where "revenue" appears 15 times ranks higher than one where it appears once — but only if "revenue" is a relatively uncommon term across the entire index. Common words like "the" or "project" get appropriately downweighted.
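
For reference, the scoring function itself is standard Okapi BM25 (Lucene, and therefore OpenSearch, implements a variant that drops constant factors which do not affect ranking; the defaults are k1 = 1.2 and b = 0.75):

```latex
\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot
  \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\!\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```

Here f(q_i, D) is the term frequency in document D, |D| is the document length, avgdl is the average document length in the index, N is the document count, and n(q_i) is the number of documents containing the term. The |D|/avgdl factor is the length normalization: 15 occurrences in a short document count for more than 15 occurrences in a sprawling one.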

We also handle pre-processing before the query reaches BM25 (a sketch of the resulting query follows the list):

  • Fuzzy matching for typos (edit distance 1-2). Searching "reveneu" still finds "revenue."
  • Mention detection extracts @-mention patterns before the BM25 query runs.
  • Hashtag search extracts #tag patterns and queries the tag index.
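
In OpenSearch terms, the fuzzy keyword layer maps naturally onto a `multi_match` query. A minimal sketch, assuming illustrative field names and boosts rather than Taskade's real schema:

```python
# Illustrative BM25 query with typo tolerance. The "title"/"content" fields
# and the title boost are assumptions, not Taskade's actual index schema.
def build_fulltext_query(text: str) -> dict:
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title^3", "content"],  # weight title matches higher
                "fuzziness": "AUTO",  # edit distance 1-2, scaled by term length
            }
        }
    }
```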

Why OpenSearch, Not Elasticsearch?

Same engine, fully managed, no licensing concerns. For our use case, they are functionally identical. The decision was pragmatic: AWS manages it, we do not have to think about cluster operations. This freed our team to focus on search quality instead of infrastructure.

What BM25 Is Good At

Finding exactly what you typed. "Q1 Budget Review" returns "Q1 Budget Review" as the top result. There is no ambiguity, no conceptual interpretation, no surprises. For 70% of queries, this is exactly what users want.

What BM25 Misses

"Revenue Analysis" will not find "Income Report." These are semantically identical — they mean the same thing — but they share zero keywords. BM25 has no concept of meaning. It matches strings. For the 30% of queries where users are thinking conceptually rather than searching for specific terms, we need something else.

This is where most search systems stop. We needed to go further.


🧬 Layer 2: Semantic Search with HNSW Vectors

The second layer embeds queries and documents as 1536-dimensional vectors and finds nearest neighbors in vector space. This is the layer that understands meaning.

The Embedding Pipeline

When content is created or updated in Taskade, the following happens:

  1. Chunking: The document is split into semantically meaningful chunks. We do not embed entire documents as single vectors — a 50-page project plan has too much content for a single embedding to represent meaningfully.
  2. Embedding: Each chunk is passed through an embedding model that produces a 1536-dimensional vector. This vector captures the semantic meaning of the chunk in a way that similar concepts end up close together in vector space.
  3. Indexing: The vectors are stored in an OpenSearch k-NN index using the HNSW (Hierarchical Navigable Small World) graph algorithm.

At query time, the user's search query goes through the same embedding model, producing a 1536-dimensional query vector. The HNSW index then finds the document vectors closest to this query vector — the nearest neighbors.
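
A minimal sketch of that query-time path using the OpenSearch k-NN query DSL (`embed()` stands in for the embedding model call, and the index and field names are illustrative):

```python
# Illustrative semantic query path. embed(), the "chunks" index, and the
# "embedding" field are stand-ins, not Taskade's actual names.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"])

def semantic_search(query_text: str, k: int = 10) -> dict:
    query_vector = embed(query_text)  # same 1536-dim model used at index time
    return client.search(
        index="chunks",
        body={
            "size": k,
            "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
        },
    )
```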

Why HNSW, Not Flat Vector Search?

This is a critical architectural decision. Flat vector search compares the query vector against every single document vector in the index. That is O(n) — linear time. For millions of documents, this takes seconds, which is unacceptable for interactive search.

HNSW provides approximate nearest neighbor search in O(log n) time. It achieves this by building a multi-layer graph where each layer has decreasing density:

[Diagram] HNSW graph layers. Layer 2 (sparse, long-range links) holds only a few nodes, e.g. Doc A and Doc E; Layer 0 (dense) holds all nodes, Doc A through Doc H.

A search starts at the top sparse layer (long-range jumps), then descends to denser layers for fine-grained navigation. This is what gives HNSW its logarithmic time complexity. The trade-off is approximate results — you might miss the absolute nearest neighbor — but in practice, HNSW achieves 99.5%+ recall at 100x the speed of exact search.

HNSW Tuning Parameters

Getting these right matters more than most engineering blog posts admit. The defaults work for benchmarks. Production is different.

| Parameter | What It Controls | Trade-Off |
|---|---|---|
| ef_construction | Index build quality. Higher = better recall, slower indexing | Recall vs index build time |
| M | Bidirectional links per node. Higher = better recall, larger index | Recall vs memory/disk usage |
| ef_search | Query-time search depth. Higher = better recall, slower queries | Recall vs query latency |
| engine | ANN algorithm implementation (nmslib, faiss, lucene) | Feature set vs performance profile |
| space_type | Distance metric (L2, cosine, inner product) | Must match embedding model's training metric |

We use the nmslib engine with L2 (Euclidean) distance. We started with OpenSearch defaults and tuned based on production query patterns over several months. The specific parameter values are proprietary, but the lesson is universal: tune on your real workload, not on benchmarks. ANN-benchmark results do not predict how your specific query distribution will behave.
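
For concreteness, this is where those parameters live in an OpenSearch k-NN index mapping. The values below are placeholders (ours are proprietary, as noted); the `nmslib` engine and `l2` space type match what we run:

```python
# Illustrative k-NN index mapping. Parameter values are placeholders, not
# Taskade's production settings.
chunk_index = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 128,  # query-time search depth
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",   # must match the embedding model
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 256, "m": 16},
                },
            },
            "workspace_id": {"type": "keyword"},  # for permission filters
            "space_id": {"type": "keyword"},
        }
    },
}
```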

What Semantic Search Is Good At

Finding conceptually related content. A user searching "revenue analysis" finds documents titled "Income Report," "Sales Dashboard," and "Financial Projections." None of these share keywords with the query. BM25 would return zero results. The semantic layer returns exactly what the user was looking for.

This is also essential for AI agents that need to retrieve context from the workspace. As Barry Zhang from Anthropic noted in his talk on building effective agents, everything the model knows is in a 10-20k token context window. Search determines which tokens fill that window. Better search means better agent context, which means better agent output.

What Semantic Search Misses

Very specific terms. If a user searches for a product code like "SKU-7842" or an internal project ID like "PRJ-2026-Q1," semantic embeddings do not help. These identifiers have no semantic meaning — they are arbitrary strings. BM25 handles them perfectly.


📄 Layer 3: File Content OCR

The third layer solves a problem that most search systems ignore entirely: content locked inside uploaded files.

A user uploads a contract PDF. Without OCR, they can search for the file name — "Vendor_Agreement_2026.pdf." With OCR, they can search for "indemnification clause" and find the exact document. The difference is enormous for teams that rely on uploaded documents, images, and scanned materials.

The OCR Pipeline

When a user uploads a file to a Taskade project, a background workflow handles text extraction:

  1. File type detection — PDF, image (PNG, JPG), Word document, or other supported format.
  2. Text extraction — The pipeline routes the file to the appropriate extractor. PDFs use direct text extraction when possible. Scanned documents and images go through OCR. Office documents use format-specific parsers.
  3. Confidence scoring — OCR quality varies wildly. A cleanly scanned PDF produces near-perfect text. A photo of a whiteboard produces noisy, partial text. We store a confidence score alongside every OCR extraction.
  4. Indexing — Extracted text is indexed in OpenSearch alongside native workspace content, with the confidence score influencing result ranking.
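
A minimal sketch of steps 2 and 3 for the image path, assuming Tesseract as the OCR engine (this post does not name our actual extractor):

```python
# Illustrative OCR extraction with a per-file confidence score. Tesseract
# via pytesseract is an assumption here, not necessarily the engine in use.
from PIL import Image
import pytesseract

def extract_with_confidence(image_path: str) -> tuple[str, float]:
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) >= 0:  # conf == -1 marks non-word boxes
            words.append(word)
            confs.append(float(conf))
    if not words:
        return "", 0.0
    # Mean per-word confidence, normalized to 0-1, stored alongside the text.
    return " ".join(words), sum(confs) / len(confs) / 100.0
```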

Why Confidence Scoring Matters

Not all OCR is created equal. A high-resolution scan of a typed document produces text with 99%+ accuracy. A phone photo of handwritten notes might produce text with 60% accuracy. If we treated both equally in search ranking, the noisy OCR results would pollute the result list.

By storing and using confidence scores, we can:

  • Boost high-confidence OCR results (they are almost as reliable as native text).
  • Downweight low-confidence OCR results (they might match, but we are less sure).
  • Skip extremely low-confidence extractions entirely to avoid garbage results.

This makes the difference between a search system that occasionally surfaces nonsense from bad OCR and one that surfaces file content reliably.
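
One way to express that weighting at query time is an OpenSearch `function_score` wrapper: multiply the BM25 score by the stored confidence and filter out extractions below a floor. The field name and floor value here are illustrative:

```python
# Confidence-weighted ranking sketch. "ocr_confidence" and the 0.4 floor
# are illustrative choices, not Taskade's production values.
def ocr_weighted_query(text_query: dict, floor: float = 0.4) -> dict:
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [text_query],
                        # skip extremely low-confidence extractions entirely
                        "filter": [{"range": {"ocr_confidence": {"gte": floor}}}],
                    }
                },
                # multiply BM25 by stored confidence; native (non-OCR) text
                # has no stored value, so "missing": 1.0 fully trusts it
                "field_value_factor": {"field": "ocr_confidence", "missing": 1.0},
                "boost_mode": "multiply",
            }
        }
    }
```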


🔐 Permission-Aware Search: The BoolQueryBuilder

This is the part that does not get enough attention in engineering discussions about search. You can build the best three-layer search architecture in the world, and it is useless if users see results they should not have access to. Or worse — if they do not see results they should have access to.

Taskade's 7-tier RBAC system (Owner, Maintainer, Editor, Commenter, Collaborator, Participant, Viewer) means that permission filtering is not a simple on/off check. Different users have different levels of access to different spaces and projects within the same workspace.

How Permission Filtering Works

Every search query is wrapped in a BoolQueryBuilder that adds permission filters before scoring. The filter structure looks conceptually like this:

[Diagram] User search query → BoolQueryBuilder FILTER clause (permission constraints, applied before scoring): workspace filter (workspace_id = ws_123), space filter (space_id IN [sp_1, sp_2, sp_3]), and visibility filter (public projects: visibility = public; member projects: user_id IN member_ids) → MUST clause: scoring & ranking (only on permitted documents) → filtered, ranked results.
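
Expressed as query DSL (a sketch with illustrative field names, not Taskade's schema), the same structure looks like this. Clauses under `filter` constrain the candidate set without contributing to the score; `must` carries the scored query:

```python
# Permission-aware query sketch. The ctx object and all field names are
# illustrative stand-ins for the real request context and schema.
def permission_filtered_query(user_query: dict, ctx) -> dict:
    return {
        "query": {
            "bool": {
                "must": [user_query],  # the scored text query
                "filter": [            # evaluated before scoring, never scored
                    {"term": {"workspace_id": ctx.workspace_id}},
                    {"terms": {"space_id": ctx.accessible_space_ids}},
                    {
                        "bool": {  # visible if public OR user is a member
                            "should": [
                                {"term": {"visibility": "public"}},
                                {"term": {"member_ids": ctx.user_id}},
                            ],
                            "minimum_should_match": 1,
                        }
                    },
                ],
            }
        }
    }
```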

The critical design choice here is pre-filtering vs post-filtering.

Why We Pre-Filter (And You Should Too)

The naive approach is post-filtering: run the search, get the top 100 results, then remove any results the user does not have access to. This works perfectly in demos with 50 documents. It breaks catastrophically at scale.

Here is why. Suppose a user searches for "budget" and the top 10 results include 8 documents from spaces they do not have access to. After post-filtering, the user sees a "page" of 2 results. The pagination says "Page 1 of 10" but only shows 2 items. The user clicks "Page 2" and sees 3 items. Or zero items. The experience is broken.

| Approach | How It Works | Pagination | Performance | Correctness |
|---|---|---|---|---|
| Post-filter | Search first, remove unauthorized results | Broken. Pages have inconsistent counts. | Fast initial query | Correct but poor UX |
| Pre-filter | Apply permissions before scoring | Correct. Every page is full. | Permission filter in hot path | Correct with good UX |

We chose pre-filtering. Every results page contains exactly the requested number of results (or fewer only if there genuinely are not enough matching documents). Pagination works correctly. The cost is that the permission filter is in the search hot path, which brings its own challenges.

The Permission Filter Bottleneck

For users with access to a small number of spaces, the permission filter is tiny and fast. For enterprise users with access to hundreds or thousands of projects, the filter query itself becomes large and expensive. A BoolQuery with 500 space IDs in a terms filter is non-trivial for OpenSearch to evaluate on every search.

We optimize this by pre-computing access lists. When a user's permissions change (they are added to a space, their role changes, a project's visibility is updated), we update a cached access list. The search query references this pre-computed list rather than dynamically computing permissions at query time.
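
One concrete mechanism for this is the terms-lookup form of OpenSearch's `terms` query, which resolves the pre-computed list from a stored document at query time instead of inlining hundreds of IDs into every request. The index and field names below are hypothetical:

```python
# Terms lookup: the filter references a cached access-list document that is
# rewritten whenever permissions change. "user-access" and "space_ids" are
# invented names for illustration.
access_filter = {
    "terms": {
        "space_id": {
            "index": "user-access",  # index holding per-user access lists
            "id": "user_123",        # this user's access-list document
            "path": "space_ids",     # field listing permitted space IDs
        }
    }
}
```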

This is one of those engineering trade-offs that does not show up in architecture diagrams: the permission model is the hardest part of enterprise search, not the ranking algorithm.

The 7-Tier RBAC Model in Search Context

Taskade's role-based access system has seven tiers, and each affects search visibility differently:

| Role | Search Visibility |
|---|---|
| Owner | All content in the workspace |
| Maintainer | All content in assigned spaces |
| Editor | All content in assigned spaces |
| Commenter | All content in assigned spaces (read-only search) |
| Collaborator | Content in specifically shared projects |
| Participant | Content in assigned tasks and threads |
| Viewer | Public content and explicitly shared items |

The search system respects these boundaries at every layer. A Viewer searching for "budget" will only see results from public projects and items explicitly shared with them, even if the BM25 score for a private document is higher.


🔀 Merging Results: The Hardest Unsolved Problem

Each search layer produces results with fundamentally incompatible scoring metrics. Combining them into a single ranked list is the hardest problem in multi-layer search — and I want to be honest: there is no clean mathematical solution. Every production system uses heuristics and calls it an algorithm.

The Score Normalization Challenge

| Layer | Score Type | Range | Meaning |
|---|---|---|---|
| BM25 | TF-IDF | 0 to infinity | Higher = more keyword matches relative to corpus |
| HNSW | Distance | 0 to 1 (after normalization) | Lower distance = more semantically similar |
| OCR | BM25 on extracted text | 0 to infinity | Same as BM25, but on OCR-extracted content |

You cannot directly compare a BM25 score of 12.5 with a vector distance of 0.23. They measure different things on different scales. A BM25 score of 12.5 might be exceptional for a short query against a large corpus, or mediocre for a long query against a small corpus.

Our Approach

We normalize scores across layers using several strategies:

  1. Min-max normalization within each layer's result set. The top result in each layer gets a score of 1.0, the bottom gets 0.0, and everything else is linearly interpolated.

  2. Query-adaptive weighting. Short, specific queries (1-2 words, no stop words) get boosted BM25 weight. Longer, more descriptive queries get boosted semantic weight. This reflects the observed pattern: users typing "Q1 budget" want exact matches, while users typing "analysis of quarterly revenue trends across departments" want conceptual matches.

  3. Deduplication. The same document may appear in all three layers. A PDF containing the word "revenue" will match BM25, will be semantically similar to revenue-related queries, and will match OCR if it was uploaded as a file. We merge these into a single result entry with a combined score.

  4. Confidence-weighted OCR. OCR results are weighted by extraction confidence. A 99% confidence OCR match is treated almost like native text. A 60% confidence match is significantly downweighted.
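
Putting strategies 1-3 together, a simplified sketch of the merge. The weights are illustrative, and it assumes each layer already reports higher-is-better scores (vector distances converted to similarities upstream):

```python
# Merge sketch: per-layer min-max normalization, query-adaptive weights,
# and deduplication by document ID. The weights are invented examples.
def min_max(hits: list[tuple[str, float]]) -> dict[str, float]:
    """Normalize one layer's scores so its top hit is 1.0, bottom is 0.0."""
    if not hits:
        return {}
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a single-hit layer
    return {doc_id: (s - lo) / span for doc_id, s in hits}

def merge(bm25_hits, semantic_hits, ocr_hits, query: str) -> list[str]:
    short_query = len(query.split()) <= 2  # "Q1 budget" vs descriptive queries
    weights = (
        {"bm25": 0.7, "sem": 0.2, "ocr": 0.1}
        if short_query
        else {"bm25": 0.4, "sem": 0.5, "ocr": 0.1}
    )
    layers = {
        "bm25": min_max(bm25_hits),
        "sem": min_max(semantic_hits),
        "ocr": min_max(ocr_hits),
    }
    combined: dict[str, float] = {}
    for name, normalized in layers.items():
        for doc_id, score in normalized.items():
            # Dedup: a document matching several layers accumulates its
            # weighted contributions into a single combined entry.
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[name] * score
    return sorted(combined, key=combined.get, reverse=True)
```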

Trade-Offs We Accept

Fully semantic results can be surprising. A user searches for "team meeting" and sees a document about "sprint retrospective" in the results. Semantically, these are related. But the user might think, "I did not search for that." We bias toward BM25 for short queries specifically to minimize this surprise.

The weights are tuned, not derived. There is no theorem that tells you the optimal blend ratio between BM25 and vector similarity for a collaborative workspace product. We arrived at our current weights through iterative testing on production query logs. When the query distribution shifts (and it does — new features generate new search patterns), the weights need re-evaluation.

"Search is the most important feature nobody talks about. If users cannot find their data, the data does not exist."


⏱️ Real-Time Indexing: Content Should Be Searchable in Seconds

Building a great search system means nothing if new content takes minutes or hours to appear in results. When a user creates a task in a Taskade project, searches for it 3 seconds later, and gets no results — that is a broken experience.

The Indexing Pipeline

When content changes in Taskade — a new task is created, a document is edited, a file is uploaded — the indexing pipeline activates:

  1. Change event — The content change produces an event that is durably queued.
  2. Event ordering — Events are processed in the correct order. If a user creates a task and then edits it, the index must reflect the edit, not the original version. Event ordering ensures this.
  3. Index update — A durable workflow triggers the appropriate index update activities: BM25 re-indexing for text content, re-embedding for the HNSW layer, and OCR extraction for new file uploads.
  4. Confirmation — The workflow completes when all relevant indexes are updated.

Our automation infrastructure powers this pipeline. The same durable execution engine that runs user-facing workflow automations also runs internal indexing workflows. This gives us retry logic, failure handling, and observability for free.
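
The ordering guarantee in step 2 boils down to one invariant: never apply an index update older than what the index already reflects. A minimal sketch, independent of any particular workflow engine (the in-memory dict stands in for a durable version store):

```python
# Event-ordering guard sketch. A real system would persist versions
# durably; the dict here is a stand-in for illustration.
last_indexed_version: dict[str, int] = {}  # doc_id -> newest applied version

def handle_index_event(doc_id: str, version: int, apply_update) -> bool:
    if version <= last_indexed_version.get(doc_id, -1):
        return False  # stale event: a newer edit was already indexed
    apply_update(doc_id)  # BM25 re-index, re-embed, or OCR extraction
    last_indexed_version[doc_id] = version
    return True
```

OpenSearch can also enforce the same invariant at write time via external versioning (version_type=external), which rejects writes carrying a version number at or below the stored one.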

Latency Target: 2-5 Seconds

We target content being searchable within 2-5 seconds of creation or modification. This is fast enough that users perceive search as "instant" — they create a task, switch to the search bar, type a query, and the task appears. In practice, the gap between creation and the user reaching the search bar is almost always longer than our indexing latency.

The mechanics at a glance:

| Search Feature | Implementation |
|---|---|
| Fuzzy matching | OpenSearch fuzziness parameter (edit distance 1-2) |
| Mention detection | @mention pattern matching pre-search |
| Hashtag search | #tag extraction and indexing |
| Permission filtering | BoolQueryBuilder with workspace + RBAC scoping |
| Real-time indexing | Durable workflows + event ordering for async index updates |
| Confidence scoring | OCR quality metric stored alongside extracted text |

🧪 Production Lessons After Three Years

This search system has been evolving for years, and the full multi-layer architecture has been running in production for over a year, supporting search across millions of documents for teams using Taskade workspaces. Here is what we have learned.

1. Full-Text Is Still King

I cannot stress this enough. 70% of queries are specific keyword searches. The industry narrative is that semantic/vector search replaces BM25. The data says otherwise.

Users with specific intent type specific words. They know the name of the document they want. They remember a phrase from a task. They are looking for a project by title. BM25 serves these queries faster and more accurately than any embedding model.

Semantic search is additive, not a replacement. If you are building a search system and you have to choose where to invest your first engineering month, invest it in better BM25 — better tokenization, better fuzzy matching, better field boosting. Then add semantic as a second layer.

2. HNSW Tuning Matters More Than You Think

We started with OpenSearch's default HNSW parameters. They produced acceptable results on our staging dataset. In production, with real query distributions and real document collections, the defaults left significant recall on the table.

The two parameters that made the biggest difference:

  • ef_construction: Controls how thoroughly the graph is built during indexing. Higher values produce a more connected graph with better recall, but indexing takes longer. We found that the default was too low for our document distribution.
  • M: Controls the number of bidirectional links per node. Higher values improve recall but increase index size. We experimented with several values before finding our sweet spot.

The takeaway: tune on your real workload. ANN-benchmark results are useful for rough comparisons between algorithms, but they do not predict how your specific query distribution will behave against your specific document collection.

3. OCR Quality Varies Wildly

A cleanly scanned, typed document produces near-perfect OCR. A photo of a whiteboard with markers produces garbled text. A PDF exported from a design tool might contain no extractable text at all because the "text" is actually a raster image.

Without confidence scoring, these all contribute equally to search results. With confidence scoring, we can weight OCR results appropriately and avoid surfacing garbage matches from low-quality extractions.

4. Permission Filtering Is the Bottleneck

For users with access to thousands of projects (common in enterprise workspaces), the BoolQuery permission filter becomes large. A terms filter with 500+ space IDs is computationally expensive for OpenSearch to evaluate on every query.

Pre-computing access lists mitigates this. But it introduces a cache invalidation problem: when permissions change, the cached access list must be updated before the next search query, or the user will see stale results (either seeing content they should not, or missing content they should).

This is the kind of problem that does not appear in architecture diagrams or conference talks. It is mundane, unsexy, and absolutely critical.

5. Index Size Management

1536-dimensional vectors are not small. At four bytes per float32 dimension, each vector is ~6 KB (1536 × 4 = 6,144 bytes). Multiply by millions of document chunks, and the HNSW index becomes significant.

We learned to be selective about what gets embedded. Single-word tasks do not need vector embeddings. A task titled "Review" or "Follow up" has no semantic content worth embedding. We apply a minimum length threshold — content below a certain length is only indexed for BM25, not for semantic search. This reduces index size substantially without meaningfully affecting recall, because short content items are almost always found by exact keyword match anyway.
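
The gate itself is trivial; the leverage is in applying it consistently at indexing time. A sketch (the threshold below is an invented example, not our production value):

```python
# Embedding gate sketch: short items skip the vector pipeline and are
# indexed for BM25 only. The 40-character threshold is illustrative.
def should_embed(text: str, min_chars: int = 40) -> bool:
    return len(text.strip()) >= min_chars
```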


🔗 Search and AI Agents: Why This Matters Beyond Search

Search is not just a user-facing feature. It is the foundation for AI agents that operate within the workspace.

When an AI agent needs to find relevant context to answer a question, write a document, or complete a task, it searches the workspace. The quality of that search directly determines the quality of the agent's output. As noted earlier, agents operate within a 10-20k token context window, and search decides which tokens fill that window.

Three search layers give the agent three chances to find relevant context:

  • Keyword catches 70% — the agent often knows the exact terms it needs.
  • Semantic catches the conceptual 20% — the agent can describe what it needs even when it does not know the exact document name.
  • OCR catches what is locked inside files — contracts, reports, and scanned documents that would otherwise be invisible to the agent.

This connects directly to Taskade's Workspace DNA architecture, where Memory (projects) feeds Intelligence (AI agents), and Intelligence triggers Execution (automations). Search is the bridge between Memory and Intelligence. Without reliable search, the agents cannot access the workspace's accumulated knowledge, and the Intelligence layer degrades.

For teams building AI-powered workflows, the search system is not a feature — it is infrastructure. Every automation that retrieves context, every agent that answers a question, every workflow that routes tasks based on content — they all depend on search working correctly, quickly, and within permission boundaries.


🔮 What We Are Building Next

Our multi-layer search architecture is stable and serving production traffic well, but we are far from done. Here is what is on our roadmap:

Better score normalization. The merge problem is never truly solved. We are experimenting with learned ranking models that use query-level features to predict optimal layer weights, rather than relying on static heuristics. Early results are promising but not yet production-ready.

Multi-modal search. Today we extract text from images via OCR. Tomorrow, we want to understand the visual content of images directly — search for "architecture diagram" and find actual diagrams, not just documents that mention "architecture." This requires image embeddings alongside text embeddings, adding a fourth dimension to our multi-layer system.

Agent-powered search. Instead of returning a list of results, the search system could use AI agents to refine queries, explain why results were returned, and synthesize answers from multiple documents. This transforms search from "here are 10 links" to "here is the answer, synthesized from these 3 sources."

Cross-workspace search for enterprise. Currently, search is scoped to a single workspace. Enterprise customers with multiple workspaces want to search across all of them, with the same permission-aware filtering applied at the workspace level. This requires a federated search layer on top of our existing three layers.


Frequently Asked Questions

What is multi-layer search and why does it matter?

Multi-layer search combines multiple search approaches in one system. Taskade uses three layers: full-text BM25 for exact keyword matches, 1536-dimensional HNSW vectors for semantic similarity, and OCR for searching inside uploaded files. This ensures users find content whether they remember the exact words, the general concept, or the content is inside an attachment.

What is HNSW and how does it work for semantic search?

HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest neighbor search in high-dimensional vector spaces. Documents are embedded as 1536-dimensional vectors. When a user searches, their query is also embedded, and HNSW finds the most similar document vectors in O(log n) time, enabling semantic search across millions of documents.

How does Taskade handle search permissions across workspaces?

Every search query is wrapped in a BoolQueryBuilder that adds permission filters for workspace, space, project visibility, and user membership before scoring. This pre-filtering approach respects Taskade's 7-tier RBAC system (Owner, Maintainer, Editor, Commenter, Collaborator, Participant, Viewer) and ensures users only see results they have permission to access, with correct pagination.

How quickly does new content become searchable in Taskade?

Taskade targets 2-5 second indexing latency. When content changes, a durable workflow triggers index update activities, with event ordering ensuring updates are processed in the correct sequence. This means newly created tasks and documents are searchable almost immediately.

Why does Taskade still use BM25 full-text search alongside semantic vectors?

Production data shows that 70% of search queries are specific keyword lookups like project names or exact phrases. BM25 handles these faster and more accurately than vector similarity. Semantic search adds value for the remaining 30% of conceptual queries where users describe what they want rather than using exact terms. The two approaches are complementary, not competing.

How does OCR search work for uploaded files in Taskade?

When a user uploads a file such as a PDF, image, or document, a background workflow triggers text extraction via OCR. The extracted text is indexed in OpenSearch alongside workspace content. Users can then search for text inside attachments, not just file names. A confidence score is stored alongside OCR results to weight lower-quality extractions appropriately.

What is the difference between pre-filtering and post-filtering in search?

Post-filtering runs the search first and then removes unauthorized results. This breaks pagination because if 8 of 10 results are filtered out, the user sees a page with only 2 results. Pre-filtering adds permission constraints before scoring, ensuring every results page is full and correctly ranked. Taskade uses pre-filtering via BoolQueryBuilder to avoid this problem.

How does Taskade merge results from three different search layers?

Each search layer produces scores on different scales. BM25 scores range from 0 to infinity, vector similarity scores range from 0 to 1, and OCR matches are BM25 scores over extracted text, further weighted by extraction confidence. Taskade normalizes scores across layers, weights them based on query characteristics such as length and specificity, deduplicates documents that appear in multiple layers, and produces a single merged ranking.

What HNSW tuning parameters affect search quality?

The two most important HNSW parameters are ef_construction, which controls index build quality and recall, and M, which sets the number of bidirectional links per node. Higher values improve recall but increase index size and build time. The ef_search parameter controls query-time accuracy versus speed. Tuning these from defaults based on production query patterns is essential for good search quality.

Can Taskade search across multiple workspaces?

Users can search across all spaces they have access to within a workspace. The permission filter scopes results per space based on the user's role in Taskade's 7-tier RBAC system. Cross-workspace search for enterprise customers with multiple workspaces is on the roadmap for future releases.

🏁 Conclusion: Build for the 70%, Optimize for the 30%

The most important thing I have learned building this system is that the industry narrative about search is wrong. Semantic search is not replacing keyword search. Vector databases are not making full-text engines obsolete. The future is not "embed everything."

The future is multi-layer: keyword search for the 70% of queries where users know what they want, semantic search for the 30% where they are exploring, and OCR for the content trapped inside files. Wrapped in a permission layer that respects your workspace's access control model.

If you are building search for a collaborative product, start with BM25. Make it excellent. Then add semantic search as an additive layer. Then add file content extraction. And throughout all of it, solve the permission problem first — it will be harder than you think.

Try Taskade's multi-layer search across your projects, documents, and uploaded files. Create a free workspace and see how the three layers work together — search for exact terms, conceptual queries, and text inside uploaded documents.

Further reading from the Taskade engineering blog:

  • What Is Retrieval Augmented Generation (RAG)?
  • Build AI Agents Without Code
  • Context Engineering for Teams
  • Agentic Engineering Without Code
  • How Large Language Models Work: Transformers Explained
  • AI-Native vs AI-Bolted-On
  • What Is Agentic AI?
  • Stop Worshipping Prompts, Start Building Workflows