
Multi-Layer Search: Combining Full-Text, Semantic HNSW, and OCR in One System (2026)

How Taskade combines OpenSearch full-text, 1536-dim HNSW semantic vectors, and file content OCR into a single permission-aware search system. 70% of queries are still keyword lookups.

April 17, 2026 · 24 min read · Stan Chang · AI · #engineering #search #semantic-search

Here is a number that surprised us: 70% of search queries in our production system are simple keyword lookups. Not semantic. Not conceptual. Just "Q1 budget" or "standup notes."

We had spent months building a semantic search layer with 1536-dimensional HNSW vectors, and most users were still typing exact phrases. That did not mean semantic search was useless — the 30% of conceptual queries it handled were disproportionately valuable. But it forced us to rethink our architecture. Instead of replacing keyword search with semantic search, we needed both. And then we needed a third layer for content trapped inside uploaded files.

This post is about what happens when you run three fundamentally different search systems behind one query box and try to merge their results into a single ranked list.


TL;DR: Taskade search combines three layers: OpenSearch BM25 full-text for exact keyword matches, 1536-dimensional HNSW vectors for semantic similarity, and file content OCR for searching inside PDFs and images. Every query is permission-filtered through a BoolQueryBuilder to respect workspace boundaries and 7-tier RBAC. Try it free.


🔍 Why One Search Layer Is Never Enough

When a user types "quarterly revenue" into a search box, they might want three very different things:

  1. A document titled "Quarterly Revenue Report" — an exact keyword match.
  2. A project about "income analysis" — conceptually related but lexically different.
  3. A PDF attachment containing revenue figures — content locked inside a file.

Traditional search handles the first case. Most "AI-powered search" products handle the first two. Almost nobody handles all three with unified ranking and permission awareness.

We built our search system reactively. Each layer was added because the previous ones could not handle emerging query patterns. Here is the timeline:

| Year | Milestone | What We Learned |
|---|---|---|
| 2017-2022 | Basic text search (workspace-level) | Full-text was table stakes |
| 2023 | AI-generated content in search results | AI-generated content needed to be findable too |
| 2023 | "Ask AI" from the search bar | Users expected search to understand their query |
| 2024 | Similarity search (early semantic layer) | Keyword search missed 30% of what users wanted |
| 2025 | Search across AI actions and knowledge projects | Every new feature generated more searchable content |
| 2026 | Search-first data browsing | Search became the primary navigation model |

The lesson is clear: search is not a single algorithm. It is a system. And the system needs to evolve as the product evolves.


⚡ The Three-Layer Architecture

Here is the architecture we run in production. Every query flows through a permission filter first, then fans out to three search layers, and the results are merged into a single ranked list.

[Diagram] User query ("quarterly revenue analysis") → permission filter (BoolQueryBuilder: workspace + RBAC) → fan-out to Layer 1 (full-text), Layer 2 (semantic: HNSW 1536-dim vectors, conceptual similarity), and Layer 3 (file content: OCR pipeline, PDF/image text extraction) → merge & re-rank (normalize scores across layers, deduplicate results) → ranked result list.

Each layer has a fundamentally different approach to finding content. That diversity is the strength of the system — and also its biggest engineering challenge.

| Layer | Technology | What It Finds | When It Wins | Score Range |
|---|---|---|---|---|
| Full-Text (BM25) | OpenSearch | Exact keyword matches | User types specific terms: "Q1 budget" | 0 to infinity (relative TF-IDF) |
| Semantic (HNSW) | OpenSearch k-NN, 1536-dim | Conceptually similar content | User types "revenue analysis," finds "income report" | 0 to 1 (distance-based) |
| File Content (OCR) | OCR pipeline + OpenSearch | Text inside PDFs, images, docs | User searches for text in an uploaded contract | BM25 on extracted text |

Let me walk through each layer.


🏗️ Layer 1: Full-Text Search with OpenSearch BM25

This is the workhorse. Despite all the excitement around vector search and embeddings, BM25 full-text search handles 70% of our production queries. Users with specific intent type specific words, and BM25 serves them better than any embedding model.

How It Works

OpenSearch (AWS-managed) indexes all workspace content: project titles, task content, document text, and AI agent descriptions. BM25 scoring ranks results by term frequency (how often the word appears in this document) and inverse document frequency (how rare the word is across all documents).

This means a document where "revenue" appears 15 times ranks higher than one where it appears once — but only if "revenue" is a relatively uncommon term across the entire index. Common words like "the" or "project" get appropriately downweighted.
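
For reference, the scoring function itself is standard Okapi BM25 (Lucene, and therefore OpenSearch, implements a variant that drops constant factors which do not affect ranking; the defaults are k1 = 1.2 and b = 0.75):

```latex
\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot
  \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\!\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```

Here f(q_i, D) is the term frequency in document D, |D| is the document length, avgdl is the average document length in the index, N is the document count, and n(q_i) is the number of documents containing the term. The |D|/avgdl factor is the length normalization: 15 occurrences in a short document count for more than 15 occurrences in a sprawling one.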

We also handle pre-processing before the query reaches BM25 (a sketch of the resulting query follows the list):

  • Fuzzy matching for typos (edit distance 1-2). Searching "reveneu" still finds "revenue."
  • Mention detection extracts @-mention patterns before the BM25 query runs.
  • Hashtag search extracts #tag patterns and queries the tag index.
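
In OpenSearch terms, the fuzzy keyword layer maps naturally onto a `multi_match` query. A minimal sketch, assuming illustrative field names and boosts rather than Taskade's real schema:

```python
# Illustrative BM25 query with typo tolerance. The "title"/"content" fields
# and the title boost are assumptions, not Taskade's actual index schema.
def build_fulltext_query(text: str) -> dict:
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title^3", "content"],  # weight title matches higher
                "fuzziness": "AUTO",  # edit distance 1-2, scaled by term length
            }
        }
    }
```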

Why OpenSearch, Not Elasticsearch?

Same engine, fully managed, no licensing concerns. For our use case, they are functionally identical. The decision was pragmatic: AWS manages it, we do not have to think about cluster operations. This freed our team to focus on search quality instead of infrastructure.

What BM25 Is Good At

Finding exactly what you typed. "Q1 Budget Review" returns "Q1 Budget Review" as the top result. There is no ambiguity, no conceptual interpretation, no surprises. For 70% of queries, this is exactly what users want.

What BM25 Misses

"Revenue Analysis" will not find "Income Report." These are semantically identical — they mean the same thing — but they share zero keywords. BM25 has no concept of meaning. It matches strings. For the 30% of queries where users are thinking conceptually rather than searching for specific terms, we need something else.

This is where most search systems stop. We needed to go further.


🧬 Layer 2: Semantic Search with HNSW Vectors

The second layer embeds queries and documents as 1536-dimensional vectors and finds nearest neighbors in vector space. This is the layer that understands meaning.

The Embedding Pipeline

When content is created or updated in Taskade, the following happens:

  1. Chunking: The document is split into semantically meaningful chunks. We do not embed entire documents as single vectors — a 50-page project plan has too much content for a single embedding to represent meaningfully.
  2. Embedding: Each chunk is passed through an embedding model that produces a 1536-dimensional vector. This vector captures the semantic meaning of the chunk in a way that similar concepts end up close together in vector space.
  3. Indexing: The vectors are stored in an OpenSearch k-NN index using the HNSW (Hierarchical Navigable Small World) graph algorithm.

At query time, the user's search query goes through the same embedding model, producing a 1536-dimensional query vector. The HNSW index then finds the document vectors closest to this query vector — the nearest neighbors.
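
A minimal sketch of that query-time path using the OpenSearch k-NN query DSL (`embed()` stands in for the embedding model call, and the index and field names are illustrative):

```python
# Illustrative semantic query path. embed(), the "chunks" index, and the
# "embedding" field are stand-ins, not Taskade's actual names.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"])

def semantic_search(query_text: str, k: int = 10) -> dict:
    query_vector = embed(query_text)  # same 1536-dim model used at index time
    return client.search(
        index="chunks",
        body={
            "size": k,
            "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
        },
    )
```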

Why HNSW, Not Flat Vector Search?

This is a critical architectural decision. Flat vector search compares the query vector against every single document vector in the index. That is O(n) — linear time. For millions of documents, this takes seconds, which is unacceptable for interactive search.

HNSW provides approximate nearest neighbor search in O(log n) time. It achieves this by building a multi-layer graph where each layer has decreasing density:

[Diagram] HNSW graph layers. Layer 2 (sparse, long-range links) holds only a few nodes, e.g. Doc A and Doc E; Layer 0 (dense) holds all nodes, Doc A through Doc H.

A search starts at the top sparse layer (long-range jumps), then descends to denser layers for fine-grained navigation. This is what gives HNSW its logarithmic time complexity. The trade-off is approximate results — you might miss the absolute nearest neighbor — but in practice, HNSW achieves 99.5%+ recall at 100x the speed of exact search.

HNSW Tuning Parameters

Getting these right matters more than most engineering blog posts admit. The defaults work for benchmarks. Production is different.

| Parameter | What It Controls | Trade-Off |
|---|---|---|
| ef_construction | Index build quality. Higher = better recall, slower indexing | Recall vs index build time |
| M | Bidirectional links per node. Higher = better recall, larger index | Recall vs memory/disk usage |
| ef_search | Query-time search depth. Higher = better recall, slower queries | Recall vs query latency |
| engine | ANN algorithm implementation (nmslib, faiss, lucene) | Feature set vs performance profile |
| space_type | Distance metric (L2, cosine, inner product) | Must match embedding model's training metric |

We use the nmslib engine with L2 (Euclidean) distance. We started with OpenSearch defaults and tuned based on production query patterns over several months. The specific parameter values are proprietary, but the lesson is universal: tune on your real workload, not on benchmarks. ANN-benchmark results do not predict how your specific query distribution will behave.
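
For concreteness, this is where those parameters live in an OpenSearch k-NN index mapping. The values below are placeholders (ours are proprietary, as noted); the `nmslib` engine and `l2` space type match what we run:

```python
# Illustrative k-NN index mapping. Parameter values are placeholders, not
# Taskade's production settings.
chunk_index = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 128,  # query-time search depth
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",   # must match the embedding model
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 256, "m": 16},
                },
            },
            "workspace_id": {"type": "keyword"},  # for permission filters
            "space_id": {"type": "keyword"},
        }
    },
}
```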

What Semantic Search Is Good At

Finding conceptually related content. A user searching "revenue analysis" finds documents titled "Income Report," "Sales Dashboard," and "Financial Projections." None of these share keywords with the query. BM25 would return zero results. The semantic layer returns exactly what the user was looking for.

This is also essential for AI agents that need to retrieve context from the workspace. As Barry Zhang from Anthropic noted in his talk on building effective agents, everything the model knows is in a 10-20k token context window. Search determines which tokens fill that window. Better search means better agent context, which means better agent output.

What Semantic Search Misses

Very specific terms. If a user searches for a product code like "SKU-7842" or an internal project ID like "PRJ-2026-Q1," semantic embeddings do not help. These identifiers have no semantic meaning — they are arbitrary strings. BM25 handles them perfectly.


📄 Layer 3: File Content OCR

The third layer solves a problem that most search systems ignore entirely: content locked inside uploaded files.

A user uploads a contract PDF. Without OCR, they can search for the file name — "Vendor_Agreement_2026.pdf." With OCR, they can search for "indemnification clause" and find the exact document. The difference is enormous for teams that rely on uploaded documents, images, and scanned materials.

The OCR Pipeline

When a user uploads a file to a Taskade project, a background workflow handles text extraction:

  1. File type detection — PDF, image (PNG, JPG), Word document, or other supported format.
  2. Text extraction — The pipeline routes the file to the appropriate extractor. PDFs use direct text extraction when possible. Scanned documents and images go through OCR. Office documents use format-specific parsers.
  3. Confidence scoring — OCR quality varies wildly. A cleanly scanned PDF produces near-perfect text. A photo of a whiteboard produces noisy, partial text. We store a confidence score alongside every OCR extraction.
  4. Indexing — Extracted text is indexed in OpenSearch alongside native workspace content, with the confidence score influencing result ranking.
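
A minimal sketch of steps 2 and 3 for the image path, assuming Tesseract as the OCR engine (this post does not name our actual extractor):

```python
# Illustrative OCR extraction with a per-file confidence score. Tesseract
# via pytesseract is an assumption here, not necessarily the engine in use.
from PIL import Image
import pytesseract

def extract_with_confidence(image_path: str) -> tuple[str, float]:
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    words, confs = [], []
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) >= 0:  # conf == -1 marks non-word boxes
            words.append(word)
            confs.append(float(conf))
    if not words:
        return "", 0.0
    # Mean per-word confidence, normalized to 0-1, stored alongside the text.
    return " ".join(words), sum(confs) / len(confs) / 100.0
```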

Why Confidence Scoring Matters

Not all OCR is created equal. A high-resolution scan of a typed document produces text with 99%+ accuracy. A phone photo of handwritten notes might produce text with 60% accuracy. If we treated both equally in search ranking, the noisy OCR results would pollute the result list.

By storing and using confidence scores, we can:

  • Boost high-confidence OCR results (they are almost as reliable as native text).
  • Downweight low-confidence OCR results (they might match, but we are less sure).
  • Skip extremely low-confidence extractions entirely to avoid garbage results.

This makes the difference between a search system that occasionally surfaces nonsense from bad OCR and one that surfaces file content reliably.
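
One way to express that weighting at query time is an OpenSearch `function_score` wrapper: multiply the BM25 score by the stored confidence and filter out extractions below a floor. The field name and floor value here are illustrative:

```python
# Confidence-weighted ranking sketch. "ocr_confidence" and the 0.4 floor
# are illustrative choices, not Taskade's production values.
def ocr_weighted_query(text_query: dict, floor: float = 0.4) -> dict:
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [text_query],
                        # skip extremely low-confidence extractions entirely
                        "filter": [{"range": {"ocr_confidence": {"gte": floor}}}],
                    }
                },
                # multiply BM25 by stored confidence; native (non-OCR) text
                # has no stored value, so "missing": 1.0 fully trusts it
                "field_value_factor": {"field": "ocr_confidence", "missing": 1.0},
                "boost_mode": "multiply",
            }
        }
    }
```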


🔐 Permission-Aware Search: The BoolQueryBuilder

This is the part that does not get enough attention in engineering discussions about search. You can build the best three-layer search architecture in the world, and it is useless if users see results they should not have access to. Or worse — if they do not see results they should have access to.

Taskade's 7-tier RBAC system (Owner, Maintainer, Editor, Commenter, Collaborator, Participant, Viewer) means that permission filtering is not a simple on/off check. Different users have different levels of access to different spaces and projects within the same workspace.

How Permission Filtering Works

Every search query is wrapped in a BoolQueryBuilder that adds permission filters before scoring. The filter structure looks conceptually like this:

[Diagram] User search query → BoolQueryBuilder FILTER clause (permission constraints, applied before scoring): workspace filter (workspace_id = ws_123), space filter (space_id IN [sp_1, sp_2, sp_3]), and visibility filter (public projects: visibility = public; member projects: user_id IN member_ids) → MUST clause: scoring & ranking (only on permitted documents) → filtered, ranked results.
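
Expressed as query DSL (a sketch with illustrative field names, not Taskade's schema), the same structure looks like this. Clauses under `filter` constrain the candidate set without contributing to the score; `must` carries the scored query:

```python
# Permission-aware query sketch. The ctx object and all field names are
# illustrative stand-ins for the real request context and schema.
def permission_filtered_query(user_query: dict, ctx) -> dict:
    return {
        "query": {
            "bool": {
                "must": [user_query],  # the scored text query
                "filter": [            # evaluated before scoring, never scored
                    {"term": {"workspace_id": ctx.workspace_id}},
                    {"terms": {"space_id": ctx.accessible_space_ids}},
                    {
                        "bool": {  # visible if public OR user is a member
                            "should": [
                                {"term": {"visibility": "public"}},
                                {"term": {"member_ids": ctx.user_id}},
                            ],
                            "minimum_should_match": 1,
                        }
                    },
                ],
            }
        }
    }
```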

The critical design choice here is pre-filtering vs post-filtering.

Why We Pre-Filter (And You Should Too)

The naive approach is post-filtering: run the search, get the top 100 results, then remove any results the user does not have access to. This works perfectly in demos with 50 documents. It breaks catastrophically at scale.

Here is why. Suppose a user searches for "budget" and the top 10 results include 8 documents from spaces they do not have access to. After post-filtering, the user sees a "page" of 2 results. The pagination says "Page 1 of 10" but only shows 2 items. The user clicks "Page 2" and sees 3 items. Or zero items. The experience is broken.

| Approach | How It Works | Pagination | Performance | Correctness |
|---|---|---|---|---|
| Post-filter | Search first, remove unauthorized results | Broken. Pages have inconsistent counts. | Fast initial query | Correct but poor UX |
| Pre-filter | Apply permissions before scoring | Correct. Every page is full. | Permission filter in hot path | Correct with good UX |

We chose pre-filtering. Every results page contains exactly the requested number of results (or fewer only if there genuinely are not enough matching documents). Pagination works correctly. The cost is that the permission filter is in the search hot path, which brings its own challenges.

The Permission Filter Bottleneck

For users with access to a small number of spaces, the permission filter is tiny and fast. For enterprise users with access to hundreds or thousands of projects, the filter query itself becomes large and expensive. A BoolQuery with 500 space IDs in a terms filter is non-trivial for OpenSearch to evaluate on every search.

We optimize this by pre-computing access lists. When a user's permissions change (they are added to a space, their role changes, a project's visibility is updated), we update a cached access list. The search query references this pre-computed list rather than dynamically computing permissions at query time.
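
One concrete mechanism for this is the terms-lookup form of OpenSearch's `terms` query, which resolves the pre-computed list from a stored document at query time instead of inlining hundreds of IDs into every request. The index and field names below are hypothetical:

```python
# Terms lookup: the filter references a cached access-list document that is
# rewritten whenever permissions change. "user-access" and "space_ids" are
# invented names for illustration.
access_filter = {
    "terms": {
        "space_id": {
            "index": "user-access",  # index holding per-user access lists
            "id": "user_123",        # this user's access-list document
            "path": "space_ids",     # field listing permitted space IDs
        }
    }
}
```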

This is one of those engineering trade-offs that does not show up in architecture diagrams: the permission model is the hardest part of enterprise search, not the ranking algorithm.

The 7-Tier RBAC Model in Search Context

Taskade's role-based access system has seven tiers, and each affects search visibility differently:

| Role | Search Visibility |
|---|---|
| Owner | All content in the workspace |
| Maintainer | All content in assigned spaces |
| Editor | All content in assigned spaces |
| Commenter | All content in assigned spaces (read-only search) |
| Collaborator | Content in specifically shared projects |
| Participant | Content in assigned tasks and threads |
| Viewer | Public content and explicitly shared items |

The search system respects these boundaries at every layer. A Viewer searching for "budget" will only see results from public projects and items explicitly shared with them, even if the BM25 score for a private document is higher.


🔀 Merging Results: The Hardest Unsolved Problem

Each search layer produces results with fundamentally incompatible scoring metrics. Combining them into a single ranked list is the hardest problem in multi-layer search — and I want to be honest: there is no clean mathematical solution. Every production system uses heuristics and calls it an algorithm.

The Score Normalization Challenge

| Layer | Score Type | Range | Meaning |
|---|---|---|---|
| BM25 | TF-IDF | 0 to infinity | Higher = more keyword matches relative to corpus |
| HNSW | Distance | 0 to 1 (after normalization) | Lower distance = more semantically similar |
| OCR | BM25 on extracted text | 0 to infinity | Same as BM25, but on OCR-extracted content |

You cannot directly compare a BM25 score of 12.5 with a vector distance of 0.23. They measure different things on different scales. A BM25 score of 12.5 might be exceptional for a short query against a large corpus, or mediocre for a long query against a small corpus.

Our Approach

We normalize scores across layers using several strategies:

  1. Min-max normalization within each layer's result set. The top result in each layer gets a score of 1.0, the bottom gets 0.0, and everything else is linearly interpolated.

  2. Query-adaptive weighting. Short, specific queries (1-2 words, no stop words) get boosted BM25 weight. Longer, more descriptive queries get boosted semantic weight. This reflects the observed pattern: users typing "Q1 budget" want exact matches, while users typing "analysis of quarterly revenue trends across departments" want conceptual matches.

  3. Deduplication. The same document may appear in all three layers. A PDF containing the word "revenue" will match BM25, will be semantically similar to revenue-related queries, and will match OCR if it was uploaded as a file. We merge these into a single result entry with a combined score.

  4. Confidence-weighted OCR. OCR results are weighted by extraction confidence. A 99% confidence OCR match is treated almost like native text. A 60% confidence match is significantly downweighted.
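
Putting strategies 1-3 together, a simplified sketch of the merge. The weights are illustrative, and it assumes each layer already reports higher-is-better scores (vector distances converted to similarities upstream):

```python
# Merge sketch: per-layer min-max normalization, query-adaptive weights,
# and deduplication by document ID. The weights are invented examples.
def min_max(hits: list[tuple[str, float]]) -> dict[str, float]:
    """Normalize one layer's scores so its top hit is 1.0, bottom is 0.0."""
    if not hits:
        return {}
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a single-hit layer
    return {doc_id: (s - lo) / span for doc_id, s in hits}

def merge(bm25_hits, semantic_hits, ocr_hits, query: str) -> list[str]:
    short_query = len(query.split()) <= 2  # "Q1 budget" vs descriptive queries
    weights = (
        {"bm25": 0.7, "sem": 0.2, "ocr": 0.1}
        if short_query
        else {"bm25": 0.4, "sem": 0.5, "ocr": 0.1}
    )
    layers = {
        "bm25": min_max(bm25_hits),
        "sem": min_max(semantic_hits),
        "ocr": min_max(ocr_hits),
    }
    combined: dict[str, float] = {}
    for name, normalized in layers.items():
        for doc_id, score in normalized.items():
            # Dedup: a document matching several layers accumulates its
            # weighted contributions into a single combined entry.
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[name] * score
    return sorted(combined, key=combined.get, reverse=True)
```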

Trade-Offs We Accept

Fully semantic results can be surprising. A user searches for "team meeting" and sees a document about "sprint retrospective" in the results. Semantically, these are related. But the user might think, "I did not search for that." We bias toward BM25 for short queries specifically to minimize this surprise.

The weights are tuned, not derived. There is no theorem that tells you the optimal blend ratio between BM25 and vector similarity for a collaborative workspace product. We arrived at our current weights through iterative testing on production query logs. When the query distribution shifts (and it does — new features generate new search patterns), the weights need re-evaluation.

"Search is the most important feature nobody talks about. If users cannot find their data, the data does not exist."


⏱️ Real-Time Indexing: Content Should Be Searchable in Seconds

Building a great search system means nothing if new content takes minutes or hours to appear in results. When a user creates a task in a Taskade project, searches for it 3 seconds later, and gets no results — that is a broken experience.

The Indexing Pipeline

When content changes in Taskade — a new task is created, a document is edited, a file is uploaded — the indexing pipeline activates:

  1. Change event — The content change produces an event that is durably queued.
  2. Event ordering — Events are processed in the correct order. If a user creates a task and then edits it, the index must reflect the edit, not the original version. Event ordering ensures this.
  3. Index update — A durable workflow triggers the appropriate index update activities: BM25 re-indexing for text content, re-embedding for the HNSW layer, and OCR extraction for new file uploads.
  4. Confirmation — The workflow completes when all relevant indexes are updated.

Our automation infrastructure powers this pipeline. The same durable execution engine that runs user-facing workflow automations also runs internal indexing workflows. This gives us retry logic, failure handling, and observability for free.
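
The ordering guarantee in step 2 boils down to one invariant: never apply an index update older than what the index already reflects. A minimal sketch, independent of any particular workflow engine (the in-memory dict stands in for a durable version store):

```python
# Event-ordering guard sketch. A real system would persist versions
# durably; the dict here is a stand-in for illustration.
last_indexed_version: dict[str, int] = {}  # doc_id -> newest applied version

def handle_index_event(doc_id: str, version: int, apply_update) -> bool:
    if version <= last_indexed_version.get(doc_id, -1):
        return False  # stale event: a newer edit was already indexed
    apply_update(doc_id)  # BM25 re-index, re-embed, or OCR extraction
    last_indexed_version[doc_id] = version
    return True
```

OpenSearch can also enforce the same invariant at write time via external versioning (version_type=external), which rejects writes carrying a version number at or below the stored one.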

Latency Target: 2-5 Seconds

We target content being searchable within 2-5 seconds of creation or modification. This is fast enough that users perceive search as "instant" — they create a task, switch to the search bar, type a query, and the task appears. In practice, the gap between creation and the user reaching the search bar is almost always longer than our indexing latency.

The mechanics at a glance:

| Search Feature | Implementation |
|---|---|
| Fuzzy matching | OpenSearch fuzziness parameter (edit distance 1-2) |
| Mention detection | @mention pattern matching pre-search |
| Hashtag search | #tag extraction and indexing |
| Permission filtering | BoolQueryBuilder with workspace + RBAC scoping |
| Real-time indexing | Durable workflows + event ordering for async index updates |
| Confidence scoring | OCR quality metric stored alongside extracted text |

🧪 Production Lessons After Three Years

This search system has been evolving for years, and the full multi-layer architecture has been running in production for over a year, supporting search across millions of documents for teams using Taskade workspaces. Here is what we have learned.

1. Full-Text Is Still King

I cannot stress this enough. 70% of queries are specific keyword searches. The industry narrative is that semantic/vector search replaces BM25. The data says otherwise.

Users with specific intent type specific words. They know the name of the document they want. They remember a phrase from a task. They are looking for a project by title. BM25 serves these queries faster and more accurately than any embedding model.

Semantic search is additive, not a replacement. If you are building a search system and you have to choose where to invest your first engineering month, invest it in better BM25 — better tokenization, better fuzzy matching, better field boosting. Then add semantic as a second layer.

2. HNSW Tuning Matters More Than You Think

We started with OpenSearch's default HNSW parameters. They produced acceptable results on our staging dataset. In production, with real query distributions and real document collections, the defaults left significant recall on the table.

The two parameters that made the biggest difference:

  • ef_construction: Controls how thoroughly the graph is built during indexing. Higher values produce a more connected graph with better recall, but indexing takes longer. We found that the default was too low for our document distribution.
  • M: Controls the number of bidirectional links per node. Higher values improve recall but increase index size. We experimented with several values before finding our sweet spot.

The takeaway: tune on your real workload. ANN-benchmark results are useful for rough comparisons between algorithms, but they do not predict how your specific query distribution will behave against your specific document collection.

3. OCR Quality Varies Wildly

A cleanly scanned, typed document produces near-perfect OCR. A photo of a whiteboard with markers produces garbled text. A PDF exported from a design tool might contain no extractable text at all because the "text" is actually a raster image.

Without confidence scoring, these all contribute equally to search results. With confidence scoring, we can weight OCR results appropriately and avoid surfacing garbage matches from low-quality extractions.

4. Permission Filtering Is the Bottleneck

For users with access to thousands of projects (common in enterprise workspaces), the BoolQuery permission filter becomes large. A terms filter with 500+ space IDs is computationally expensive for OpenSearch to evaluate on every query.

Pre-computing access lists mitigates this. But it introduces a cache invalidation problem: when permissions change, the cached access list must be updated before the next search query, or the user will see stale results (either seeing content they should not, or missing content they should).

This is the kind of problem that does not appear in architecture diagrams or conference talks. It is mundane, unsexy, and absolutely critical.

5. Index Size Management

1536-dimensional vectors are not small. At four bytes per float32 dimension, each vector is ~6 KB (1536 × 4 = 6,144 bytes). Multiply by millions of document chunks, and the HNSW index becomes significant.

We learned to be selective about what gets embedded. Single-word tasks do not need vector embeddings. A task titled "Review" or "Follow up" has no semantic content worth embedding. We apply a minimum length threshold — content below a certain length is only indexed for BM25, not for semantic search. This reduces index size substantially without meaningfully affecting recall, because short content items are almost always found by exact keyword match anyway.
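
The gate itself is trivial; the leverage is in applying it consistently at indexing time. A sketch (the threshold below is an invented example, not our production value):

```python
# Embedding gate sketch: short items skip the vector pipeline and are
# indexed for BM25 only. The 40-character threshold is illustrative.
def should_embed(text: str, min_chars: int = 40) -> bool:
    return len(text.strip()) >= min_chars
```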


🔗 Search and AI Agents: Why This Matters Beyond Search

Search is not just a user-facing feature. It is the foundation for AI agents that operate within the workspace.

When an AI agent needs to find relevant context to answer a question, write a document, or complete a task, it searches the workspace. The quality of that search directly determines the quality of the agent's output. As noted earlier, agents operate within a 10-20k token context window, and search decides which tokens fill that window.

Three search layers give the agent three chances to find relevant context:

  • Keyword catches 70% — the agent often knows the exact terms it needs.
  • Semantic catches the conceptual 20% — the agent can describe what it needs even when it does not know the exact document name.
  • OCR catches what is locked inside files — contracts, reports, and scanned documents that would otherwise be invisible to the agent.

This connects directly to Taskade's Workspace DNA architecture, where Memory (projects) feeds Intelligence (AI agents), and Intelligence triggers Execution (automations). Search is the bridge between Memory and Intelligence. Without reliable search, the agents cannot access the workspace's accumulated knowledge, and the Intelligence layer degrades.

For teams building AI-powered workflows, the search system is not a feature — it is infrastructure. Every automation that retrieves context, every agent that answers a question, every workflow that routes tasks based on content — they all depend on search working correctly, quickly, and within permission boundaries.


🔮 What We Are Building Next

Our multi-layer search architecture is stable and serving production traffic well, but we are far from done. Here is what is on our roadmap:

Better score normalization. The merge problem is never truly solved. We are experimenting with learned ranking models that use query-level features to predict optimal layer weights, rather than relying on static heuristics. Early results are promising but not yet production-ready.

Multi-modal search. Today we extract text from images via OCR. Tomorrow, we want to understand the visual content of images directly — search for "architecture diagram" and find actual diagrams, not just documents that mention "architecture." This requires image embeddings alongside text embeddings, adding a fourth dimension to our multi-layer system.

Agent-powered search. Instead of returning a list of results, the search system could use AI agents to refine queries, explain why results were returned, and synthesize answers from multiple documents. This transforms search from "here are 10 links" to "here is the answer, synthesized from these 3 sources."

Cross-workspace search for enterprise. Currently, search is scoped to a single workspace. Enterprise customers with multiple workspaces want to search across all of them, with the same permission-aware filtering applied at the workspace level. This requires a federated search layer on top of our existing three layers.


Frequently Asked Questions

What is multi-layer search and why does it matter?

Multi-layer search combines multiple search approaches in one system. Taskade uses three layers: full-text BM25 for exact keyword matches, 1536-dimensional HNSW vectors for semantic similarity, and OCR for searching inside uploaded files. This ensures users find content whether they remember the exact words, the general concept, or the content is inside an attachment.

What is HNSW and how does it work for semantic search?

HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest neighbor search in high-dimensional vector spaces. Documents are embedded as 1536-dimensional vectors. When a user searches, their query is also embedded, and HNSW finds the most similar document vectors in O(log n) time, enabling semantic search across millions of documents.

How does Taskade handle search permissions across workspaces?

Every search query is wrapped in a BoolQueryBuilder that adds permission filters for workspace, space, project visibility, and user membership before scoring. This pre-filtering approach respects Taskade's 7-tier RBAC system (Owner, Maintainer, Editor, Commenter, Collaborator, Participant, Viewer) and ensures users only see results they have permission to access, with correct pagination.

How quickly does new content become searchable in Taskade?

Taskade targets 2-5 second indexing latency. When content changes, a durable workflow triggers index update activities, with event ordering ensuring updates are processed in the correct sequence. This means newly created tasks and documents are searchable almost immediately.

Why does Taskade still use BM25 full-text search alongside semantic vectors?

Production data shows that 70% of search queries are specific keyword lookups like project names or exact phrases. BM25 handles these faster and more accurately than vector similarity. Semantic search adds value for the remaining 30% of conceptual queries where users describe what they want rather than using exact terms. The two approaches are complementary, not competing.

How does OCR search work for uploaded files in Taskade?

When a user uploads a file such as a PDF, image, or document, a background workflow triggers text extraction via OCR. The extracted text is indexed in OpenSearch alongside workspace content. Users can then search for text inside attachments, not just file names. A confidence score is stored alongside OCR results to weight lower-quality extractions appropriately.

What is the difference between pre-filtering and post-filtering in search?

Post-filtering runs the search first and then removes unauthorized results. This breaks pagination because if 8 of 10 results are filtered out, the user sees a page with only 2 results. Pre-filtering adds permission constraints before scoring, ensuring every results page is full and correctly ranked. Taskade uses pre-filtering via BoolQueryBuilder to avoid this problem.

How does Taskade merge results from three different search layers?

Each search layer produces scores on different scales. BM25 scores range from 0 to infinity, vector similarity scores range from 0 to 1, and OCR matches are BM25 scores over extracted text, further weighted by extraction confidence. Taskade normalizes scores across layers, weights them based on query characteristics such as length and specificity, deduplicates documents that appear in multiple layers, and produces a single merged ranking.

What HNSW tuning parameters affect search quality?

The two most important HNSW parameters are ef_construction, which controls index build quality and recall, and M, which sets the number of bidirectional links per node. Higher values improve recall but increase index size and build time. The ef_search parameter controls query-time accuracy versus speed. Tuning these from defaults based on production query patterns is essential for good search quality.

Can Taskade search across multiple workspaces?

Users can search across all spaces they have access to within a workspace. The permission filter scopes results per space based on the user's role in Taskade's 7-tier RBAC system. Cross-workspace search for enterprise customers with multiple workspaces is on the roadmap for future releases.

🏁 Conclusion: Build for the 70%, Optimize for the 30%

The most important thing I have learned building this system is that the industry narrative about search is wrong. Semantic search is not replacing keyword search. Vector databases are not making full-text engines obsolete. The future is not "embed everything."

The future is multi-layer: keyword search for the 70% of queries where users know what they want, semantic search for the 30% where they are exploring, and OCR for the content trapped inside files. Wrapped in a permission layer that respects your workspace's access control model.

If you are building search for a collaborative product, start with BM25. Make it excellent. Then add semantic search as an additive layer. Then add file content extraction. And throughout all of it, solve the permission problem first — it will be harder than you think.

Try Taskade's multi-layer search across your projects, documents, and uploaded files. Create a free workspace and see how the three layers work together — search for exact terms, conceptual queries, and text inside uploaded documents.

Further reading from the Taskade engineering blog:

  • What Is Retrieval Augmented Generation (RAG)?
  • Build AI Agents Without Code
  • Context Engineering for Teams
  • Agentic Engineering Without Code
  • How Large Language Models Work: Transformers Explained
  • AI-Native vs AI-Bolted-On
  • What Is Agentic AI?
  • Stop Worshipping Prompts, Start Building Workflows