Search Systems at Scale: Avoiding Local Maxima and Other Engineering Lessons | S2 E12 – How AI Is Built – Podcast

Episodes

Search in 5 lines of code. Building a search database from first principles | S2 E29
13 Mar · How AI Is Built
Modern search is broken. There are too many pieces that are glued together.
- Vector databases for semantic search
- Text engines for keywords
- Rerankers to fix the results
- LLMs to understand queries
- Metadata filters for precision
Each piece works well alone.
Together, they often become a mess.
When you glue these systems together, you create:
- Data Consistency Gaps: Your vector store knows about documents your text engine doesn't. Which is right?
- Timing Mismatches: New content appears in one system before another. Users see different results depending on which path their query takes.
- Complexity Explosion: Every new component has to be wired to the ones already in place, so n fully connected components need n(n-1)/2 connections. Three components means three connections. Five means ten.
- Performance Bottlenecks: Each hop between systems adds latency. A 200ms search becomes 800ms after passing through four components.
- Brittle Chains: When one system fails, your entire search breaks. More pieces mean more breaking points.
I recently built a system where we had query-specific post-filters and a requirement to deliver a fixed number of results to the user.
Often, the query had to be run several times to collect enough results.
The outcome was unpredictable latency, a high load on the backend, where some queries hammered the database 10+ times, and a relevance cliff, where results 1-6 looked great but the later ones were poor matches.
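To make the post-filter problem concrete, here is a minimal sketch of the refill loop. It is not the system described in the episode: `search_with_post_filter`, the in-memory `ranked` list, and the page size are all assumptions for illustration. The point is that the number of backend calls depends on how selective the filter is, which is what makes latency unpredictable.

```python
from typing import Callable, List, Tuple


def search_with_post_filter(
    ranked_ids: List[int],
    passes_filter: Callable[[int], bool],
    k: int,
    page_size: int,
) -> Tuple[List[int], int]:
    """Return up to k filtered results and the number of backend calls made."""
    results: List[int] = []
    calls = 0
    offset = 0
    while len(results) < k and offset < len(ranked_ids):
        # One simulated backend call: fetch the next page of ranked candidates.
        page = ranked_ids[offset:offset + page_size]
        calls += 1
        offset += page_size
        # The filter runs after retrieval, so a selective filter
        # throws away most of every page.
        results.extend(doc for doc in page if passes_filter(doc))
    return results[:k], calls


# Hypothetical data: 100 candidates in relevance order, filter keeps 1 in 10.
ranked = list(range(100))
hits, calls = search_with_post_filter(ranked, lambda d: d % 10 == 0, k=5, page_size=10)
print(hits, calls)
```

With a filter that keeps one document in ten, five results cost five round trips, and the later results come from much deeper in the ranking than the first ones.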
Today on How AI Is Built, we are talking to Marek Galovic from TopK.
We talk about how they built a new search database with modern components. "How would search work if we built it today?"
Cloud storage is cheap. Compute is fast. Memory is plentiful.
One system that handles vectors, text, and filters together - not three systems duct-taped into one.
One pass handles everything:
Vector search + Text search + Filters → Single sorted result
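As a rough illustration of what "one pass" means, the sketch below filters and scores every document in a single loop and returns one sorted list. It is not TopK's API or engine: `hybrid_search`, the toy `keyword_score`, the `alpha` weight, and the document layout are all assumptions made for the example.

```python
import heapq
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(terms: Sequence[str], text: str) -> float:
    # Fraction of query terms present in the text (a crude stand-in for BM25).
    tokens = set(text.lower().split())
    return sum(t in tokens for t in terms) / len(terms) if terms else 0.0


def hybrid_search(docs: List[dict], query_vec, terms, lang: str, k: int, alpha: float = 0.5):
    scored = []
    for doc in docs:
        # Metadata filter is applied in the same pass as scoring.
        if doc["lang"] != lang:
            continue
        score = alpha * cosine(query_vec, doc["vec"]) \
            + (1 - alpha) * keyword_score(terms, doc["text"])
        scored.append((score, doc["id"]))
    # Single sorted result: top-k by combined score.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


docs = [
    {"id": "a", "vec": [1.0, 0.0], "text": "rust search engine", "lang": "en"},
    {"id": "b", "vec": [0.6, 0.8], "text": "rust search engine", "lang": "en"},
    {"id": "c", "vec": [1.0, 0.0], "text": "cooking recipes", "lang": "en"},
    {"id": "d", "vec": [1.0, 0.0], "text": "rust search engine", "lang": "de"},
]
top = hybrid_search(docs, [1.0, 0.0], ["rust", "search"], lang="en", k=2)
print(top)
```

Because filtering happens before ranking, the function always returns the best k documents that satisfy the filter, with no refill loop.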
Built with hand-optimized Rust kernels for both x86 and ARM, the system scales to 100M documents with 200ms P99 latency.
The goal is to do search in 5 lines of code.
Marek Galovic:
LinkedIn, Website, TopK Website, TopK Docs
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to TopK and Snowflake Comparison
00:35 Architectural Patterns and Custom Formats
01:30 Query Execution Engine Explained
02:56 Distributed Systems and Rust
04:12 Query Execution Process
06:56 Custom File Formats for Search
11:45 Handling Distributed Queries
16:28 Consistency Models and Use Cases
26:47 Exploring Database Versioning and Snapshots
27:27 Performance Benchmarks: Rust vs. C/C++
29:02 Scaling and Latency in Large Datasets
29:39 GPU Acceleration and Use Cases
31:04 Optimizing Search Relevance and Hybrid Search
34:39 Advanced Search Features and Custom Scoring
38:43 Future Directions and Research in AI
47:11 Takeaways for Building AI Applications
RAG is two things. Prompt Engineering and Search. Keep it Separate | S2 E28
6 Mar · How AI Is Built
John Berryman moved from aerospace engineering to search, then to ML and LLMs. His path: Eventbrite search → GitHub code search → data science → GitHub Copilot. He was drawn to more math and ML throughout his career.
RAG Explained
"RAG is not a thing. RAG is two things." It breaks into:
- Search: finding relevant information
- Prompt engineering: presenting that information to the model
These should be treated as separate problems to optimize.
The Little Red Riding Hood Principle
When prompting LLMs, stay on the path of what models have seen in training. Use formats, structures, and patterns they recognize from their training data:
- For code, use docstrings and proper formatting
- For financial data, use SEC report structures
- Use Markdown for better formatting
Models respond better to familiar structures.
Testing Prompts
Testing strategies:
- Start with "vibe testing": human evaluation of outputs
- Develop systematic tests based on observed failure patterns
- Use token probabilities to measure model confidence
- For few-shot prompts, watch for diminishing returns as examples increase
Managing Token Limits
When designing prompts, divide content into:
- Static elements (boilerplate, instructions)
- Dynamic elements (user inputs, context)
Prioritize content by:
- Must-have information
- Nice-to-have information
- Optional if space allows
Even with larger context windows, efficiency remains important for cost and latency.
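One way to picture the static/dynamic split and the priority tiers is a small prompt assembler. This is a sketch under stated assumptions, not a method from the episode: `build_prompt` is a hypothetical helper, the priority numbers (0 = must-have, 1 = nice-to-have, 2 = optional) are our own encoding, and `count_tokens` uses a whitespace word count where a real system would use the model's tokenizer.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude stand-in for a real tokenizer.
    return len(text.split())


def build_prompt(static: List[str], dynamic: List[Tuple[int, str]], budget: int) -> str:
    """Always keep static parts, then add dynamic parts by priority while they fit."""
    parts = list(static)
    used = sum(count_tokens(p) for p in static)
    # sorted() is stable, so items keep their input order within a tier.
    for _priority, text in sorted(dynamic, key=lambda item: item[0]):
        cost = count_tokens(text)
        if used + cost <= budget:
            parts.append(text)
            used += cost
    return "\n".join(parts)


prompt = build_prompt(
    static=["Answer using only the context below."],
    dynamic=[
        (2, "Related trivia that is rarely needed."),
        (0, "User question: what is CSR?"),
        (1, "Context: CSR stores adjacency lists contiguously."),
    ],
    budget=18,
)
print(prompt)
```

With a budget of 18 "tokens", the must-have and nice-to-have parts fit and the optional part is dropped.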
Completion vs. Chat Models
Chat models are winning despite initial concerns about their constraints:
- Completion models allow more flexibility in document format
- Chat models are more reliable and aligned with common use cases
- Most applications now use chat models, even for completion-like tasks
Applications: Workflows vs. Assistants
Two main LLM application patterns:
- Assistants: Human-in-the-loop interactions where users guide and correct
- Workflows: Decomposed tasks where LLMs handle well-defined steps with safeguards
Breaking Down Complex Problems
Two approaches:
- Horizontal: Split into sequential steps with clear inputs/outputs
- Vertical: Divide by case type, with specialized handling for each scenario
Example: For SOX compliance, break horizontally (understand control, find evidence, extract data, compile report) and vertically (different audit types).
On Agents
Agents exist on a spectrum from assistants to workflows, characterized by:
- Having some autonomy to make decisions
- Using tools to interact with the environment
- Usually requiring human oversight
Best Practices
For building with LLMs:
- Start simple: API key + Jupyter notebook
- Build prototypes and iterate quickly
- Add evaluation as you scale
- Keep users in the loop until models prove reliability
John Berryman:
LinkedIn, X (Twitter), Arcturus Labs, Prompt Engineering for LLMs (Book)
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to RAG: Retrieval and Generation
00:19 Optimizing Retrieval Systems
01:11 Introducing John Berryman
02:31 John's Journey from Search to Prompt Engineering
04:05 Understanding RAG: Search and Prompt Engineering
05:39 The Little Red Riding Hood Principle in Prompt Engineering
14:14 Balancing Static and Dynamic Elements in Prompts
25:52 Assistants vs. Workflows: Choosing the Right Approach
30:15 Defining Agency in AI
30:35 Spectrum of Assistance and Workflows
34:35 Breaking Down Problems Horizontally and Vertically
37:57 SOX Compliance Case Study
40:56 Integrating LLMs into Existing Applications
44:37 Favorite Tools and Missing Features
46:37 Exploring Niche Technologies in AI
52:52 Key Takeaways and Future Directions
Graphs aren't just for specialists anymore. They are one import away | S2 E27
28 Feb· How AI Is Built
Kuzu is an embedded graph database that implements Cypher as a library.
It can be easily integrated into various environments—from scripts and Android apps to serverless platforms.
Its design supports both ephemeral, in-memory graphs (ideal for temporary computations) and large-scale persistent graphs where traditional systems struggle with performance and scalability.
Key Architectural Decisions:
- Columnar Storage: Kuzu stores node and relationship properties in separate, contiguous columns. This design reduces I/O by allowing queries to scan only the needed columns, unlike row-based systems (e.g., Neo4j) that read full records even when only a subset of properties is required.
- Efficient Join Indexing with CSR: The join index is maintained using a Compressed Sparse Row (CSR) format. By sorting and compressing relationship data, Kuzu ensures that adjacent node relationships are stored contiguously, minimizing random I/O and speeding up traversals.
- Vectorized Query Processing: Instead of processing one tuple at a time, Kuzu processes blocks (vectors) of tuples. This block-based (or vectorized) approach reduces function-call overhead and improves cache locality, boosting performance for analytic queries.
- Factorization and ASP Join: For many-to-many queries that can generate enormous intermediate results, Kuzu uses factorization to represent data compactly. Its ASP join algorithm integrates factorization, sequential scanning, and sideways information passing to avoid unnecessary full scans and materializations.
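The CSR idea can be sketched in a few lines. This is a generic textbook CSR layout in plain Python, not Kuzu's on-disk format; `build_csr` and `neighbors_of` are hypothetical names. All neighbors of a node sit in one contiguous slice, so a traversal is two offset lookups plus a sequential read.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build (offsets, neighbors) arrays from a list of (src, dst) edges."""
    offsets = [0] * (num_nodes + 1)
    # Count outgoing edges per source node.
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sum turns counts into start positions.
    for node in range(num_nodes):
        offsets[node + 1] += offsets[node]
    # Sorting by source makes each node's neighbors contiguous.
    neighbors = [dst for _, dst in sorted(edges)]
    return offsets, neighbors


def neighbors_of(offsets: List[int], neighbors: List[int], node: int) -> List[int]:
    # One contiguous slice per node: no pointer chasing.
    return neighbors[offsets[node]:offsets[node + 1]]


offsets, neighbors = build_csr(4, [(3, 0), (0, 2), (1, 2), (0, 1)])
print(offsets, neighbors)
print(neighbors_of(offsets, neighbors, 0))
```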
Kuzu is optimized for read-heavy, analytic workloads. While batched writes are efficient, the system is less tuned for high-frequency, small transactions. Upcoming features include:
- A WebAssembly (Wasm) version for running in browsers.
- Enhanced vector and full-text search indices.
- Built-in graph data science algorithms for tasks like PageRank and centrality analysis.
Kuzu can be a powerful backend for AI applications in several ways:
- Knowledge Graphs: Store and query complex relationships between entities to support natural language understanding, semantic search, and reasoning tasks.
- Graph Data Science: Run built-in graph algorithms (like PageRank, centrality, or community detection) that help uncover patterns and insights, which can improve recommendation systems, fraud detection, and other AI-driven analyses.
- Retrieval-Augmented Generation (RAG): Integrate with large language models by efficiently retrieving relevant, structured graph data. Kuzu's vector search capabilities and fast query processing make it ideal for augmenting AI responses with contextual information.
- Graph Embeddings & ML Pipelines: Serve as the foundation for generating graph embeddings, which are used in downstream machine learning tasks (such as clustering, classification, or link prediction) to enhance model performance.
Semih Salihoğlu:
LinkedIn, Kuzu GitHub, Kuzu Docs
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Graph Databases
00:18 Introducing Kuzu: A Modern Graph Database
01:48 Use Cases and Applications of Kuzu
03:03 Kuzu's Research Origins and Scalability
06:18 Columnar Storage vs. Row-Oriented Storage
10:27 Query Processing Techniques in Kuzu
22:22 Compressed Sparse Row (CSR) Storage
27:25 Vectorization in Graph Databases
31:24 Optimizing Query Processors with Vectorization
33:25 Common Wisdom in Graph Databases
35:13 Introducing ASP Join in Kuzu
35:55 Factorization and Efficient Query Processing
39:49 Challenges and Solutions in Graph Databases
45:26 Write Path Optimization in Kuzu
54:10 Future Developments in Kuzu
57:51 Key Takeaways and Final Thoughts
Knowledge Graphs Won't Fix Bad Data | S2 E26
20 Feb· How AI Is Built
Metadata is the foundation of any enterprise knowledge graph.
By organizing both technical and business metadata, organizations create a “brain” that supports advanced applications like AI-driven data assistants.
The goal is to achieve economies of scale—making data reusable, traceable, and ultimately more valuable.
Juan Sequeda is a leading expert in enterprise knowledge graphs and metadata management. He has spent years solving the challenges of integrating diverse data sources into coherent, accessible knowledge graphs. As Principal Scientist at data.world, Juan provides concrete strategies for improving data quality, streamlining feature extraction, and enhancing model explainability. If you want to build AI systems on a solid data foundation—one that cuts through the noise and delivers reliable, high-performance insights—you need to listen to Juan’s proven methods and real-world examples.
Terms like ontologies, taxonomies, and knowledge graphs aren't new inventions. Ontologies and taxonomies have been studied for centuries, going back to ancient Greece. Google popularized "knowledge graphs" in 2012 by building on decades of semantic web research. Despite the current buzz, these concepts rest on established work.
Traditionally, data lives in siloed applications—each with its own relational databases, ETL processes, and dashboards. When cross-application queries and consistent definitions become painful, organizations face metadata management challenges. The first step is to integrate technical metadata (table names, columns, code lineage) into a unified knowledge graph. Then, add business metadata by mapping business glossaries and definitions to that technical layer.
A modern data catalog should:
- Integrate Multiple Sources: Automatically ingest metadata from databases, ETL tools (e.g., dbt, Fivetran), and BI tools.
- Bridge Technical and Business Views: Link technical definitions (e.g., table "CUST_123") with business concepts (e.g., "Customer").
- Enable Reuse and Governance: Support data discovery, impact analysis, and proper governance while facilitating reuse across teams.
Practical Approaches & Use Cases:
- Start with a Clear Problem: Whether it's reducing churn, improving operational efficiency, or meeting compliance needs, begin by solving a specific pain point.
- Iron Thread Method: Follow one query end-to-end, from identifying a business need to tracing it back to source systems, to gradually build and refine the graph.
- Automation vs. Manual Oversight: Technical metadata extraction is largely automated. For business definitions or text-based entity extraction (e.g., via LLMs), human oversight is key to ensuring accuracy and consistency.
Technical Considerations:
- Entity vs. Property: If you need to attach additional details or reuse an element across contexts, model it as an entity (with a unique identifier). Otherwise, keep it as a simple property.
- Storage Options: The market offers various graph databases: Neo4j, Amazon Neptune, Cosmos DB, TigerGraph, Apache Jena (for RDF), etc. Future trends point toward multimodel systems that allow querying in SQL, Cypher, or SPARQL over the same underlying data.
Juan Sequeda:
LinkedIn; data.world; Semantic Web for the Working Ontologist; Designing and Building Enterprise Knowledge Graphs (before you buy, send Juan a message, he is happy to send you a copy); Catalog & Cocktails (Juan's podcast)
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Knowledge Graphs
00:45 The Role of Metadata in AI
01:06 Building Knowledge Graphs: First Steps
01:42 Interview with Juan Sequeda
02:04 Understanding Buzzwords: Ontologies, Taxonomies, and More
05:05 Challenges and Solutions in Data Management
08:04 Practical Applications of Knowledge Graphs
15:38 Governance and Data Engineering
34:42 Setting the Stage for Data-Driven Problem Solving
34:58 Understanding Consumer Needs and Data Challenges
35:33 Foundations and Advanced Capabilities in Data Management
36:01 The Role of AI and Metadata in Data Maturity
37:56 The Iron Thread Approach to Problem Solving
40:12 Constructing and Utilizing Knowledge Graphs
54:38 Trends and Future Directions in Knowledge Graphs
59:17 Practical Advice for Building Knowledge Graphs
Temporal RAG: Embracing Time for Smarter, Reliable Knowledge Graphs | S2 E25
13 Feb· How AI Is Built
Daniel Davis is an expert on knowledge graphs. He has a background in risk assessment and complex systems—from aerospace to cybersecurity. Now he is working on “Temporal RAG” in TrustGraph.
Time is a critical—but often ignored—dimension in data. Whether it’s threat intelligence, legal contracts, or API documentation, every data point has a temporal context that affects its reliability and usefulness. To manage this, systems must track when data is created, updated, or deleted, and ideally, preserve versions over time.
Three Types of Data:
- Observations:
  - Definition: Measurable, verifiable recordings (e.g., "the hat reads 'Sunday Running Club'").
  - Characteristics: Require supporting evidence and may be updated as new data becomes available.
- Assertions:
  - Definition: Subjective interpretations (e.g., "the hat is greenish").
  - Characteristics: Involve human judgment and come with confidence levels; they may change over time.
- Facts:
  - Definition: Immutable, verified information that remains constant.
  - Characteristics: Rare in dynamic environments because most data evolves; serve as the "bedrock" of trust.
By clearly categorizing data into these buckets, systems can monitor freshness, detect staleness, and better manage dependencies between components (like code and its documentation).
Integrating Temporal Data into Knowledge Graphs:
- Challenge: Traditional knowledge graphs and schemas (e.g., schema.org) rarely integrate time beyond basic metadata. Long documents may only provide a single timestamp, leaving the context of internal details untracked.
- Solution: Attach detailed temporal metadata (such as creation, update, and deletion timestamps) during data ingestion. Use versioning to maintain historical context. This allows systems to:
  - Assess whether data is current or stale.
  - Detect conflicts when updates occur.
  - Employ Bayesian methods to adjust trust metrics as more information accumulates.
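A minimal sketch of timestamped, versioned records might look like the following. It is not TrustGraph's data model: `Version`, `TemporalRecord`, and the field names are hypothetical, chosen only to show how keeping old versions enables "as of" queries and staleness checks.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Version:
    value: str
    created_at: datetime
    superseded_at: Optional[datetime] = None  # set when a newer version arrives


@dataclass
class TemporalRecord:
    key: str
    versions: List[Version] = field(default_factory=list)

    def write(self, value: str, now: datetime) -> None:
        # Close the previous version instead of overwriting it.
        if self.versions and self.versions[-1].superseded_at is None:
            self.versions[-1].superseded_at = now
        self.versions.append(Version(value, created_at=now))

    def as_of(self, when: datetime) -> Optional[str]:
        # Return the value that was current at the given time, if any.
        for v in reversed(self.versions):
            if v.created_at <= when and (v.superseded_at is None or when < v.superseded_at):
                return v.value
        return None

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        return not self.versions or now - self.versions[-1].created_at > max_age


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
rec = TemporalRecord("api-docs")
rec.write("v1 docs", t0)
rec.write("v2 docs", t0 + timedelta(days=30))
print(rec.as_of(t0 + timedelta(days=10)))
print(rec.is_stale(t0 + timedelta(days=100), max_age=timedelta(days=60)))
```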
Key Takeaways:
- Focus on Specialization: Build tools that do one thing well. For example, design a simple yet extensible knowledge graph rather than relying on overly complex ontologies.
- Integrate Temporal Metadata: Always timestamp data operations and version records. This is key to understanding data freshness and evolution.
- Adopt Robust Infrastructure: Use scalable, proven technologies to connect specialized modules via APIs. This reduces maintenance overhead compared to systems overloaded with connectors and extra features.
- Leverage Bayesian Updates: Start with initial trust metrics based on observed data and refine them as new evidence arrives.
- Mind the Big Picture: Avoid working in isolated silos. Emphasize a holistic system design that maintains in situ context and promotes collaboration across teams.
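The episode notes do not say which Bayesian model is used for trust metrics. One common choice, assumed here purely for illustration, is a Beta-Bernoulli update: start from a prior, then count confirming and contradicting evidence. `Trust` is a hypothetical class.

```python
from dataclasses import dataclass


@dataclass
class Trust:
    # Beta(1, 1) is a uniform prior: no opinion before any evidence.
    alpha: float = 1.0  # 1 + number of confirmations
    beta: float = 1.0   # 1 + number of contradictions

    def update(self, confirmed: bool) -> None:
        if confirmed:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)


trust = Trust()
for outcome in [True, True, True, False]:
    trust.update(outcome)
print(round(trust.mean, 3))
```

Three confirmations and one contradiction move the estimate from 0.5 to 4/6, and further evidence keeps refining it.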
Daniel Davis:
Cognitive Core, TrustGraph, YouTube, LinkedIn, Discord
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Temporal Dimensions in Data
00:53 Timestamping and Versioning Data
01:35 Introducing Daniel Davis and Temporal RAG
01:58 Three Buckets of Data: Observations, Assertions, and Facts
03:22 Dynamic Data and Data Freshness
05:14 Challenges in Integrating Time in Knowledge Graphs
09:41 Defining Observations, Assertions, and Facts
12:57 The Role of Time in Data Trustworthiness
46:58 Chasing White Whales in AI
47:58 The Problem with Feature Overload
48:43 Connector Maintenance Challenges
50:02 The Swiss Army Knife Analogy
51:16 API Meshes and Glue Code
54:14 The Importance of Software Infrastructure
01:00:10 The Need for Specialized Tools
01:13:25 Outro and Future Plans
Context is King: How Knowledge Graphs Help LLMs Reason
6 Feb· How AI Is Built
Robert Caulk runs Emergent Methods, a research lab building news knowledge graphs. With a Ph.D. in computational mechanics, he spent 12 years creating open-source tools for machine learning and data analysis. His work on projects like Flowdapt (model serving) and FreqAI (adaptive modeling) has earned over 1,000 academic citations.
His team built AskNews, which he calls "the largest news knowledge graph in production." It's a system that doesn't just collect news - it understands how events, people, and places connect.
Current AI systems struggle to connect information across sources and domains. Simple vector search misses crucial relationships. But building knowledge graphs at scale brings major technical hurdles around entity extraction, relationship mapping, and query performance.
Emergent Methods built a hybrid system combining vector search and knowledge graphs:
- Vector DB (Qdrant) handles initial broad retrieval
- Custom knowledge graph processes relationships
- Translation pipeline normalizes multi-language content
- Entity extraction model identifies key elements
- Context engineering pipeline structures data for LLMs
Implementation Details:
Data Pipeline:
- All content normalized to English for consistent embeddings
- Entity names preserved in original language when untranslatable
- Custom GLiNER News model handles entity extraction
- Retrained every 6 months on fresh data
- Human review validates entity accuracy
Entity Management:
- Base extraction uses BERT-based GLiNER architecture
- Trained on diverse data across topics/regions
- Disambiguation system merges duplicate entities
- Manual override options for analysts
- Metadata tracking preserves relationship context
Knowledge Graph:
- Selective graph construction from vector results
- On-demand relationship processing
- Graph queries via standard Cypher
- Built for specific use cases vs general coverage
- Integration with S3 and other data stores
System Validation:
- Custom "Context is King" benchmark suite
- RAGAS metrics track retrieval accuracy
- Time-split validation prevents data leakage
- Manual review of entity extraction
- Production monitoring of query patterns
Engineering Insights:

Key Technical Decisions:
- English normalization enables consistent embeddings
- Hybrid vector + graph approach balances speed/depth
- Selective graph construction keeps costs down
- Human-in-loop validation maintains quality
Dead Ends Hit:
- Full multi-language entity system too complex
- Real-time graph updates not feasible at scale
- Pure vector or pure graph approaches insufficient
Top Quotes:
- "At its core, context engineering is about how we feed information to AI. We want clear, focused inputs for better outputs. Think of it like talking to a smart friend - you'd give them the key facts in a way they can use, not dump raw data on them." - Robert
- "Strong metadata paints a high-fidelity picture. If we're trying to understand what's happening in Ukraine, we need to know not just what was said, but who said it, when they said it, and what voice they used to say it. Each piece adds color to the picture." - Robert
- "Clean data beats clever models. You can throw noise at an LLM and get something that looks good, but if you want real accuracy, you need to strip away the clutter first. Every piece of noise pulls the model in a different direction." - Robert
- "Think about how the answer looks in the real world. If you're comparing apartments, you'd want a table. If you're tracking events, you'd want a timeline. Match your data structure to how humans naturally process that kind of information." - Nico
- "Building knowledge graphs isn't about collecting everything - it's about finding the relationships that matter. Most applications don't need a massive graph. They need the right connections for their specific problem." - Robert
- "The quality of your context sets the ceiling for what your AI can do. You can have the best model in the world, but if you feed it noisy, unclear data, you'll get noisy, unclear answers. Garbage in, garbage out still applies." - Robert
- "When handling multiple languages, it's better to normalize everything to one language than to try juggling many. Yes, you lose some nuance, but you gain consistency. And consistency is what makes these systems reliable." - Robert
- "The hard part isn't storing the data - it's making it useful. Anyone can build a database. The trick is structuring information so an AI can actually reason with it. That's where context engineering makes the difference." - Robert
- "Start simple, then add complexity only when you need it. Most teams jump straight to sophisticated solutions when they could get better results by just cleaning their data and thinking carefully about how they structure it." - Nico
- "Every token in your context window is precious. Don't waste them on HTML tags or formatting noise. Save that space for the actual signal - the facts, relationships, and context that help the AI understand what you're asking." - Nico
Robert Caulk:
LinkedIn, Emergent Methods, AskNews
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Context Engineering
00:24 Curating Input Signals
01:01 Structuring Raw Data
03:05 Refinement and Iteration
04:08 Balancing Breadth and Precision
06:10 Interview Start
08:02 Challenges in Context Engineering
20:25 Optimizing Context for LLMs
45:44 Advanced Cypher Queries and Graphs
46:43 Enrichment Pipeline Flexibility
47:16 Combining Graph and Semantic Search
49:23 Handling Multilingual Entities
52:57 Disambiguation and Deduplication Challenges
55:37 Training Models for Diverse Domains
01:04:43 Dealing with AI-Generated Content
01:17:32 Future Developments and Final Thoughts
Inside Vector Database Quantization: Product, Binary, and Scalar | S2 E23
31 Jan· How AI Is Built
When you store vectors, each number takes up 32 bits.
With 1000 numbers per vector and millions of vectors, costs explode.
A simple chatbot can cost thousands per month just to store and search through vectors.
The Fix: Quantization
Think of it like image compression. JPEGs look almost as good as raw photos but take up far less space. Quantization does the same for vectors.

Today we are back continuing our series on search with Zain Hasan, a former ML engineer at Weaviate and now a Senior AI/ML Engineer at Together. We talk about the different types of quantization, when to use them, how to use them, and their tradeoffs.
Three Ways to Quantize:
- Binary Quantization
  - Turn each number into just 0 or 1
  - Ask: "Is this dimension positive or negative?"
  - Works great for 1000+ dimensions
  - Cuts memory by 97%
  - Best for normally distributed data
- Product Quantization
  - Split vector into chunks
  - Group similar chunks
  - Store cluster IDs instead of full numbers
  - Good when binary quantization fails
  - More complex but flexible
- Scalar Quantization
  - Use 8 bits instead of 32
  - Simple middle ground
  - Keeps more precision than binary
  - Less savings than binary
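The two simpler schemes fit in a few lines of plain Python. This is an illustrative sketch, not any vector database's implementation: the function names and the fixed value range `lo`/`hi` are assumptions (real systems typically derive the range from the data). Going from 32 bits to 1 bit per dimension is a 32x reduction (about 97% less memory); going from 32 to 8 bits is 4x (75% less). Product quantization needs a clustering step and is left out here.

```python
from typing import List


def binary_quantize(vec: List[float]) -> List[int]:
    # Keep only the sign of each dimension: 1 bit instead of 32.
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: List[int], b: List[int]) -> int:
    # Distance between binary codes: number of differing bits.
    return sum(x != y for x, y in zip(a, b))


def scalar_quantize(vec: List[float], lo: float, hi: float) -> List[int]:
    # Map each value from [lo, hi] onto 256 levels: 8 bits instead of 32.
    scale = 255 / (hi - lo)
    return [round((min(max(x, lo), hi) - lo) * scale) for x in vec]


def scalar_dequantize(codes: List[int], lo: float, hi: float) -> List[float]:
    # Approximate reconstruction; the rounding error is the precision lost.
    return [lo + c * (hi - lo) / 255 for c in codes]


v = [0.5, -0.25, 1.0, -1.0]
bits = binary_quantize(v)
codes = scalar_quantize(v, lo=-1.0, hi=1.0)
print(bits, codes)
print(scalar_dequantize(codes, lo=-1.0, hi=1.0))
```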

Key Quotes:
- "Vector databases are pretty much the commercialization and the productization of representation learning."
- "I think quantization, it builds on the assumption that there is still noise in the embeddings. And if I'm looking, it's pretty similar as well to the thought of Matryoshka embeddings that I can reduce the dimensionality."
- "Going from text to multimedia in vector databases is really simple."
- "Vector databases allow you to take all the advances that are happening in machine learning and now just simply turn a switch and use them for your application."
Zain Hasan:
LinkedIn, X (Twitter), Weaviate, Together
Nicolay Gerold:
LinkedIn, X (Twitter)
vector databases, quantization, hybrid search, multi-vector support, representation learning, cost reduction, memory optimization, multimodal recommender systems, brain-computer interfaces, weather prediction models, AI applications
Local-First Search: How to Push Search To End-Devices | S2 E22
23 Jan· How AI Is Built
Alex Garcia is a developer focused on making vector search accessible and practical. As he puts it: "I'm a SQLite guy. I use SQLite for a lot of projects... I want an easier vector search thing that I don't have to install 10,000 dependencies to use."
Core Mantra: "Simple, Local, Scalable"

Why SQLite Vec?
"I didn't go along thinking, 'Oh, I want to build vector search, let me find a database for it.' It was much more like: I use SQLite for a lot of projects, I want something lightweight that works in my current workflow."
SQLiteVec uses row-oriented storage with some key design choices:
- Vectors are stored in large chunks (megabytes) as blobs
- Data is split across 4KB SQLite pages, which affects analytical performance
- Currently uses brute force linear search without ANN indexing
- Supports binary quantization for 32x size reduction
- Handles tens to hundreds of thousands of vectors efficiently
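To show what "brute force linear search" over binary-quantized vectors means, here is a standalone sketch in plain Python. It does not use the sqlite-vec extension or its SQL interface; `pack_bits` and `knn_hamming` are hypothetical helpers. Every stored code is compared with the query, with no index involved.

```python
from typing import List


def pack_bits(vec: List[float]) -> int:
    # Binary quantization: one sign bit per dimension, packed into an int.
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def knn_hamming(query: List[float], packed: List[int], k: int) -> List[int]:
    q = pack_bits(query)
    # Linear scan: XOR then popcount gives the Hamming distance to every vector.
    dists = [(bin(q ^ code).count("1"), idx) for idx, code in enumerate(packed)]
    return [idx for _, idx in sorted(dists)[:k]]


vectors = [
    [0.9, 0.4, -0.3, -0.8],
    [0.7, -0.2, 0.5, -0.1],
    [-0.6, -0.9, 0.2, 0.3],
    [0.1, 0.8, 0.6, -0.4],
]
packed = [pack_bits(v) for v in vectors]
nearest = knn_hamming([1.0, 1.0, -1.0, -1.0], packed, k=2)
print(nearest)
```

The cost grows linearly with the number of vectors, which is why the practical limits in the notes are stated in terms of vector count.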
Practical limits:
- 500ms search time for 500K vectors (768 dimensions)
- Best performance under 100ms for user experience
- Binary quantization enables scaling to ~1M vectors
- Metadata filtering and partitioning coming soon
Key advantages:
- Fast writes for transactional workloads
- Simple single-file database
- Easy integration with existing SQLite applications
- Leverages SQLite's mature storage engine
Garcia's preferred tools for local AI:
- Sentence Transformers models converted to GGUF format
- llama.cpp for inference
- Small models (30MB) for basic embeddings
- Larger models like Arctic Embed (hundreds of MB) for recent topics
- sqlite-lembed extension for text embeddings
- Transformers.js for browser-based implementations
1. Choose Your Storage
"There's two ways of storing vectors within SQLiteVec. One way is a manual way where you just store a JSON array... [second is] using a virtual table."
- Traditional row storage: Simple, flexible, good for small vectors
- Virtual table storage: Optimized chunks, better for large datasets
- Performance sweet spot: Up to 500K vectors with 500ms search time
2. Optimize Performance
"With binary quantization it's 1/32 of the space... and holds up at 95 percent quality"
- Binary quantization reduces storage 32x with 95% quality
- Default page size is 4KB, so plan your vector storage accordingly
- Metadata filtering dramatically improves search speed
3. Integration Patterns
"It's a single file, right? So you can like copy and paste it if you want to make a backup."
- Two storage approaches: manual columns or virtual tables
- Easy backups: single file database
- Cross-platform: desktop, mobile, IoT, browser (via WASM)
4. Real-World Tips
"I typically choose the really small model... it's 30 megabytes. It quantizes very easily... I like it because it's very small, quick and easy."
- Start with smaller, efficient models (30MB range)
- Use binary quantization before trying complex solutions
- Plan for partitioning when scaling beyond 100K vectors
Alex Garcia:
LinkedIn, X (Twitter), GitHub, sqlite-vec, sqlite-vss, Website
Nicolay Gerold:
LinkedIn, X (Twitter)
AI-Powered Search: Context Is King, But Your RAG System Ignores Two-Thirds of It | S2 E21
9 Jan· How AI Is Built
Today, I (Nicolay Gerold) sit down with Trey Grainger, author of the book AI-Powered Search. We discuss the different techniques for search and recommendations and how to combine them.
While RAG (Retrieval-Augmented Generation) has become a buzzword in AI, Trey argues that the current understanding of "RAG" is overly simplified – it's actually a bidirectional process he calls "GARRAG," where retrieval and generation continuously enhance each other.
Trey uses a three-context framework for search architecture:
- Content Context: Traditional document understanding and retrieval
- User Context: Behavioral signals driving personalization and recommendations
- Domain Context: Knowledge graphs and semantic understanding
Trey shares insights on:
- Why collecting and properly using user behavior signals is crucial yet often overlooked
- How to implement "light touch" personalization without trapping users in filter bubbles
- The evolution from simple vector similarity to sophisticated late interaction models
- Why treating search as a non-linear pipeline with feedback loops leads to better results
For engineers building search systems, Trey offers practical advice on choosing the right tools and techniques, from traditional search engines like Solr and Elasticsearch to modern approaches like ColBERT.
He also explains how to layer different techniques to make search tunable and debuggable.
Quotes:
"I think of whether it's search or generative AI, I think of all of these systems as nonlinear pipelines.""The reason we use retrieval when we're working with generative AI is because A generative AI model these LLMs will take your query, your request, whatever you're asking for. They will then try to interpret them and without access to up to date information, without access to correct information, they will generate a response from their highly compressed understanding of the world. And so we use retrieval to augment them with information.""I think the misconception is that, oh, hey, for RAG I can just, plug in a vector database and a couple of libraries and, a day or two later everything's magically working and I'm off to solve the next problem. Because search and information retrieval is one of those problems that you never really solve. You get it, good enough and quit, or you find so much value in it, you just continue investing to constantly make it better.""To me, they're, search and recommendations are fundamentally the same problem. They're just using different contexts.""Anytime you're building a search system, whether it's traditional search, whether it's RAG for generative AI, you need to have all three of those contexts in order to effectively get the most relevant results to solve solve the problem.""There's no better way to make your users really angry with you than to stick them in a bucket and get them stuck in that bucket, which is not their actual intent."
Trey Grainger:
LinkedIn, AI Powered Search (Community), AI Powered Search (Book)
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Search Challenges
00:50 Layered Approach to Ranking
01:00 Personalization and Signal Boosting
02:25 Broader Principles in Software Engineering
02:51 Interview with Trey Grainger
03:32 Understanding RAG and Retrieval
04:35 Nonlinear Pipelines in Search
06:01 Generative AI and Retrieval
08:10 Search Renaissance and AI
10:27 Misconceptions in AI-Powered Search
18:12 Search vs. Recommendation Systems
22:26 Three Buckets of Relevance
38:19 Traditional Learning to Rank
39:11 Semantic Relevance and User Behavior
39:53 Layered Ranking Algorithms
41:40 Personalization in Search
43:44 Technological Setup for Query Understanding
48:21 Personalization and User Behavior Vectors
52:10 Choosing the Right Search Engine
56:35 Future of AI-Powered Search
01:00:48 Building Effective Search Applications
01:06:50 Three Critical Context Frameworks
01:12:08 Modern Search Systems and Contextual Understanding
01:13:37 Conclusion and Recommendations
Chunking for RAG: Stop Breaking Your Documents Into Meaningless Pieces | S2 E20
3 Jan· How AI Is Built
Today we are back, continuing our series on search. We are talking to Brandon Smith about his work for Chroma. He led one of the largest studies in the field on different chunking techniques. So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.
The biggest lie in RAG is that semantic search is simple. The reality is that it's easy to build, it's easy to get up and running, but it's really hard to get right. And if you don't have a good setup, it's near impossible to debug. One of the reasons it's really hard is actually chunking. And there are a lot of things you can get wrong.
Even OpenAI botched it a little bit, in my opinion, by using an 800-token length for the chunks. This might work for legal text, where you have a lot of boilerplate that carries little semantic meaning, but often you have the opposite: very information-dense content. Imagine fitting an entire Wikipedia page into the size of a tweet. A lot of information will be lost, and that is what happens with long chunks. The next issue is overlap. OpenAI uses a 400-token overlap, or used to. The idea is to bring the important context into the chunk, but in reality we don't know where that context is coming from.
It could be from a few pages prior, not just the 400 tokens before. It could also be from a definition that's not even in the document at all. There is a really interesting solution from Anthropic, Contextual Retrieval, where you preprocess all the chunks to see whether any information is missing and try to reintroduce it.
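To make the chunk-size and overlap trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Chroma's or OpenAI's implementation: the function name `chunk_tokens` is made up for this example, and whitespace splitting stands in for a real tokenizer. The defaults of 800 and 400 simply mirror the numbers mentioned above.

```python
def chunk_tokens(tokens, chunk_size=800, overlap=400):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption; a real pipeline would use the
# tokenizer of its embedding model.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one. The overlap only ever carries context from the immediately preceding window, which is exactly the limitation described above.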
Brandon Smith:
LinkedIn, X (Twitter), Website, Chunking Article
Nicolay Gerold:
LinkedIn, X (Twitter), Website
00:00 The Biggest Lie in RAG: Semantic Search Simplified
00:43 Challenges in Chunking and Overlap
01:38 Introducing Brandon Smith and His Research
02:05 The Motivation and Mechanics of Chunking
04:40 Issues with Current Chunking Methods
07:04 Optimizing Chunking Strategies
23:04 Introduction to Chunk Overlap
24:23 Exploring LLM-Based Chunking
24:56 Challenges with Initial Approaches
28:17 Alternative Chunking Methods
36:13 Language-Specific Considerations
38:41 Future Directions and Best Practices
How AI Can Start Teaching Itself - Synthetic Data Deep Dive | S2 E18
19 Dec 2024· How AI Is Built
Most LLMs you use today already use synthetic data.
It’s not a thing of the future.
The large labs use a large model (e.g. gpt-4o) to generate training data for a smaller one (gpt-4o-mini).
This lets you build fast, cheap models that do one thing well.
This is “distillation”.
But the vision for synthetic data is much bigger.
The goal is to enable people to train specialized AI systems without having a lot of training data.
Today we are talking to Adrien Morisot, an ML engineer at Cohere.
We talk about how Cohere uses synthetic data to train their models, their learnings, and how you can use synthetic data in your training.
We are slightly diverging from our search focus, but I wanted to create a deeper dive into synthetic data after our episode with Saahil.
You could use it in a lot of places: generate hard negatives, generate training samples for classifiers and rerankers and much more.
Scaling Synthetic Data Creation: https://arxiv.org/abs/2406.20094
Adrien Morisot:
LinkedIn, X (Twitter), Cohere
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Synthetic Data in LLMs
00:18 Distillation and Specialized AI Systems
00:39 Interview with Adrien Morisot
02:00 Early Challenges with Synthetic Data
02:36 Breakthroughs and Rediscovery
03:54 The Evolution of AI and Synthetic Data
07:51 Data Harvesting and Internet Scraping
09:28 Generating Diverse Synthetic Data
15:37 Manual Review and Quality Control
17:28 Automating Data Evaluation
18:54 Fine-Tuning Models with Synthetic Data
21:45 Avoiding Behavioral Cloning
23:47 Ensuring Model Accuracy with Verification
24:31 Adapting Models to Specific Domains
26:41 Challenges in Financial and Legal Domains
28:10 Improving Synthetic Data Sets
30:45 Evaluating Model Performance
32:21 Using LLMs as Judges
35:42 Practical Tips for AI Practitioners
41:26 Synthetic Data in Training Processes
43:51 Quality Control in Synthetic Data
45:41 Domain Adaptation Strategies
46:51 Future of Synthetic Data Generation
47:30 Conclusion and Next Steps
A Search System That Learns As You Use It (Agentic RAG) | S2 E18
13 Dec 2024· How AI Is Built
Modern RAG systems build on flexibility.
At their core, they match each query with the best tool for the job.
They know which tool fits each task. When you ask about sales numbers, they reach for SQL. When you need company policies, they use vector search or BM25. The key is building systems that can switch between these tools smoothly.
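As a rough illustration of this kind of tool routing, the sketch below picks a retrieval tool per query. Everything in it is hypothetical: the keyword pattern, the tool names, and the placeholder handlers are assumptions for the example, and an agentic system would more likely let an LLM or a classifier make the decision.

```python
import re

# Hypothetical hint words that suggest a structured, numeric question.
SQL_HINTS = re.compile(r"\b(sales|revenue|how many|total|average|count)\b", re.IGNORECASE)


def route(query: str) -> str:
    """Return the name of the retrieval tool that should handle the query."""
    if SQL_HINTS.search(query):
        return "sql"
    return "vector_or_bm25"


# Placeholder handlers; real ones would query a database or a search index.
TOOLS = {
    "sql": lambda q: f"[would run a SQL query for: {q}]",
    "vector_or_bm25": lambda q: f"[would search documents for: {q}]",
}


def answer(query: str) -> str:
    """Dispatch the query to whichever tool the router selected."""
    return TOOLS[route(query)](query)


print(answer("What were our sales numbers in Q3?"))
print(answer("What is the company policy on remote work?"))
```

Keeping the router separate from the tools means either side can be swapped out later, for example replacing the regex with an LLM call.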
But all types of retrieval start with metadata. By tagging documents with key details during processing, we narrow the search space before diving in.
The best systems use a mix of approaches: they might keep full documents for context, summaries for quick scanning, and metadata for filtering. They cast a wide net at first, then use neural ranking to zero in on the most relevant results.
The quality of embeddings can make or break a system. General-purpose models often fall short in specialized fields. Testing different embedding models on your specific data pays off - what works for general text might fail for legal documents or technical manuals. Sometimes, fine-tuning a model for your domain is worth the effort.
When building search systems, think modular. Start with pieces that can be swapped out as needs change or better tools emerge. Add metadata processing early - it's harder to add later. Break the retrieval process into steps: first find possible matches quickly, then rank them carefully. For complex documents with tables or images, add tools that can handle different types of content.
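The filter, retrieve, then rank flow described here can be sketched in a few lines. This is a toy under stated assumptions: `Doc`, `cheap_score`, `careful_score`, and `search` are names invented for the example, and the "careful" scorer is a simple Jaccard similarity standing in for a neural reranker.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def cheap_score(query: str, doc: Doc) -> int:
    """Stage 1: fast, rough score (number of shared lowercase words)."""
    return len(set(query.lower().split()) & set(doc.text.lower().split()))


def careful_score(query: str, doc: Doc) -> float:
    """Stage 2 placeholder: a real system would call a neural reranker here."""
    q = set(query.lower().split())
    d = set(doc.text.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0  # Jaccard similarity


def search(query, docs, filters, n_candidates=50, k=3):
    # 1. The metadata filter narrows the search space before any scoring.
    pool = [d for d in docs
            if all(d.metadata.get(key) == val for key, val in filters.items())]
    # 2. Cast a wide net with the cheap scorer.
    candidates = sorted(pool, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # 3. Rerank only the surviving candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: careful_score(query, d), reverse=True)[:k]


docs = [
    Doc("1", "vacation policy for employees", {"type": "policy"}),
    Doc("2", "vacation photos from the offsite", {"type": "blog"}),
    Doc("3", "travel expense policy", {"type": "policy"}),
]
results = search("vacation policy", docs, {"type": "policy"})
print([d.doc_id for d in results])
```

Because each stage is its own function, the cheap scorer or the reranker can be replaced without touching the rest of the pipeline.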
The best systems also check their work. They ask: "Did I actually answer the question?" If not, they try a different approach. But they also know when to stop - endless loops help no one. In the end, RAG isn't just about finding information. It's about finding the right information, in the right way, at the right time.
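A bounded version of this "check your work, but know when to stop" loop might look like the sketch below. The function name, the strategy callables, and the `is_answered` check are all hypothetical; in practice the check would often be an LLM judging whether the retrieved context answers the question.

```python
def answer_with_retries(question, strategies, is_answered, max_attempts=3):
    """Try retrieval strategies in order until the check passes or the budget runs out."""
    attempts = []
    for strategy in strategies[:max_attempts]:
        candidate = strategy(question)
        attempts.append(candidate)
        if is_answered(question, candidate):
            return candidate, attempts
    return None, attempts  # give up instead of looping forever


# Toy strategies: the first finds nothing, the second succeeds.
strategies = [
    lambda q: "",
    lambda q: "42 days",
    lambda q: "never reached",
]
result, attempts = answer_with_retries(
    "How long is the notice period?", strategies, lambda q, a: bool(a)
)
print(result, len(attempts))
```

The hard cap on attempts is the important design choice: the loop ends even when no strategy produces an acceptable answer.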
Stephen Batifol:
X (Twitter), Zilliz, LinkedIn
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Agentic RAG
00:04 Understanding Control Flow in Agentic RAG
00:33 Decision Making with LLMs
01:11 Exploring Agentic RAG with Stephen Batifol
03:35 Comparing RAG and GAR
06:31 Implementing Agentic RAG Workflows
22:36 Filtering with Prefix, Suffix, and Midfix
24:15 Breaking Mechanisms in Workflows
28:00 Evaluating Agentic Workflows
30:31 Multimodal and VLLMs in Document Processing
33:51 Challenges and Innovations in Parsing
34:51 Overrated and Underrated Aspects in LLMs
39:52 Building Effective Search Applications
Rethinking Search Inside Postgres, From Lexemes to BM25 | S2 E17
5 Dec 2024· How AI Is Built
Many companies use Elastic or OpenSearch and use 10% of the capacity.
They have to build ETL pipelines.
Get data normalized.
Worry about race conditions.
All in all, when you want to do search on top of your transactional data today, you are forced to build a distributed system.
Not anymore.
ParadeDB is building an open-source PostgreSQL extension to enable search within your database.
Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.
We talk about how they built it, how it integrates into the Postgres query engine, and how you can build search on top of Postgres.
Key Insights:
Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.
Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.
Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.
Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.
Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.
Philippe Noël:
LinkedIn, Bluesky, ParadeDB
Nicolay Gerold:
LinkedIn, X (Twitter), Bluesky
00:00 Introduction to ParadeDB
00:53 Building ParadeDB with Rust
01:43 Integrating Search in Postgres
03:04 ParadeDB vs. Elastic
05:48 Technical Deep Dive: Postgres Integration
07:27 Challenges and Solutions
09:35 Transactional Safety and Performance
11:06 Composable Data Systems
15:26 Columnar Storage and Analytics
20:54 Case Study: Alibaba Cloud
21:57 Data Warehouse Context
23:24 Custom Indexing with BM25
24:01 Postgres Indexing Overview
24:17 Fast Fields and Columnar Format
24:52 Lucene Inspiration and Data Storage
26:06 Setting Up and Managing Indexes
27:43 Query Building and Complex Searches
30:21 Scaling and Sharding Strategies
35:27 Query Optimization and Common Mistakes
38:39 Future Developments and Integrations
39:24 Building a Full-Fledged Search Application
42:53 Challenges and Advantages of Using ParadeDB
46:43 Final Thoughts and Recommendations
RAG's Biggest Problems & How to Fix It (ft. Synthetic Data) | S2 E16
28 Nov 2024· How AI Is Built
RAG isn't a magic fix for search problems. While it works well at first, most teams find it's not good enough for production out of the box. The key is to make it better step by step, using good testing and smart data creation.
Today, we are talking to Saahil Ognawala from Jina AI to start to understand RAG.
To build a good RAG system, you need three things: ways to test it, methods to create training data, and plans to make it better over time. Testing starts with a set of example searches that users might make. These should include common searches that happen often, medium-rare searches, and rare searches that only happen now and then. This mix helps you measure if changes make your system better or worse.
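One way to assemble such a mixed test set is to bucket logged queries by how much traffic they account for. The sketch below is an assumption-laden illustration: the function name `bucket_queries` and the 50% / 30% traffic shares are arbitrary choices for the example, not numbers from the episode.

```python
from collections import Counter


def bucket_queries(query_log, head_share=0.5, torso_share=0.3):
    """Split distinct queries into head/torso/tail by cumulative traffic share."""
    counts = Counter(q.strip().lower() for q in query_log)
    total = sum(counts.values())
    buckets = {"head": [], "torso": [], "tail": []}
    covered = 0  # traffic already accounted for by more frequent queries
    for query, n in counts.most_common():
        if covered < head_share * total:
            buckets["head"].append(query)
        elif covered < (head_share + torso_share) * total:
            buckets["torso"].append(query)
        else:
            buckets["tail"].append(query)
        covered += n
    return buckets


# Toy query log: one very common query, one medium, one rare.
log = ["shoes"] * 6 + ["boots"] * 3 + ["sandals"]
buckets = bucket_queries(log)
print(buckets)
```

Sampling test queries from all three buckets keeps an evaluation from being dominated by the most common searches.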
Creating synthetic data helps make the system stronger, especially in spotting wrong answers that look right. Think of someone searching for a "gluten-free chocolate cake." A "sugar-free chocolate cake" might look like a good answer because it shares many words, but it's wrong.
These tricky examples help the system learn the difference between similar but different things.
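The cake example can be turned into a small hard-negative miner: among the candidates known to be wrong, keep the ones that look most like the query. This is a simplified sketch, assuming word overlap as the similarity signal; the function names are invented here, and real pipelines often use an embedding model or an LLM to generate or select hard negatives.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def hard_negatives(query, candidates, relevant, top_n=1):
    """Return the non-relevant candidates that look most similar to the query."""
    negatives = [c for c in candidates if c not in relevant]
    return sorted(negatives, key=lambda c: token_overlap(query, c), reverse=True)[:top_n]


query = "gluten-free chocolate cake"
candidates = [
    "gluten-free chocolate cake recipe",
    "sugar-free chocolate cake",
    "grilled salmon with rice",
]
relevant = {"gluten-free chocolate cake recipe"}
mined = hard_negatives(query, candidates, relevant)
print(mined)
```

The salmon dish is an easy negative and teaches the model little; the sugar-free cake is the one worth training on.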
When creating synthetic data, you need rules. The best way is to show the AI a few real examples and give it a list of topics to work with. Most teams find that using half real data and half synthetic data works best. This gives you enough variety while keeping things real.
Getting user feedback is hard with RAG. In normal search, you can see if users click on results. But with RAG, the system creates an answer from many pieces. A good answer might come from both good and bad pieces, making it hard to know which parts helped. This means you need smart ways to track which pieces of information actually helped make good answers.
One key rule: don't make things harder than they need to be. If simple keyword search (called BM25) works well enough, adding fancy AI search might not be worth the extra work.
Success with RAG comes from good testing, careful data creation, and steady improvements based on real use. It's not about using the newest AI models. It's about building good systems and processes that work reliably.
"It isn’t a magic wand you can place on your catalog and expect results you didn’t get before."

“Most of our users are enterprise users who have seen the most success in their RAG systems are the ones that very early implemented a continuous feedback mechanism.”
“If you can't tell in real time usage whether an answer is a bad answer or a right answer because the LLM just makes it look like the right answer then you only have your retrieval dataset to blame”
Saahil Ognawala:
LinkedIn, Jina AI
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Retrieval Augmented Generation (RAG)
00:29 Interview with Saahil Ognawala
00:52 Synthetic Data in Language Generation
01:14 Understanding the E5 Mistral Instructor Embeddings Paper
03:15 Challenges and Evolution in Synthetic Data
05:03 User Intent and Retrieval Systems
11:26 Evaluating RAG Systems
14:46 Setting Up Evaluation Frameworks
20:37 Fine-Tuning and Embedding Models
22:25 Negative and Positive Examples in Retrieval
26:10 Synthetic Data for Hard Negatives
29:20 Case Study: Marine Biology Project
29:54 Addressing Errors in Marine Biology Queries
31:28 Ensuring Query Relevance with Human Intervention
31:47 Few Shot Prompting vs Zero Shot Prompting
35:09 Balancing Synthetic and Real World Data
37:17 Improving RAG Systems with User Feedback
39:15 Future Directions for Jina and Synthetic Data
40:44 Building and Evaluating Embedding Models
41:24 Getting Started with Jina and Open Source Tools
51:25 The Importance of Hard Negatives in Embedding Models
From Ambiguous to AI-Ready: Improving Documentation Quality for RAG Systems | S2 E15
21 Nov 2024· How AI Is Built
Documentation quality is the silent killer of RAG systems. A single ambiguous sentence might corrupt an entire set of responses. But the hardest part isn't fixing errors - it's finding them.
Today we are talking to Max Buckley on how to find and fix these errors.
Max works at Google and has built a lot of interesting experiments with LLMs on using them to improve knowledge bases for generation.
We talk about identifying ambiguities, fixing errors, creating improvement loops in the documents and a lot more.
Some Insights:
A single ambiguous sentence can systematically corrupt an entire knowledge base's responses. Fixing these "documentation poisons" often requires minimal changes but identifying them is challenging.
Large organizations develop their own linguistic ecosystems that evolve over time. This creates unique challenges for both embedding models and retrieval systems that need to bridge external and internal vocabularies.
Multiple feedback loops are crucial - expert testing, user feedback, and system monitoring each catch different types of issues.
Max Buckley: (All opinions are his own and not of Google)
LinkedIn
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Understanding LLM Hallucinations
00:02 Challenges with Temporal Inconsistencies
00:43 Issues with Document Structure and Terminology
01:05 Introduction to Retrieval Augmented Generation (RAG)
01:49 Interview with Max Buckley
02:27 Anthropic's Approach to Document Chunking
02:55 Contextualizing Chunks for Better Retrieval
06:29 Challenges in Chunking and Search
07:35 LLMs in Internal Knowledge Management
08:45 Identifying and Fixing Documentation Errors
10:58 Using LLMs for Error Detection
15:35 Improving Documentation with User Feedback
24:42 Running Processes on Retrieved Context
25:19 Challenges of Terminology Consistency
26:07 Handling Definitions and Glossaries
30:10 Addressing Context Misinterpretation
31:13 Improving Documentation Quality
36:00 Future of AI and Search Technologies
42:29 Ensuring Documentation Readiness for AI
BM25 is the workhorse of search; vectors are its visionary cousin | S2 E14
15 Nov 2024· How AI Is Built
Ever wondered why vector search isn't always the best path for information retrieval?
Join us as we dive deep into BM25 and its unmatched efficiency in our latest podcast episode with David Tippett from GitHub.
Discover how BM25 transforms search efficiency, even at GitHub's immense scale.
BM25, short for Best Match 25, uses term frequency (TF) and inverse document frequency (IDF) to score document-query matches. It addresses limitations of plain TF-IDF by adding term-frequency saturation and document length normalization.
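For readers who want to see the scoring in code, here is a compact sketch of Okapi BM25. It is not GitHub's implementation: the function name `bm25_scores` and the whitespace tokenization are assumptions, `k1=1.2` and `b=0.75` are commonly used defaults, and the IDF uses the smoothed, always-positive variant found in Lucene.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating TF, scaled by document length relative to the average.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["search search search search", "search engine", "vector database"]
scores = bm25_scores("search", docs)
print(scores)
```

The first document repeats the query term four times but scores well under four times the second one: that is the saturation effect, and `b` controls how strongly longer documents are penalized.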
Search Is About User Expectations
Search isn't just about relevance but aligning with what users expect: GitHub users, for example, have diverse use cases, such as finding security vulnerabilities, exploring codebases, or managing repositories. Each requires a different prioritization of fields, boosting strategies, and possibly even distinct search workflows.
Key Insight: Search is deeply contextual and use-case driven. Understanding your users' intent and tailoring search behavior to their expectations matters more than chasing state-of-the-art technology.
The Challenge of Vector Search at Scale
Vector search systems require in-memory storage of vectorized data, making them costly for datasets with billions of documents (e.g., GitHub’s 100 billion documents).
IVF and HNSW offer trade-offs:
  IVF: Reduces memory requirements by bucketing vectors but risks losing relevance due to bucket misclassification.
  HNSW: Offers high relevance but demands high memory, making it impractical for massive datasets.
Architectural Insight: When considering vector search, focus on niche applications or subdomains with manageable dataset sizes or use hybrid approaches combining BM25 with sparse/dense vectors.
Vector Search vs. BM25: A Trade-off of Precision vs. Cost
Vector search is more precise and effective for semantic similarity, but its operational costs and memory requirements make it prohibitive for massive datasets like GitHub’s corpus of over 100 billion documents.
BM25’s scaling challenges (e.g., reliance on disk IOPS) are manageable compared to the memory-bound nature of vector indexes such as HNSW and IVF.
Key Insight: BM25’s scalability allows for broader adoption, while vector search is still a niche solution requiring high specialization and infrastructure.
David Tippett:
LinkedIn, Podcast (For the Sake of Search), X (Twitter), Tippybits.com, Bluesky
Nicolay Gerold:
LinkedIn, X (Twitter), Bluesky
00:00 Introduction to RAG and Vector Search Challenges
00:28 Introducing BM25: The Efficient Search Solution
00:43 Guest Introduction: David Tippett
01:16 Comparing Search Engines: Vespa, Weaviate, and More
07:53 Understanding BM25 and Its Importance
09:10 Deep Dive into BM25 Mechanics
23:46 Field-Based Scoring and BM25F
25:49 Introduction to Zero Shot Retrieval
26:03 Vector Search vs BM25
26:22 Combining Search Techniques
26:56 Favorite BM25 Adaptations
27:38 Postgres Search and Term Proximity
31:49 Challenges in GitHub Search
33:59 BM25 in Large Scale Systems
40:00 Technical Deep Dive into BM25
45:30 Future of Search and Learning to Rank
47:18 Conclusion and Future Plans
Vector Search at Scale: Why One Size Doesn't Fit All | S2 E13
7 Nov 2024· How AI Is Built
Ever wondered why your vector search becomes painfully slow after scaling past a million vectors? You're not alone - even tech giants struggle with this.
Charles Xie, founder of Zilliz (company behind Milvus), shares how they solved vector database scaling challenges at 100B+ vector scale:
Key Insights:
Multi-tier storage strategy:
  GPU memory (1% of data, fastest)
  RAM (10% of data)
  Local SSD
  Object storage (slowest but cheapest)
Real-time search solution:
  New data goes to buffer (searchable immediately)
  Index builds in background when buffer fills
  Combines buffer & main index results
Performance optimization:
  GPU acceleration for 10k-50k queries/second
  Customizable trade-offs between cost, latency, and search relevance
Future developments:
  Self-learning indices
  Hybrid search methods (dense + sparse)
  Graph embedding support
  ColBERT integration
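The real-time search idea (buffer plus main index, merged at query time) can be sketched as a toy class. This is not how Milvus is implemented: `BufferedIndex` is a made-up name, the "main index" is a plain dictionary scanned by brute force where a real system would hold an ANN index, and the flush runs synchronously instead of in the background.

```python
import heapq


def dot(a, b):
    """Dot product as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


class BufferedIndex:
    """Toy index: fresh vectors sit in a buffer and are searchable immediately."""

    def __init__(self, buffer_limit=2):
        self.main = {}    # stands in for a built ANN index (e.g. HNSW or IVF)
        self.buffer = {}  # recent writes, scanned by brute force
        self.buffer_limit = buffer_limit

    def insert(self, doc_id, vector):
        self.buffer[doc_id] = vector
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # A real system would build the index in the background;
        # here it is a synchronous merge.
        self.main.update(self.buffer)
        self.buffer.clear()

    def search(self, query, k=3):
        # Query both tiers and merge the hits into one ranked list.
        hits = [(dot(query, v), doc_id) for doc_id, v in self.main.items()]
        hits += [(dot(query, v), doc_id) for doc_id, v in self.buffer.items()]
        return [doc_id for _, doc_id in heapq.nlargest(k, hits)]


index = BufferedIndex(buffer_limit=2)
index.insert("a", [1.0, 0.0])
index.insert("b", [0.0, 1.0])   # buffer is full, so it is flushed into main
index.insert("c", [0.9, 0.1])   # stays in the buffer for now
top = index.search([1.0, 0.0], k=2)
print(top)
```

Document "c" has not been merged into the main index yet, but it still shows up in the results because every query scans the buffer as well.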
Perfect for teams hitting scaling walls with their current vector search implementation or planning for future growth.
Worth watching if you're building production search systems or need to optimize costs vs performance.
Charles Xie:
LinkedIn, Zilliz, Milvus, Milvus Discord
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Search System Challenges
00:26 Introducing Milvus: The Open Source Vector Database
00:58 Interview with Charles: Founder of Zilliz
02:20 Scalability and Performance in Vector Databases
03:35 Challenges in Distributed Systems
05:46 Data Consistency and Real-Time Search
12:12 Hierarchical Storage and GPU Acceleration
18:34 Emerging Technologies in Vector Search
23:21 Self-Learning Indexes and Future Innovations
28:44 Key Takeaways and Conclusion
Search Systems at Scale: Avoiding Local Maxima and Other Engineering Lessons | S2 E12
31 Oct 2024· How AI Is Built
Modern search systems face a complex balancing act between performance, relevancy, and cost, requiring careful architectural decisions at each layer.
While vector search generates buzz, hybrid approaches combining traditional text search with vector capabilities yield better results.
The architecture typically splits into three core components:
ingestion/indexing (requiring decisions between batch vs streaming)
query processing (balancing understanding vs performance)
analytics/feedback loops for continuous improvement
Critical but often overlooked aspects include query understanding depth, systematic relevancy testing (avoid anecdote-driven development), and data governance as search systems naturally evolve into organizational data hubs.
Performance optimization requires careful tradeoffs between index-time vs query-time computation, with even 1-2% improvements being significant in mature systems.
Success requires testing against production data (staging environments prove unreliable), implementing proper evaluation infrastructure (golden query sets, A/B testing, interleaving), and avoiding the local maxima trap where improving one query set unknowingly damages others.
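A simple guard against the local maxima trap is to compare a candidate ranking change per query set, not only in aggregate. The sketch below assumes you already have a per-query relevance metric (for example NDCG@10) for a baseline and a candidate; the function name, the segment names, and the tolerance are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def compare_by_segment(baseline, candidate, tolerance=0.01):
    """Compare per-query metrics (e.g. NDCG@10) segment by segment.

    Both arguments map segment name -> {query: metric value}.
    Returns the overall delta and the segments that got worse.
    """
    regressions = {}
    all_base, all_cand = [], []
    for segment, base_scores in baseline.items():
        cand_scores = candidate[segment]
        delta = mean(list(cand_scores.values())) - mean(list(base_scores.values()))
        if delta < -tolerance:
            regressions[segment] = round(delta, 3)
        all_base += base_scores.values()
        all_cand += cand_scores.values()
    return round(mean(all_cand) - mean(all_base), 3), regressions


# Made-up metric values for two golden query sets.
baseline = {
    "navigational": {"q1": 0.8, "q2": 0.9},
    "long_tail": {"q3": 0.4, "q4": 0.5},
}
candidate = {
    "navigational": {"q1": 0.6, "q2": 0.7},
    "long_tail": {"q3": 0.8, "q4": 0.9},
}
overall, regressions = compare_by_segment(baseline, candidate)
print(overall, regressions)
```

In this toy data the overall number improves while the navigational set gets clearly worse, which is the situation an aggregate-only report would hide.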
The end goal is finding an acceptable balance between corpus size, latency requirements, and cost constraints while maintaining system manageability and relevance quality.
"It's quite easy to end up in local maxima, whereby you improve a query for one set and then you end up destroying it for another set."
"A good marker of a sophisticated system is one where you actually see it's getting worse... you might be discovering a maxima."
"There's no free lunch in all of this. Often it's a case that, to service billions of documents on a vector search, less than 10 millis, you can do those kinds of things. They're just incredibly expensive. It's really about trying to manage all of the overall system to find what is an acceptable balance."
Search Pioneers:
Website, GitHub
Stuart Cam:
LinkedIn
Russ Cam:
GitHub, LinkedIn, X (Twitter)
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Search Systems
00:13 Challenges in Search: Relevancy vs Latency
00:27 Insights from Industry Experts
01:00 Evolution of Search Technologies
03:16 Storage and Compute in Search Systems
06:22 Common Mistakes in Building Search Systems
09:10 Evaluating and Improving Search Systems
19:27 Architectural Components of Search Systems
29:17 Understanding Search Query Expectations
29:39 Balancing Speed, Cost, and Corpus Size
32:03 Trade-offs in Search System Design
32:53 Indexing vs Querying: Key Considerations
35:28 Re-ranking and Personalization Challenges
38:11 Evaluating Search System Performance
44:51 Overrated vs Underrated Search Techniques
48:31 Final Thoughts and Contact Information
Training Multi-Modal AI: Inside the Jina CLIP Embedding Model | S2 E11
25 Oct 2024· How AI Is Built
Today we are talking to Michael Günther, a senior machine learning scientist at Jina AI, about his work on Jina CLIP.
Some key points:
Uni-modal embeddings convert a single type of input (text, images, audio) into vectors
Multimodal embeddings learn a joint embedding space that can handle multiple types of input, enabling cross-modal search (e.g., searching images with text)
Multimodal models can potentially learn richer representations of the world, including concepts that are difficult or impossible to put into words
Types of Text-Image Models
CLIP-like Models
  Separate vision and text transformer models
  Each tower maps inputs to a shared vector space
  Optimized for efficient retrieval
Vision-Language Models
  Process image patches as tokens
  Use transformer architecture to combine image and text information
  Better suited for complex document matching
Hybrid Models
  Combine separate encoders with additional transformer components
  Allow for more complex interactions between modalities
  Example: Google's Magic Lens model
Training Insights from Jina CLIP
Key Learnings
  Freezing the text encoder during training can significantly hinder performance
  Short image captions limit the model's ability to learn rich text representations
  Large batch sizes are crucial for training embedding models effectively
Training Process
  Three-stage training approach:
    Stage 1: Training on image captions and text pairs
    Stage 2: Adding longer image captions
    Stage 3: Including triplet data with hard negatives
Practical Considerations
Similarity Scales
  Different modalities can produce different similarity value scales
  Important to consider when combining multiple embedding types
  Can affect threshold-based filtering
Model Selection
  Evaluate models based on relevant benchmarks
  Consider the domain similarity between training data and intended use case
  Assessment of computational requirements and efficiency needs
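To illustrate the similarity-scale point: if text scores spread over a wide range while image scores cluster tightly, a plain sum or a shared threshold is dominated by the text side. One common remedy is to normalize each score list before blending. The sketch below uses z-score normalization as an assumed technique (it is not described as Jina's method), and the function names and sample scores are invented.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Rescale scores to zero mean and unit variance within one result list."""
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    if sigma == 0:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - mu) / sigma for doc, s in scores.items()}


def combine(text_scores, image_scores, weight=0.5):
    """Blend two score sets after normalization.

    A document missing from one list gets 0.0 there, i.e. it is
    treated as average for that modality (a simplifying assumption).
    """
    t, i = z_normalize(text_scores), z_normalize(image_scores)
    docs = set(t) | set(i)
    return {d: weight * t.get(d, 0.0) + (1 - weight) * i.get(d, 0.0) for d in docs}


# Made-up cosine similarities on very different scales.
text_scores = {"a": 0.9, "b": 0.7, "c": 0.5}
image_scores = {"a": 0.30, "b": 0.32, "c": 0.28}
combined = combine(text_scores, image_scores)
print(sorted(combined, key=combined.get, reverse=True))
```

On the raw values, document "a" would win easily because the text scores vary ten times more than the image scores. After normalization, "a" (best text match) and "b" (best image match) end up tied.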
Future Directions
Areas for Development
  More comprehensive benchmarks for multimodal tasks
  Better support for semi-structured data
  Improved handling of non-photographic images
Upcoming Developments at Jina AI
  Multilingual support for Jina ColBERT
  New version of text embedding models
  Focus on complex multimodal search applications
Practical Applications
E-commerce
  Product search and recommendations
  Combined text-image embeddings for better results
  Synthetic data generation for fine-tuning
Fine-tuning Strategies
  Using click data and query logs
  Generative pseudo-labeling for creating training data
  Domain-specific adaptations
Key Takeaways for Engineers
Be aware of similarity value scales and their implications
Establish quantitative evaluation metrics before optimization
Consider model limitations (e.g., image resolution, text length)
Use performance optimizations like flash attention and activation checkpointing
Universal embedding models might not be optimal for specific use cases
Michael Guenther:
LinkedIn, X (Twitter), Jina AI, New Multilingual Embedding Model
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Uni-modal and Multimodal Embeddings
00:16 Exploring Multimodal Embeddings and Their Applications
01:06 Training Multimodal Embedding Models
02:21 Challenges and Solutions in Embedding Models
07:29 Advanced Techniques and Future Directions
29:19 Understanding Model Interference in Search Specialization
30:17 Fine-Tuning Jina CLIP for E-Commerce
32:18 Synthetic Data Generation and Pseudo-Labeling
33:36 Challenges and Learnings in Embedding Models
40:52 Future Directions and Takeaways
Building the database for AI, Multi-modal AI, Multi-modal Storage | S2 E10
23 Oct 2024· How AI Is Built
Imagine a world where data bottlenecks, slow data loaders, or memory issues on the VM don't hold back machine learning.
Machine learning and AI success depends on the speed at which you can iterate. LanceDB is here to enable fast experiments on top of terabytes of unstructured data. It is the database for AI. Dive with us into how LanceDB was built, what went into the decision to use Rust as the main implementation language, the potential of AI on top of LanceDB, and more.
"LanceDB is the database for AI...to manage their data, to do a performant billion scale vector search."
“We're big believers in the composable data systems vision."
"You can insert data into LanceDB using pandas DataFrames...to sort of really large 'embed the internet' kind of workflows."
"We wanted to create a new generation of data infrastructure that makes their [AI engineers] lives a lot easier."
"LanceDB offers up to 1,000 times faster performance than Parquet."

Chang She:
LinkedIn, X (Twitter)
LanceDB:
X (Twitter), GitHub, Web, Discord, VectorDB Recipes
Nicolay Gerold:
LinkedIn, X (Twitter)
00:00 Introduction to Multimodal Embeddings
00:26 Challenges in Storage and Serving
02:51 LanceDB: The Solution for Multimodal Data
04:25 Interview with Chang She: Origins and Vision
10:37 Technical Deep Dive: LanceDB and Rust
18:11 Innovations in Data Storage Formats
19:00 Optimizing Performance in Lakehouse Ecosystems
21:22 Future Use Cases for LanceDB
26:04 Building Effective Recommendation Systems
32:10 Exciting Applications and Future Directions