Understanding Vector Search and LLMs: From Embeddings to RAG Systems

Sushant Gaurav
8 min read · Dec 29, 2024

The advent of embeddings and semantic search has transformed how we approach information retrieval. Unlike traditional keyword-based search systems, modern approaches understand the meaning behind words, enabling more intuitive and powerful search capabilities.

If you’re new to the field of Machine Learning and Natural Language Processing, I previously wrote an article on The Evolution of Machine Learning and Natural Language Processing to Transformers which can provide a great starting point.

Embeddings: The Foundation

Embeddings are the mathematical backbone of modern natural language processing and information retrieval systems. They transform words, sentences, or any piece of information into dense numerical vectors in a high-dimensional space (typically 256 to 1024 dimensions). What makes embeddings revolutionary is their ability to capture semantic meaning in these vectors:

Semantic Meaning: Similar concepts end up close to each other in the vector space. For example, “dog” and “puppy” will have similar vector representations.

Compositional Properties: Embeddings can be combined and manipulated mathematically. The classic example is “King − Man + Woman ≈ Queen”, showing how embeddings capture relationships between concepts.

Dimensionality: While one-hot encoding requires vectors as large as the vocabulary (often hundreds of thousands of dimensions), embeddings compress this information into much smaller, dense vectors.

Contextual Understanding: Modern embedding systems consider the context in which words appear, allowing for nuanced representations where the same word can have different embeddings based on its usage.
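To see the contextual point in action, here is a small illustrative sketch using a BERT model from the Hugging Face transformers library (the model choice and sentences are just examples, not part of any particular system). The same surface word "bank" receives noticeably different vectors in different contexts:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def token_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    return hidden[tokens.index(word)]

v_river = token_embedding("he sat on the river bank", "bank")
v_money = token_embedding("she deposited cash at the bank", "bank")

similarity = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print(f"Similarity of 'bank' across contexts: {similarity.item():.2f}")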

Let’s visualize how words get mapped to vectors:

from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Dogs are playful pets",
    "The weather is sunny today"
]

# Generate embeddings
embeddings = model.encode(sentences)

# Calculate cosine similarity between the first two sentences
def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity between similar sentences: {similarity:.4f}")
print(f"Embedding dimension: {embeddings.shape[1]}")

Vector Databases and Efficient Search

Vector databases are specialized systems designed to store and query high-dimensional vectors efficiently. They solve several critical challenges:

Scale Challenge: Traditional databases can’t efficiently handle similarity searches in high-dimensional spaces. Vector databases use specialized indexing structures.

Approximate Nearest Neighbor (ANN) Search: Instead of examining every vector (which would be too slow), vector databases use approximation algorithms that trade perfect accuracy for dramatic speed improvements.

Indexing Structures:

  • HNSW (Hierarchical Navigable Small World): Creates a layered graph structure for fast navigation
  • IVF (Inverted File Index): Partitions vectors into clusters for coarse-to-fine search
  • PQ (Product Quantization): Compresses vectors while maintaining search capability

Performance Optimization: Modern vector databases can search millions of vectors in milliseconds using these techniques.
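To make the speed-versus-accuracy trade-off concrete, here is a small sketch comparing exact search against an HNSW index using the FAISS library. The data is synthetic and the index parameters are illustrative defaults, not tuned values:

import numpy as np
import faiss

# Synthetic database and query vectors (384 dimensions, matching a typical embedding model)
d = 384
np.random.seed(42)
xb = np.random.rand(50_000, d).astype('float32')
xq = np.random.rand(5, d).astype('float32')
k = 5

# Exact (brute-force) baseline: compares every query against every stored vector
flat_index = faiss.IndexFlatL2(d)
flat_index.add(xb)
_, exact_ids = flat_index.search(xq, k)

# Approximate HNSW index: 32 is an illustrative graph-connectivity parameter
hnsw_index = faiss.IndexHNSWFlat(d, 32)
hnsw_index.add(xb)
_, approx_ids = hnsw_index.search(xq, k)

# Recall@k: fraction of the true nearest neighbours that the approximate index found
recall = np.mean([len(set(e) & set(a)) / k for e, a in zip(exact_ids, approx_ids)])
print(f"Recall@{k} of HNSW vs. exact search: {recall:.2f}")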

Underlying all of these index structures is a similarity metric that defines what counts as a near neighbour.

Cosine Similarity

Cosine similarity is the backbone of vector similarity search. It measures the cosine of the angle between two vectors, providing a similarity score between -1 and 1.

Here’s a visual and mathematical explanation:

import React from 'react';
import { LineChart, Line, XAxis, YAxis, CartesianGrid, ReferenceLine } from 'recharts';

const CosineSimilarityDemo = () => {
  // Two example vectors to compare
  const vector1 = [3, 4];
  const vector2 = [4, 3];

  // Each vector is drawn as a line from the origin to its endpoint
  const points = [
    { x: 0, y: 0 },
    { x: vector1[0], y: vector1[1] },
    { x: 0, y: 0 },
    { x: vector2[0], y: vector2[1] }
  ];

  return (
    <div className="w-full max-w-2xl p-4">
      <div className="mb-4">
        <h3 className="text-lg font-bold mb-2">Cosine Similarity Visualization</h3>
        <LineChart width={400} height={400} margin={{ top: 20, right: 20, bottom: 20, left: 20 }}>
          <CartesianGrid />
          <XAxis dataKey="x" type="number" domain={[-1, 5]} />
          <YAxis type="number" domain={[-1, 5]} />
          <ReferenceLine x={0} stroke="#666" />
          <ReferenceLine y={0} stroke="#666" />
          <Line
            data={[points[0], points[1]]}
            type="linear"
            dataKey="y"
            stroke="#8884d8"
            dot={true}
          />
          <Line
            data={[points[2], points[3]]}
            type="linear"
            dataKey="y"
            stroke="#82ca9d"
            dot={true}
          />
        </LineChart>
      </div>
      <div className="text-sm">
        Vector 1 (blue): [3, 4]<br/>
        Vector 2 (green): [4, 3]<br/>
        Cosine Similarity: 0.96
      </div>
    </div>
  );
};

export default CosineSimilarityDemo;
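Plugging the two vectors from the chart into the formula confirms the 0.96 figure:

import numpy as np

v1 = np.array([3, 4])
v2 = np.array([4, 3])

# cos(theta) = (v1 . v2) / (||v1|| * ||v2||)
dot_product = np.dot(v1, v2)                             # 3*4 + 4*3 = 24
norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)   # 5 * 5 = 25
print(dot_product / norm_product)                        # 0.96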

Word2Vec: The Building Blocks

Word2Vec, introduced in 2013, revolutionized how we represent words numerically. Its key innovations include:

Training Paradigms:

  • CBOW (Continuous Bag of Words): Predicts a word from its context
  • Skip-gram: Predicts context words from a target word

Negative Sampling: Makes training feasible by updating only a small subset of weights at each step

Context Windows: Captures local relationships between words by considering nearby words during training

Emergent Properties: Without explicit programming, Word2Vec learns analogies and relationships between words

Legacy: While newer models have superseded Word2Vec, its principles form the foundation of modern embedding systems

Here’s a practical implementation showing how Word2Vec captures semantic relationships:

from gensim.models import Word2Vec

# Example training data (a toy corpus; real Word2Vec training uses millions of sentences)
sentences = [
    ['king', 'queen', 'palace', 'royal'],
    ['man', 'woman', 'child', 'family'],
    ['computer', 'keyboard', 'mouse', 'screen']
]

# Train Word2Vec model (sg=0, the default, selects CBOW; sg=1 would use skip-gram)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, workers=4)

def analyze_word_relationships(word1, word2, word3):
    """Find the word that completes the analogy: word1 is to word2 as word3 is to ???"""
    try:
        result = model.wv.most_similar(
            positive=[word2, word3],
            negative=[word1],
            topn=1
        )
        return result[0]
    except KeyError:
        # One of the words is not in the (tiny) vocabulary
        return None

# Example analogies (with such a small toy corpus the results are illustrative only;
# meaningful analogies require a model trained on a large corpus)
analogies = [
    ('man', 'woman', 'king'),          # Expected: queen
    ('computer', 'keyboard', 'phone')  # 'phone' is not in the toy vocabulary, so this returns None
]

for w1, w2, w3 in analogies:
    result = analyze_word_relationships(w1, w2, w3)
    if result:
        print(f"{w1} : {w2} :: {w3} : {result[0]} (Score: {result[1]:.2f})")

Modern Language Models (LLMs)

Large Language Models (LLMs) represent the current pinnacle of natural language processing. Models such as GPT and BERT use deep learning to understand and generate human-like text. They are trained on massive datasets and fine-tuned for specific tasks like translation, summarization, and semantic search.

Architecture

  • Built on the Transformer architecture with self-attention mechanisms (a minimal sketch of the attention operation follows this list)
  • Multiple layers of neural networks processing information in parallel
  • Massive parameter counts (often billions to hundreds of billions) enable complex pattern recognition
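To make the self-attention idea concrete, here is a deliberately minimal NumPy sketch of single-head scaled dot-product attention. Real LLMs add learned projection matrices, many attention heads, masking, and positional information, none of which are shown here:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a similarity-weighted mixture of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarities
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

# Toy example: 3 tokens with 4-dimensional representations; self-attention uses Q = K = V
np.random.seed(0)
X = np.random.rand(3, 4)
output = scaled_dot_product_attention(X, X, X)
print(output.shape)  # (3, 4): each token is now a context-aware mixture of all tokens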

Capabilities

  • Context understanding across long sequences
  • Zero-shot and few-shot learning
  • Generation of human-like text
  • Multi-task ability without specific training

Training Process

  • Pre-training on vast amounts of text data
  • Fine-tuning for specific tasks
  • Instruction tuning for better alignment with human intent

Limitations

  • High computational requirements
  • Potential for hallucinations
  • Context window constraints


Retrieval-Augmented Generation (RAG)

RAG combines the power of retrieval systems with generative AI: documents relevant to a query are retrieved from a vector index and passed to an LLM as context for its answer.

Let’s implement a basic RAG system:

from sentence_transformers import SentenceTransformer
from typing import List, Dict
import numpy as np
import faiss

class SimpleRAG:
    def __init__(self, embedding_model_name: str = 'all-MiniLM-L6-v2'):
        self.embedding_model = SentenceTransformer(embedding_model_name)
        self.document_store: List[str] = []
        self.index = None

    def add_documents(self, documents: List[str], chunk_size: int = 512):
        """Process and index documents"""
        # Simple chunking strategy: fixed-size character windows
        chunks = []
        for doc in documents:
            doc_chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
            chunks.extend(doc_chunks)

        self.document_store = chunks

        # Generate embeddings
        embeddings = self.embedding_model.encode(chunks)

        # Create FAISS index
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(np.array(embeddings).astype('float32'))

    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve relevant documents for a query"""
        # Generate query embedding
        query_embedding = self.embedding_model.encode([query])[0]

        # Search in FAISS (L2 distance: a lower score means a closer match)
        distances, indices = self.index.search(
            np.array([query_embedding]).astype('float32'), k
        )

        # Return relevant documents with scores
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            results.append({
                'content': self.document_store[idx],
                'score': float(dist)
            })

        return results

    def generate_response(self, query: str, retrieved_docs: List[Dict]) -> str:
        """Generate a response using the retrieved documents (simplified)"""
        # In a real implementation, this prompt would be sent to an LLM API
        context = "\n".join([doc['content'] for doc in retrieved_docs])
        prompt = f"Context:\n{context}\n\nQuery: {query}\nResponse:"
        return f"Generated response based on {len(retrieved_docs)} retrieved documents"

# Example usage
documents = [
    "RAG systems combine retrieval with generation.",
    "Vector databases store embeddings efficiently.",
    "LLMs process text using attention mechanisms."
]

rag = SimpleRAG()
rag.add_documents(documents)
results = rag.retrieve("How do RAG systems work?")
response = rag.generate_response("How do RAG systems work?", results)
print(response)
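In a real deployment, generate_response would hand the assembled prompt to an actual LLM. Below is a rough sketch of what that step could look like with the OpenAI Python SDK; the model name is illustrative, and any chat-completion-style API or local model would slot in the same way:

from typing import List, Dict
from openai import OpenAI

def generate_with_llm(query: str, retrieved_docs: List[Dict]) -> str:
    """Send the retrieved context plus the user query to a chat model."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    context = "\n".join(doc['content'] for doc in retrieved_docs)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuery: {query}"},
        ],
    )
    return completion.choices[0].message.content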

Performance and Scalability

Modern semantic search systems must handle massive scales efficiently:

System Design Considerations:

  • Distributed architecture for handling large datasets
  • Caching strategies for frequent queries (sketched after this list)
  • Load balancing across multiple servers
  • Sharding strategies for vector databases
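As a small illustration of the caching idea mentioned above, frequent queries can be memoized so the embedding and index lookup only happen once per distinct query. This sketch wraps the SimpleRAG instance from the earlier example and is purely illustrative:

from functools import lru_cache

# `rag` is the SimpleRAG instance built in the earlier example
@lru_cache(maxsize=10_000)
def cached_retrieve(query: str) -> tuple:
    """Cache retrieval results keyed on the raw query string."""
    results = rag.retrieve(query, k=3)
    # lru_cache needs hashable return values, so freeze the result list
    return tuple((r['content'], r['score']) for r in results)

print(cached_retrieve("How do RAG systems work?"))  # computed once
print(cached_retrieve("How do RAG systems work?"))  # served from the cache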

Optimization Techniques:

  • Vector compression through quantization (sketched after this list)
  • Batch processing for efficiency
  • Asynchronous updates and lazy loading
  • Caching of popular query results
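As a rough sketch of the quantization idea, FAISS can combine coarse clustering (IVF) with product quantization (PQ) so that each vector is stored in a few dozen bytes instead of thousands. The parameters below (clusters, sub-quantizers, bits) are illustrative and would need tuning on real data:

import numpy as np
import faiss

d = 384
np.random.seed(0)
xb = np.random.rand(50_000, d).astype('float32')

# IVF + PQ: 100 coarse clusters, each vector compressed into 48 sub-vectors of 8 bits
# (48 bytes per vector instead of 384 * 4 bytes of raw float32)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 100, 48, 8)

index.train(xb)    # learn the cluster centroids and PQ codebooks
index.add(xb)
index.nprobe = 10  # number of clusters to visit per query (speed/recall knob)

distances, ids = index.search(xb[:5], 5)
print(ids.shape)   # (5, 5): approximate nearest neighbours of the first 5 vectors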

Performance Metrics:

  • Query latency (typically milliseconds)
  • Throughput (queries per second)
  • Recall accuracy vs. speed tradeoffs
  • Resource utilization and cost efficiency

Let’s visualize how search latency compares across index types (the figures below are illustrative orders of magnitude rather than measured benchmarks):

import React from 'react';
import { BarChart, Bar, XAxis, YAxis, CartesianGrid, Tooltip, Legend } from 'recharts';

const SearchPerformanceComparison = () => {
  // Illustrative latency figures (ms), not benchmark results
  const data = [
    { name: 'Brute Force', '1M Vectors': 1000, '10M Vectors': 10000, '100M Vectors': 100000 },
    { name: 'HNSW', '1M Vectors': 10, '10M Vectors': 15, '100M Vectors': 25 },
    { name: 'IVF', '1M Vectors': 20, '10M Vectors': 30, '100M Vectors': 45 },
    { name: 'PQ', '1M Vectors': 15, '10M Vectors': 25, '100M Vectors': 40 },
  ];

  return (
    <div className="w-full max-w-2xl p-4">
      <h3 className="text-lg font-bold mb-4">Search Latency Comparison (ms)</h3>
      <BarChart width={500} height={300} data={data}>
        <CartesianGrid strokeDasharray="3 3" />
        <XAxis dataKey="name" />
        {/* Log scale so the brute-force bars don't dwarf the ANN indexes */}
        <YAxis scale="log" domain={[1, 100000]} />
        <Tooltip />
        <Legend />
        <Bar dataKey="1M Vectors" fill="#8884d8" />
        <Bar dataKey="10M Vectors" fill="#82ca9d" />
        <Bar dataKey="100M Vectors" fill="#ffc658" />
      </BarChart>
    </div>
  );
};

export default SearchPerformanceComparison;

Conclusion

The integration of embeddings, vector databases, and large language models has transformed how we interact with and process information. This ecosystem enables:

  1. Semantic Understanding: Systems now grasp meaning beyond simple keyword matching
  2. Scalable Solutions: Efficient handling of billions of documents and real-time queries
  3. Flexible Applications: From search engines to recommendation systems to AI assistants
  4. Future Potential: Continuing advances in model architecture and hardware enabling new applications

The field continues to evolve rapidly, with improvements in:

  • Model efficiency and compression
  • Search algorithm optimization
  • Integration of multimodal data
  • Reduced computational requirements
  • Enhanced accuracy and relevance

These technologies together form the backbone of modern AI systems, enabling increasingly sophisticated applications while maintaining practical performance requirements for real-world deployment.
