ContextRAG
A scalable vector database system for semantic search with context-aware processing
Introduction
ContextRAG is a context-aware retrieval-augmented generation system designed to address fundamental limitations in LLM applications when processing documents of varying lengths and complexities. By implementing intelligent document classification and adaptive processing strategies, the system achieves improved semantic coherence and retrieval precision compared to traditional RAG approaches.
Research Motivation
Large language models have revolutionized many NLP tasks but face significant constraints when processing lengthy documents due to context window limitations. These constraints create several challenges for real-world LLM applications:
- Loss of semantic coherence across document chunks when traditional uniform chunking strategies are applied
- Inefficient token utilization in embedding models, leading to higher computational costs
- Poor retrieval performance for complex hierarchical documents where context and document structure are critical
ContextRAG addresses these limitations through a novel adaptive processing approach based on document characteristics, enabling more nuanced and contextually relevant information retrieval while optimizing computational resources.
System Architecture
The system implements a three-stage pipeline:
- Document Ingestion & Classification: Documents are processed and classified based on length and complexity metrics
- Semantic Analysis & Embedding Generation: Context-aware embedding strategies are applied based on document classification
- Vector Storage & Retrieval: Optimized vector representations are stored and retrieved using similarity-based search
This architecture provides a flexible framework for handling diverse document collections while maintaining retrieval efficiency and semantic precision.
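To make the flow concrete, a minimal orchestration sketch is shown below; the class and method names (DocumentClassifier, ContextAwareEmbedder, VectorStore) are illustrative placeholders rather than actual modules from the repository:

# Illustrative orchestration of the three-stage pipeline; all class names
# here are hypothetical and do not appear in the ContextRAG codebase.
class ContextRAGPipeline:
    def __init__(self, classifier, embedder, vector_store):
        self.classifier = classifier      # stage 1: ingestion & classification
        self.embedder = embedder          # stage 2: semantic analysis & embedding
        self.vector_store = vector_store  # stage 3: vector storage & retrieval

    def ingest(self, doc_id: str, document: str) -> None:
        category = self.classifier.classify(document)        # "short" / "medium" / "long"
        embedding = self.embedder.embed(document, category)  # strategy depends on category
        self.vector_store.add(doc_id, document, embedding)

    def retrieve(self, query: str, n_results: int = 3):
        query_embedding = self.embedder.embed(query, category="short")
        return self.vector_store.search(query_embedding, n_results=n_results)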
Technical Implementation
The technical implementation of ContextRAG focuses on three key innovations: length-based document classification, optimized semantic similarity computation, and context-aware processing strategies. Each component addresses specific challenges in building effective RAG systems.
Length-Based Document Classification
A fundamental innovation in ContextRAG is its adaptive approach to documents based on token length. Rather than applying a one-size-fits-all processing strategy, the system analyzes document length and complexity to determine the optimal processing approach.
This classification is implemented in src/data_processing/html_to_markdown.py:
def _get_target_folder(self, content: str) -> Path:
    """
    Determine the target folder based on the content length.
    """
    if count_tokens(str(content)) <= 3500:
        return self.folder_path / "short"
    elif 3500 < count_tokens(str(content)) < 15000:
        return self.folder_path / "medium"
    else:
        return self.folder_path / "long"
This classification system enables:
- Processing Optimization - Different strategies for different document lengths
- Model Selection - Appropriate model choice based on token constraints
- Storage Organization - Efficient document categorization for retrieval
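The surrounding driver code is not shown in this excerpt; as a hypothetical illustration, the classifier might be used when saving converted Markdown files:

# Hypothetical usage of _get_target_folder; the method name and file-handling
# details here are assumptions, not the actual code in html_to_markdown.py.
def save_markdown(self, filename: str, content: str) -> None:
    target_folder = self._get_target_folder(content)  # short / medium / long
    target_folder.mkdir(parents=True, exist_ok=True)
    (target_folder / filename).write_text(content, encoding="utf-8")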
The token counting is implemented using the tiktoken library for accurate token estimation:
import tiktoken

def count_tokens(text: str) -> int:
    """Count tokens in a text string using tiktoken."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))
By accurately estimating token counts, the system can make informed decisions about document processing strategies, ensuring optimal use of computational resources while maintaining semantic coherence.
Semantic Similarity Computation
ContextRAG implements sophisticated similarity computation using OpenAI embeddings and cosine similarity. The implementation balances embedding quality with computational efficiency through several optimization techniques.
The core similarity computation is implemented in src/markdown_grouping/file_grouping.py:
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

def compute_similarity(files_dict, checksums, cache):
    """Compute similarity between files using OpenAI embeddings."""
    embeddings = []
    client = OpenAI()
    for filename, content in files_dict.items():
        checksum = checksums[filename]
        if checksum in cache:
            # Reuse the cached embedding for unchanged content
            embeddings.append(cache[checksum])
        else:
            processed_text = preprocess_text(content)
            token_count = count_tokens(processed_text)
            # Files exceeding the embedding model's token limit are not embedded here
            if token_count <= 8000:
                response = client.embeddings.create(
                    model="text-embedding-3-large",
                    input=processed_text,
                    encoding_format="float",
                    dimensions=3072,
                )
                embedding = response.data[0].embedding
                embeddings.append(embedding)
                cache[checksum] = embedding
    # Normalize the embeddings before computing the cosine similarity
    normalized_embeddings = normalize(np.array(embeddings))
    similarity_matrix = cosine_similarity(normalized_embeddings)
    return similarity_matrix
This implementation includes several key optimizations:
- Embedding Caching - Reuses embeddings for identical content, reducing API calls and computational overhead
- Token Limit Handling - Ensures content fits within model constraints, preventing truncation issues
- Normalization - Improves consistency of similarity scores across documents of varying lengths
- High-Dimensional Embeddings - Uses 3072-dimension embeddings for greater precision in semantic representation
These optimizations enable efficient similarity computation even for large document collections, making the system viable for real-world applications with diverse content types.
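The checksums and cache arguments are supplied by the caller; the sketch below shows one way they might be prepared and how the resulting matrix could be used to group related files. The helper function, the threshold value, and the grouping loop are assumptions for illustration rather than the project's actual grouping logic:

import hashlib

def build_checksums(files_dict):
    """Map each filename to an MD5 checksum of its content (illustrative helper)."""
    return {name: hashlib.md5(content.encode("utf-8")).hexdigest()
            for name, content in files_dict.items()}

files_dict = {"intro.md": "...", "setup.md": "..."}
checksums = build_checksums(files_dict)
cache = {}  # in practice this would be persisted between runs
similarity_matrix = compute_similarity(files_dict, checksums, cache)

# Pair up files whose cosine similarity exceeds an assumed threshold.
names = list(files_dict.keys())
THRESHOLD = 0.8
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if similarity_matrix[i][j] >= THRESHOLD:
            print(f"{names[i]} and {names[j]} appear semantically related")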
Context-Aware Processing Strategy
A core innovation in ContextRAG is its differentiated processing strategies based on document characteristics. The system employs distinct approaches for documents of varying lengths, optimizing both computational efficiency and semantic coherence.
| Document Type | Token Range | Processing Approach | Implementation Details |
|---|---|---|---|
| Short | ≤3,500 | Full-context embedding | Direct embedding with text-embedding-3-large (3072 dims) |
| Medium | 3,500-15,000 | Model adaptation | Uses GPT-3.5-Turbo-16K for processing |
| Long | >15,000 | Specialized handling | Custom chunking and hierarchical embedding |
This approach is reflected in the model selection logic in src/markdown_grouping/category_assignment.py:
# Determine the model based on the token count
if token_count <= 3500:
    model = ChatModels.GPT_3_5_TURBO_1106
elif 3500 < token_count < 15000:
    model = ChatModels.GPT_3_5_TURBO_16K
else:
    print(f"Skipped {filename} due to excessive token count ({token_count} tokens).")
    continue
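The ChatModels constants referenced above are defined elsewhere in the project and are not shown in this excerpt; an assumed sketch of that enum might look like:

# Assumed sketch of the ChatModels enum; the actual definition in the
# repository may use different member names or model identifiers.
from enum import Enum

class ChatModels(str, Enum):
    GPT_3_5_TURBO_1106 = "gpt-3.5-turbo-1106"
    GPT_3_5_TURBO_16K = "gpt-3.5-turbo-16k"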
By selecting models and processing strategies based on document characteristics, ContextRAG achieves better performance and cost efficiency compared to uniform processing approaches.
Vector Database Integration
Efficient storage and retrieval of vector embeddings is essential for system performance. ContextRAG integrates ChromaDB, a vector database optimized for similarity search, to enable fast and accurate document retrieval.
The vector database integration is implemented in src/vector_db/main.py:
def query(self, query_texts, n_results=3):
    """Query the collection for similar documents."""
    results = self.collection.query(
        query_texts=query_texts,
        n_results=n_results,
        include=["documents", "distances"],
    )
    return results
This implementation enables:
- Scalable vector storage - Efficient handling of large document collections
- Fast similarity search - Optimized retrieval of relevant documents
- Distance-based ranking - Documents ranked by semantic similarity
The vector database provides a scalable foundation for the system, enabling efficient retrieval even as document collections grow in size and complexity.
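For context, the snippet below sketches how a ChromaDB collection might be created, populated, and queried. The client configuration, collection name, and use of an OpenAI embedding function are assumptions for illustration and are not taken from src/vector_db/main.py:

# Assumed ChromaDB setup; paths, names, and the embedding function choice
# are illustrative rather than the project's actual configuration.
import os
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-3-large",
)
collection = client.get_or_create_collection(
    name="contextrag_documents",
    embedding_function=openai_ef,
)

# Add documents; ChromaDB computes embeddings with the configured function.
collection.add(
    ids=["doc-001", "doc-002"],
    documents=["First markdown document...", "Second markdown document..."],
    metadatas=[{"category": "short"}, {"category": "medium"}],
)

results = collection.query(
    query_texts=["How does adaptive document classification work?"],
    n_results=3,
    include=["documents", "distances"],
)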
Experimental Analysis
While formal benchmarks are still in development, initial testing of ContextRAG demonstrates several qualitative improvements over traditional RAG approaches:
- Retrieval Precision - More accurate document retrieval by preserving semantic context
- Processing Efficiency - Optimized token usage and model selection
- Scalability - Effective handling of documents regardless of length or complexity
These improvements are particularly noticeable when processing complex technical documentation, research papers, and hierarchical content where context preservation is critical for accurate retrieval.
The system shows particular promise for applications requiring high retrieval precision, such as:
- Technical documentation search
- Research literature review
- Legal document analysis
- Knowledge base management
Limitations and Evolution of Context Windows
While ContextRAG provides effective solutions for context management, several limitations and evolving factors should be acknowledged:
Model Selection Trade-offs
The system currently uses GPT-3.5 Turbo models (4K and 16K context variants) due to favorable cost-performance trade-offs during initial development. This represents a deliberate engineering decision balancing:
- Latency Requirements - Larger models typically have higher inference times
- Cost Considerations - Significant cost differences between model tiers
- Retrieval Quality - Diminishing returns beyond certain context sizes
# Current model selection approach
if token_count <= 3500:
    model = ChatModels.GPT_3_5_TURBO_1106  # Lower cost, faster inference
elif 3500 < token_count < 15000:
    model = ChatModels.GPT_3_5_TURBO_16K  # Higher cost, manageable latency
Context Window Evolution
Recent advances in model architecture have dramatically increased context windows:
| Time Period | Leading Models | Typical Context Windows |
|---|---|---|
| 2022–2023 | GPT-3.5, Claude 1, LLaMA 1 | 2K–16K tokens |
| 2023–2024 | GPT-4, Claude 2, LLaMA 2 | 4K–32K tokens |
| 2024–2025 | GPT-4o, Gemini 1.5 Pro, LLaMA 3, DeepSeek-V3 | 128K–2M tokens |
Despite these advances, context limitations remain relevant for several reasons:
- Scale Gap - Many real-world document collections exceed even million-token context windows
- Attention Mechanism Limitations - Performance degradation with extremely long contexts
- Inference Cost - Quadratic scaling of computational costs with context length
- Retrieval Precision - Diminishing quality of retrievals in extremely large contexts
These limitations highlight the continued relevance of intelligent context management systems even as context windows expand.
Future Research Directions
Current work is focused on:
- Adaptive Model Selection - Dynamic selection of models based on document characteristics and query requirements
- Hierarchical Embedding Strategies - Developing multi-level embeddings for very long documents
- Cross-Document Context Preservation - Maintaining relationships between related documents
- Query Optimization - Intelligent query reformulation based on document characteristics
- Evaluation Framework - Comprehensive benchmarking against traditional RAG systems
Conclusion
ContextRAG demonstrates that context-aware processing strategies can significantly improve retrieval performance in RAG systems. By adapting processing approaches based on document characteristics, the system achieves better semantic coherence, more efficient resource utilization, and improved retrieval precision compared to uniform processing approaches.
As language models continue to evolve, the principles of context-aware document processing will remain relevant for optimizing retrieval performance and computational efficiency in real-world applications.
The project is open source and available on GitHub. Contributions and feedback from the community are welcome as I continue to develop and refine the system.