ContextRAG

A scalable vector database system for semantic search with context-aware processing

Introduction

ContextRAG is a context-aware retrieval-augmented generation system designed to address fundamental limitations in LLM applications when processing documents of varying lengths and complexities. By implementing intelligent document classification and adaptive processing strategies, the system achieves improved semantic coherence and retrieval precision compared to traditional RAG approaches.

Research Motivation

Large language models have revolutionized many NLP tasks but face significant constraints when processing lengthy documents due to context window limitations. These constraints create several challenges for real-world LLM applications:

  • Loss of semantic coherence across document chunks when traditional uniform chunking strategies are applied
  • Inefficient token utilization in embedding models, leading to higher computational costs
  • Poor retrieval performance for complex hierarchical documents where context and document structure are critical

ContextRAG addresses these limitations through a novel adaptive processing approach based on document characteristics, enabling more nuanced and contextually relevant information retrieval while optimizing computational resources.

System Architecture

ContextRAG's modular architecture enables efficient processing of documents regardless of size through specialized pipelines.

The system implements a three-stage pipeline:

  1. Document Ingestion & Classification: Documents are processed and classified based on length and complexity metrics
  2. Semantic Analysis & Embedding Generation: Context-aware embedding strategies are applied based on document classification
  3. Vector Storage & Retrieval: Optimized vector representations are stored and retrieved using similarity-based search

This architecture provides a flexible framework for handling diverse document collections while maintaining retrieval efficiency and semantic precision.
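
The concrete components are described in the sections below; as a purely illustrative sketch, the three stages can be pictured as a classify → embed → retrieve flow. The helpers here use word counts and a plain dictionary as stand-ins for the real token counting, embedding models, and vector database:

# Purely illustrative, self-contained sketch of the classify -> embed -> retrieve
# flow; word counts and a plain dict stand in for tiktoken, OpenAI embeddings,
# and ChromaDB, which are covered in the sections below.

def classify(text: str) -> str:
    """Stage 1: classify a document by approximate length."""
    words = len(text.split())  # word count as a rough stand-in for token count
    if words <= 3500:
        return "short"
    return "medium" if words < 15000 else "long"

def embed(text: str) -> list[float]:
    """Stage 2: stand-in embedding (the real system calls an embedding model)."""
    return [len(text) / 1000.0, len(text.split()) / 100.0]

def retrieve(store: dict[str, list[float]], query_vec: list[float], k: int = 3) -> list[str]:
    """Stage 3: rank stored documents by squared distance to the query vector."""
    def distance(name: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(store[name], query_vec))
    return sorted(store, key=distance)[:k]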

Technical Implementation

The technical implementation of ContextRAG focuses on three key innovations: length-based document classification, optimized semantic similarity computation, and context-aware processing strategies. Each component addresses specific challenges in building effective RAG systems.

Length-Based Document Classification

A fundamental innovation in ContextRAG is its adaptive approach to documents based on token length. Rather than applying a one-size-fits-all processing strategy, the system analyzes document length and complexity to determine the optimal processing approach.

This classification is implemented in src/data_processing/html_to_markdown.py:

def _get_target_folder(self, content: str) -> Path:
    """
    Determine the target folder based on the content length.
    """
    # Count tokens once and route the document to its length bucket
    token_count = count_tokens(str(content))
    if token_count <= 3500:
        return self.folder_path / "short"
    elif token_count < 15000:
        return self.folder_path / "medium"
    else:
        return self.folder_path / "long"

This classification system enables:

  1. Processing Optimization - Different strategies for different document lengths
  2. Model Selection - Appropriate model choice based on token constraints
  3. Storage Organization - Efficient document categorization for retrieval

The token counting is implemented using the tiktoken library for accurate token estimation:

import tiktoken

def count_tokens(text: str) -> int:
    """Count tokens in a text string using tiktoken."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

By accurately estimating token counts, the system can make informed decisions about document processing strategies, ensuring optimal use of computational resources while maintaining semantic coherence.
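
As a quick usage example (assuming tiktoken is installed and count_tokens is defined as above), the same function can be used to preview which length bucket a document will land in:

# Example usage: previewing a document's length bucket.
# The sample text is arbitrary; thresholds mirror the classification rules above.
sample_text = "ContextRAG adapts its processing strategy to document length. " * 50
tokens = count_tokens(sample_text)
bucket = "short" if tokens <= 3500 else ("medium" if tokens < 15000 else "long")
print(f"{tokens} tokens -> {bucket}")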

Semantic Similarity Computation

ContextRAG implements sophisticated similarity computation using OpenAI embeddings and cosine similarity. The implementation balances embedding quality with computational efficiency through several optimization techniques.

The core similarity computation is implemented in src/markdown_grouping/file_grouping.py:

import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

def compute_similarity(files_dict, checksums, cache):
    """Compute similarity between files using OpenAI embeddings."""
    embeddings = []
    client = OpenAI()
    for filename, content in files_dict.items():
        checksum = checksums[filename]
        if checksum in cache:
            # Reuse the cached embedding for unchanged content
            embeddings.append(cache[checksum])
        else:
            processed_text = preprocess_text(content)
            token_count = count_tokens(processed_text)

            # Only embed documents within the model's input limit;
            # anything larger is skipped here
            if token_count <= 8000:
                response = client.embeddings.create(
                    model="text-embedding-3-large",
                    input=processed_text,
                    encoding_format="float",
                    dimensions=3072,
                )
                embedding = response.data[0].embedding
                embeddings.append(embedding)
                cache[checksum] = embedding

    # Normalize the embeddings before computing the cosine similarity
    normalized_embeddings = normalize(np.array(embeddings))
    similarity_matrix = cosine_similarity(normalized_embeddings)
    return similarity_matrix

This implementation includes several key optimizations:

  1. Embedding Caching - Reuses embeddings for identical content, reducing API calls and computational overhead
  2. Token Limit Handling - Skips content that exceeds the embedding model's input limit, preventing truncation issues
  3. Normalization - Improves consistency of similarity scores across documents of varying lengths
  4. High-Dimensional Embeddings - Uses 3072-dimension embeddings for greater precision in semantic representation

These optimizations enable efficient similarity computation even for large document collections, making the system viable for real-world applications with diverse content types.
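
The compute_similarity function receives its checksums and cache from the caller. A minimal sketch of how these could be produced and persisted is shown below; the SHA-256 checksums and JSON-backed cache are assumptions made for illustration, not necessarily the repository's exact mechanism:

# Illustrative sketch of checksum computation and a JSON-backed embedding cache;
# the actual cache layout in the repository may differ.
import hashlib
import json
from pathlib import Path

def compute_checksums(files_dict: dict[str, str]) -> dict[str, str]:
    # A content checksum lets identical documents reuse a cached embedding
    return {
        filename: hashlib.sha256(content.encode("utf-8")).hexdigest()
        for filename, content in files_dict.items()
    }

def load_cache(path: Path) -> dict[str, list[float]]:
    # Load previously computed embeddings keyed by checksum, if any exist
    return json.loads(path.read_text()) if path.exists() else {}

def save_cache(path: Path, cache: dict[str, list[float]]) -> None:
    # Persist the cache so subsequent runs avoid repeat API calls
    path.write_text(json.dumps(cache))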

Context-Aware Processing Strategy

A core innovation in ContextRAG is its differentiated processing strategies based on document characteristics. The system employs distinct approaches for documents of varying lengths, optimizing both computational efficiency and semantic coherence.

| Document Type | Token Range   | Processing Approach    | Implementation Details                                     |
|---------------|---------------|------------------------|------------------------------------------------------------|
| Short         | ≤3,500        | Full-context embedding | Direct embedding with text-embedding-3-large (3072 dims)   |
| Medium        | 3,500–15,000  | Model adaptation       | Uses GPT-3.5-Turbo-16K for processing                      |
| Long          | >15,000       | Specialized handling   | Custom chunking and hierarchical embedding                 |

This approach is reflected in the model selection logic in src/markdown_grouping/category_assignment.py:

# Determine the model based on the token count
if token_count <= 3500:
    model = ChatModels.GPT_3_5_TURBO_1106
elif 3500 < token_count < 15000:
    model = ChatModels.GPT_3_5_TURBO_16K
else:
    print(f"Skipped {filename} due to excessive token count ({token_count} tokens).")
    continue

By selecting models and processing strategies based on document characteristics, ContextRAG achieves better performance and cost efficiency compared to uniform processing approaches.
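
The repository's handling of long documents is not excerpted here. As a hedged sketch of what the "custom chunking and hierarchical embedding" strategy in the table above could look like, a token-based splitter with overlapping windows is one plausible building block (the chunk size and overlap values are illustrative only, not the project's settings):

# Illustrative sketch of token-based chunking with overlap for long documents;
# chunk_size and overlap are assumptions, not the repository's configuration.
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 3000, overlap: int = 200) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk to keep overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

Each chunk can then be embedded individually and a document-level representation derived (for example, by averaging chunk embeddings), which is one common way to realize hierarchical embedding.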

Vector Database Integration

Efficient storage and retrieval of vector embeddings is essential for system performance. ContextRAG integrates ChromaDB, a vector database optimized for similarity search, to enable fast and accurate document retrieval.

The vector database integration is implemented in src/vector_db/main.py:

def query(self, query_texts, n_results=3):
    """Query the collection for similar documents."""
    results = self.collection.query(
        query_texts=query_texts,
        n_results=n_results,
        include=["documents", "distances"],
    )
    return results

This implementation enables:

  1. Scalable vector storage - Efficient handling of large document collections
  2. Fast similarity search - Optimized retrieval of relevant documents
  3. Distance-based ranking - Documents ranked by semantic similarity

The vector database provides a scalable foundation for the system, enabling efficient retrieval even as document collections grow in size and complexity.
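
The query method above assumes an already-populated collection. For context, creating a ChromaDB collection and adding documents with the chromadb client looks roughly like the sketch below; the collection name, persistence path, and sample documents are placeholders, not the values used in src/vector_db/main.py:

# Illustrative sketch of setting up a ChromaDB collection; names and paths
# are placeholders rather than the repository's actual configuration.
import chromadb

client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="contextrag_docs")

# Add documents; ChromaDB embeds them with its default embedding function
# unless a custom one is supplied.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "ContextRAG classifies documents by token length.",
        "Long documents are chunked before embedding.",
    ],
    metadatas=[{"bucket": "short"}, {"bucket": "short"}],
)

# Retrieve the most similar documents along with their distances
results = collection.query(query_texts=["How are long documents handled?"], n_results=2)
print(results["documents"], results["distances"])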

Experimental Analysis

While formal benchmarks are still in development, initial testing of ContextRAG demonstrates several qualitative improvements over traditional RAG approaches:

  1. Retrieval Precision - More accurate document retrieval by preserving semantic context
  2. Processing Efficiency - Optimized token usage and model selection
  3. Scalability - Effective handling of documents regardless of length or complexity

These improvements are particularly noticeable when processing complex technical documentation, research papers, and hierarchical content where context preservation is critical for accurate retrieval.

The system shows particular promise for applications requiring high retrieval precision, such as:

  • Technical documentation search
  • Research literature review
  • Legal document analysis
  • Knowledge base management

Limitations and Evolution of Context Windows

While ContextRAG provides effective solutions for context management, several limitations and evolving factors should be acknowledged:

Model Selection Trade-offs

The system currently uses GPT-3.5 Turbo models (4K and 16K context variants) due to favorable cost-performance trade-offs during initial development. This represents a deliberate engineering decision balancing:

  • Latency Requirements - Larger models typically have higher inference times
  • Cost Considerations - Significant cost differences between model tiers
  • Retrieval Quality - Diminishing returns beyond certain context sizes
# Current model selection approach
if token_count <= 3500:
    model = ChatModels.GPT_3_5_TURBO_1106  # Lower cost, faster inference
elif 3500 < token_count < 15000:
    model = ChatModels.GPT_3_5_TURBO_16K   # Higher cost, manageable latency

Context Window Evolution

Recent advances in model architecture have dramatically increased context windows:

| Time Period | Leading Models                               | Typical Context Windows |
|-------------|----------------------------------------------|-------------------------|
| 2022–2023   | GPT-3.5, Claude 1, LLaMA 1                   | 2K–16K tokens           |
| 2023–2024   | GPT-4, Claude 2, LLaMA 2                     | 4K–32K tokens           |
| 2024–2025   | GPT-4o, Gemini 1.5 Pro, LLaMA 3, DeepSeek-V3 | 128K–2M tokens          |

Despite these advances, context limitations remain relevant for several reasons:

  1. Scale Gap - Many real-world document collections exceed even million-token context windows
  2. Attention Mechanism Limitations - Performance degradation with extremely long contexts
  3. Inference Cost - Quadratic scaling of computational costs with context length
  4. Retrieval Precision - Diminishing quality of retrievals in extremely large contexts

These limitations highlight the continued relevance of intelligent context management systems even as context windows expand.
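
To make the inference-cost point concrete: because self-attention cost grows with the square of the sequence length, moving from an 8K to a 128K context multiplies the attention cost by roughly (128,000 / 8,000)² = 256×, constant factors aside. A back-of-the-envelope sketch:

# Back-of-the-envelope comparison of self-attention cost, which scales with
# the square of the sequence length (all constant factors omitted).
def relative_attention_cost(long_ctx: int, short_ctx: int) -> float:
    return (long_ctx / short_ctx) ** 2

print(relative_attention_cost(128_000, 8_000))    # 256.0
print(relative_attention_cost(2_000_000, 8_000))  # 62500.0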

Future Research Directions

Current work is focused on:

  1. Adaptive Model Selection - Dynamic selection of models based on document characteristics and query requirements
  2. Hierarchical Embedding Strategies - Developing multi-level embeddings for very long documents
  3. Cross-Document Context Preservation - Maintaining relationships between related documents
  4. Query Optimization - Intelligent query reformulation based on document characteristics
  5. Evaluation Framework - Comprehensive benchmarking against traditional RAG systems

Conclusion

ContextRAG demonstrates that context-aware processing strategies can significantly improve retrieval performance in RAG systems. By adapting processing approaches based on document characteristics, the system achieves better semantic coherence, more efficient resource utilization, and improved retrieval precision compared to uniform processing approaches.

As language models continue to evolve, the principles of context-aware document processing will remain relevant for optimizing retrieval performance and computational efficiency in real-world applications.

The project is open source and available on GitHub. Contributions and feedback from the community are welcome as I continue to develop and refine the system.