Retrieval-Augmented Generation (RAG) has become a fundamental paradigm for building intelligent systems that provide context-aware and factually grounded answers. Standard RAG systems, which often rely solely on semantic (dense) vector search, can sometimes fall short. While they excel at understanding the meaning behind a query, they can miss important keyword-specific details.
To overcome this limitation, we have implemented an advanced hybrid search mechanism. This approach combines the strengths of dense vectors (for semantic understanding) and sparse vectors (for keyword precision), ensuring a more robust and accurate retrieval process.

This approach is especially useful for documents full of codes, acronyms, numbers, and names. For example, semantic search alone often fails when looking up a contact by name or phone number in a contact list. Another common scenario is manufacturing, where users search for a product or component by serial code or other alphanumeric identifier. In such cases, full-text (keyword) search yields better results and higher user satisfaction.
At Tiledesk, we currently rely on two main vector search engines. The first is Pinecone, a top-tier solution that powers our production SaaS distribution. The second is Qdrant, a newer open-source vector database that also offers a cloud version. Qdrant is the default vector store bundled with our open-source distribution. To support hybrid search, we designed our system to work with both engines, adopting an adaptive approach that allows for compatibility not only with these two but also with other vector stores in the future.

This article details our hybrid search architecture, emphasizing its core principles: flexibility, modularity, and performance. We will explore how the system is designed to be agnostic to the choice of embedding models and vector stores, and we will dive into the technical specifics of the indexing and retrieval processes, using code snippets to illustrate key functions. All inference is performed locally on a machine equipped with an NVIDIA T4 GPU, ensuring efficient processing.
Core Components of Our Hybrid Search System
Our RAG architecture is built on a set of modular and interchangeable components, giving the user complete control over the AI models used in the pipeline.
Dense Vector Generation
Dense vectors are crucial for capturing the semantic meaning of text. Our system allows users to select from a wide range of embedding models to generate these vectors. This includes popular options like OpenAI’s text-embedding series or powerful open-source alternatives such as bge-m3. This flexibility is managed by a factory pattern that dynamically loads the chosen model, making it easy to experiment and optimize for different use cases.
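The snippet below is a minimal sketch of what such a factory can look like; the function name and the specific LangChain classes are illustrative choices, not the exact Tiledesk implementation.

# Illustrative sketch of an embedding-model factory; names are assumptions.
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

def get_embeddings(model_name: str):
    """Return a dense embedding object for the requested model."""
    if model_name.startswith("text-embedding"):
        # OpenAI dense embeddings (e.g. text-embedding-3-small)
        return OpenAIEmbeddings(model=model_name)
    if model_name == "bge-m3":
        # Open-source alternative served locally
        return HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
    raise ValueError(f"Unsupported embedding model: {model_name}")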
Sparse Vector Generation
To complement dense vectors, we generate sparse vectors that excel at keyword matching. Sparse vectors represent text by mapping tokens to dimensions in a high-dimensional space, effectively weighting the importance of each term. Our implementation supports cutting-edge sparse models like SPLADE and BGE-M3, which can generate both dense and sparse vectors in a single pass.
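To make this concrete, a sparse vector is typically carried around as a pair of parallel lists: token (vocabulary) indices and their weights. The example below is purely illustrative, with invented ids and weights; this indices/values layout is the same one Pinecone expects for sparse_values, as shown later in the upsert code.

# Purely illustrative sparse vector: token ids and weights are invented.
sparse_vector = {
    "indices": [1045, 2270, 30522],  # vocabulary positions of the salient tokens
    "values": [0.83, 1.42, 0.37],    # learned importance weight of each token
}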
To efficiently manage these models, which can be resource-intensive, we implemented the TiledeskSparseEncoders class. This class acts as a factory and a cache, ensuring that models are loaded only once and reused across requests, which significantly reduces latency and memory consumption. It uses a Least Recently Used (LRU) cache policy to manage the loaded models.
### sparse_encoders.py
import logging
from collections import OrderedDict
from threading import Lock
from typing import Union

class TiledeskSparseEncoders:
    # LRU cache with a maximum size of 2 models
    _encoder_cache = OrderedDict()
    _max_cache_size = 2
    _logger = logging.getLogger(__name__)
    _cache_lock = Lock()

    def __init__(self, model_name: str):
        self.model_name = model_name.lower()
        self.encoder = self._get_cached_encoder(self.model_name)
        # ...

    @classmethod
    def _get_cached_encoder(cls, model_name: str) -> Union[TiledeskSpladeEncoder, TiledeskBGEM3]:
        with cls._cache_lock:
            # Check if the model is already in the cache
            if model_name in cls._encoder_cache:
                cls._logger.info(f"Reusing cached instance of: {model_name}")
                # Move to the end (most recently used)
                encoder = cls._encoder_cache.pop(model_name)
                cls._encoder_cache[model_name] = encoder
                return encoder
            # Create a new encoder if not in the cache
            if model_name == "splade":
                cls._logger.info("Creating new SpladeEncoder instance")
                encoder = TiledeskSpladeEncoder()
            elif model_name == "bge-m3":
                cls._logger.info("Creating new BGEM3 instance")
                encoder = TiledeskBGEM3()
            else:
                raise ValueError(f"Unsupported model: {model_name}. Use 'splade' or 'bge-m3'.")
            # ... (LRU cache eviction)
            cls._encoder_cache[model_name] = encoder
            return encoder
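Using the class is then straightforward: instantiate it with the desired model name and call encode_documents, the same call the indexing code below relies on. The sample texts in this usage sketch are invented.

# Minimal usage sketch; the sample texts are invented.
sparse_encoder = TiledeskSparseEncoders("bge-m3")
doc_sparse_vectors = sparse_encoder.encode_documents(
    ["Order ABC-123 was shipped on Monday", "Contact: John Smith, +1 555 0100"],
    batch_size=32,
)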
Vector Store Abstraction
A key design feature is the ability to switch between different vector store backends without altering the core logic. We currently support Pinecone and Qdrant, two leading vector databases. This abstraction is achieved through a dependency injection mechanism using the @inject_repo decorator, which dynamically instantiates the correct repository class based on the user’s configuration.
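The real decorator is not reproduced here, but conceptually it resolves the repository class from the request configuration and passes an instance to the handler. The sketch below is purely illustrative: the placeholder repository classes and the item.engine.type lookup are assumptions, not the actual Tiledesk code.

# Purely illustrative sketch of the idea behind @inject_repo.
from functools import wraps

class PineconeRepository:   # placeholder for the real Pinecone repository
    async def add_item_hybrid(self, item): ...

class QdrantRepository:     # placeholder for the real Qdrant repository
    async def add_item_hybrid(self, item): ...

REPOSITORIES = {"pinecone": PineconeRepository, "qdrant": QdrantRepository}

def inject_repo(func):
    """Instantiate the repository selected in the configuration and inject it."""
    @wraps(func)
    async def wrapper(item, *args, **kwargs):
        repo = REPOSITORIES[item.engine.type]()  # resolved per request (attribute name assumed)
        return await func(item, *args, repo=repo, **kwargs)
    return wrapper

@inject_repo
async def index_document(item, repo=None):
    # The handler receives whichever repository matches the user's configuration
    return await repo.add_item_hybrid(item)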
The Indexing Process: A Tale of Two Vector Stores
The way we index data for hybrid search differs significantly depending on the vector store, as each has its own method for handling dense and sparse vectors.
Indexing with Pinecone
Pinecone supports hybrid search by storing dense and sparse vectors for the same data point and combining their scores at query time. Our indexing process for Pinecone is handled by the add_item_hybrid method.
The key steps are:
- Chunk the document: The input text is divided into smaller, manageable chunks.
- Generate Dense Embeddings: A dense vector is created for each chunk using the user-selected embedding model.
- Generate Sparse Vectors: The TiledeskSparseEncoders class is used to create a sparse vector for each chunk.
- Upsert to Pinecone: The dense vector, sparse vector, and metadata are combined and uploaded to the Pinecone index. The index metric must be set to dotproduct to support sparse values (see the index-creation sketch after this list).
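For completeness, here is a hedged sketch of creating a serverless Pinecone index with the dotproduct metric, which is the prerequisite for upserting sparse_values. The index name, dimension, cloud, and region are placeholders.

# Illustrative only: a Pinecone serverless index that can hold dense + sparse vectors.
# Name, dimension, cloud and region are placeholders.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="tiledesk-hybrid-demo",
    dimension=1024,              # must match the dense embedding size (e.g. 1024 for bge-m3)
    metric="dotproduct",         # required to store and score sparse_values
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)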
The upsert_vector_store_hybrid function prepares each data point as a dictionary containing the dense vector (values) and the sparse vector (sparse_values).
### pinecone_repository_serverless.py
async def add_item_hybrid(self, item, embedding_obj=None, embedding_dimension=None):
    # ... (document loading and chunking) ...
    contents = [chunk.page_content for chunk in chunks]
    # Initialize the sparse encoder based on user's choice
    sparse_encoder = TiledeskSparseEncoders(item.sparse_encoder)
    doc_sparse_vectors = sparse_encoder.encode_documents(contents, batch_size=item.hybrid_batch_size)
    # Upsert data to the index
    async with vector_store.async_index as indice:
        await self.upsert_vector_store_hybrid(indice,
                                              contents,
                                              chunks,
                                              item.id,
                                              namespace=item.namespace,
                                              engine=item.engine,
                                              embeddings=embedding_obj,
                                              sparse_vectors=doc_sparse_vectors)
    # ...

@staticmethod
async def upsert_vector_store_hybrid(indice, contents, chunks, metadata_id, engine, namespace, embeddings, sparse_vectors):
    # ... (batching logic) ...
    for i in range(0, len(contents), embedding_chunk_size):
        # ... (prepare batches) ...
        embedding_values = await embeddings.aembed_documents(chunk_texts)
        sparse_values = sparse_vectors[i: i + embedding_chunk_size]
        vector_tuples = [
            {'id': idr,
             'values': embedding,
             'metadata': chunk,
             'sparse_values': sparse_value}
            for idr, embedding, chunk, sparse_value in
            zip(chunk_ids, embedding_values, chunk_metadatas, sparse_values)
        ]
        resp = await indice.upsert(vectors=vector_tuples, namespace=namespace)
Indexing with Qdrant
Qdrant has native support for multiple named vectors per point, which makes storing dense and sparse vectors for hybrid search very intuitive.
When creating a Qdrant collection, we define separate configurations for dense and sparse vectors, named text-dense and text-sparse respectively.
### qdrant_repository_local.py
@staticmethod
async def create_index(engine, embeddings, emb_dimension) -> QdrantVectorStore:
    # ... (client setup) ...
    client.create_collection(
        collection_name=collection_name,
        vectors_config={
            "text-dense": models.VectorParams(
                size=emb_dimension,
                distance=metric_distance
            )
        },
        sparse_vectors_config={
            "text-sparse": models.SparseVectorParams(
                index=models.SparseIndexParams(on_disk=True)
            )
        }
    )
    # ...
The upsert_vector_store_hybrid function for Qdrant then creates PointStruct objects where the vector field is a dictionary containing both the dense and sparse representations.
### qdrant_repository_local.py
@staticmethod
async def upsert_vector_store_hybrid(vector_store: QdrantVectorStore, contents, chunks, metadata_id, engine, namespace, embeddings, sparse_vectors):
    # ... (prepare batches and generate vectors) ...
    resp = vector_store.client.upsert(
        collection_name=engine.index_name,
        points=[
            models.PointStruct(
                id=idr,
                vector={
                    "text-dense": embedding,
                    "text-sparse": sparse_value
                },
                payload={"metadata": chunk, "page_content": page_content}
            )
            for idr, embedding, chunk, sparse_value, page_content in
            zip(chunk_ids, embedding_values, chunk_metadatas, sparse_values, chunk_texts)
        ]
    )
The Retrieval Process: Aggregating Scores
The retrieval stage is where the magic of hybrid search happens. After generating dense and sparse vectors for the user’s query, the system queries the vector store. The method for combining the scores from both search types is again specific to the chosen backend.
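Before hitting either backend, the query itself is encoded twice: once with the dense embedding model and once with the sparse encoder. The sketch below shows the general shape of that step; aembed_query comes from the LangChain embeddings interface, while encode_queries is an assumed counterpart of the encode_documents call shown earlier, not an excerpt from our codebase.

# Illustrative sketch of encoding a user query for hybrid retrieval.
# encode_queries is an assumed counterpart of encode_documents.
async def encode_query(question: str, embeddings, sparse_model: str = "bge-m3"):
    dense_vector = await embeddings.aembed_query(question)        # semantic representation
    sparse_encoder = TiledeskSparseEncoders(sparse_model)
    sparse_vector = sparse_encoder.encode_queries([question])[0]  # keyword representation
    return dense_vector, sparse_vector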
Retrieval with Pinecone: Linear Combination
With Pinecone, the weighting between dense and sparse results is applied on the client side: we scale the query vectors before sending them, so Pinecone's single dot-product score already reflects the combination. We use a weighted linear combination of the dense and sparse components, controlled by an alpha parameter:
- alpha = 1.0 for pure dense (semantic) search.
- alpha = 0.0 for pure sparse (keyword) search.
- 0 < alpha < 1 for a hybrid combination.
The hybrid_score_norm function implements this logic, which is then used in the perform_hybrid_search method before calling the Pinecone client.
### sparse_util.py
def hybrid_score_norm(dense, sparse, alpha: float):
    # Convex combination: alpha weights the dense part, (1 - alpha) the sparse part
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # Scale sparse vector values
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    # Scale dense vector values
    return [v * alpha for v in dense], hs
### pinecone_repository_serverless.py
async def perform_hybrid_search(self, question_answer, index, dense_vector, sparse_vector):
    # Normalize scores with the alpha parameter
    dense, sparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=question_answer.alpha)
    results = await index.query(
        top_k=question_answer.top_k,
        vector=dense,
        sparse_vector=sparse,
        namespace=question_answer.namespace,
        include_metadata=True
    )
    return results
Retrieval with Qdrant: Reciprocal Rank Fusion (RRF)
Qdrant offers a more sophisticated, built-in method for combining search results: Reciprocal Rank Fusion (RRF). RRF is a powerful algorithm that combines results from multiple ranked lists without needing to normalize scores or tune weighting parameters like alpha. It evaluates results based on their rank in each list, giving higher importance to items that rank consistently high across different retrieval methods.
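As a quick illustration of the idea (this is not Qdrant's internal code), RRF gives each document a score of 1 / (k + rank) in every result list it appears in, where k is a smoothing constant, commonly 60, and then sums those contributions:

# Toy illustration of Reciprocal Rank Fusion; not Qdrant's internal implementation.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_A", "doc_B", "doc_C"]    # ranking from the dense (semantic) query
sparse_hits = ["doc_C", "doc_A", "doc_D"]   # ranking from the sparse (keyword) query
print(rrf([dense_hits, sparse_hits]))       # doc_A and doc_C rank highest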
Our Qdrant implementation leverages this by issuing a FusionQuery with the fusion mode set to RRF. We use prefetch to execute searches against both the text-dense and text-sparse indices in parallel, and Qdrant's engine handles the fusion automatically.
### qdrant_repository_local.py
async def perform_hybrid_search(self, question_answer, index, dense_vector, sparse_vector):
    # No manual score normalization is needed
    dense = dense_vector
    sparse = sparse_vector
    # ... (filter setup) ...
    search_result = index.query_points(
        collection_name=question_answer.engine.index_name,
        query=models.FusionQuery(
            fusion=models.Fusion.RRF  # Use Reciprocal Rank Fusion
        ),
        prefetch=[
            models.Prefetch(
                query=dense,
                using="text-dense"
            ),
            models.Prefetch(
                query=sparse,
                using="text-sparse"
            ),
        ],
        query_filter=filter_qdrant,
        limit=question_answer.top_k,
    ).points
    # ... (format results) ...
    return results
The Results
This hybrid search architecture provides a powerful and flexible foundation for our Knowledge base module. By abstracting the choice of AI models and vector databases, we empower users to build highly optimized and accurate retrieval pipelines tailored to their specific needs. The different strategies for indexing and retrieval in Pinecone and Qdrant highlight the importance of understanding the capabilities of the underlying vector store. While Pinecone’s linear combination offers simple, tunable control, Qdrant’s built-in Reciprocal Rank Fusion provides a sophisticated, parameter-free method for result aggregation. This modular design ensures Tiledesk’s RAG remains at the forefront of information retrieval technology.