Retrieval-Augmented Generation (RAG) has become a fundamental paradigm for building intelligent systems that provide context-aware and factually grounded answers. Standard RAG systems, which often rely solely on semantic (dense) vector search, can sometimes fall short. While they excel at understanding the meaning behind a query, they can miss important keyword-specific details.
To overcome this limitation, we have implemented an advanced hybrid search mechanism. This approach combines the strengths of dense vectors (for semantic understanding) and sparse vectors (for keyword precision), ensuring a more robust and accurate retrieval process.

This approach is especially useful for documents full of codes, acronyms, numbers, and names. For example, semantic search alone often fails when looking up a contact by name or phone number in a contact list. Another common scenario is manufacturing, where users search for a product or component by serial code or other alphanumeric identifier. In such cases, full-text (keyword) search yields better results and higher user satisfaction.
At Tiledesk, we currently rely on two main vector search engines. The first is Pinecone, a top-tier solution that powers our production SaaS distribution. The second is Qdrant, a newer open-source vector database that also offers a cloud version. Qdrant is the default vector store bundled with our open-source distribution. To support hybrid search, we designed our system to work with both engines, adopting an adaptive approach that allows for compatibility not only with these two but also with other vector stores in the future.

This article details our hybrid search architecture, emphasizing its core principles: flexibility, modularity, and performance. We will explore how the system is designed to be agnostic to the choice of embedding models and vector stores, and we will dive into the technical specifics of the indexing and retrieval processes, using code snippets to illustrate key functions. All inference is performed locally on a machine equipped with an NVIDIA T4 GPU, ensuring efficient processing.
Core Components of Our Hybrid Search System
Our RAG architecture is built on a set of modular and interchangeable components, giving the user complete control over the AI models used in the pipeline.
Dense Vector Generation
Dense vectors are crucial for capturing the semantic meaning of text. Our system allows users to select from a wide range of embedding models to generate these vectors. This includes popular options like OpenAI’s text-embedding series or powerful open-source alternatives such as bge-m3. This flexibility is managed by a factory pattern that dynamically loads the chosen model, making it easy to experiment and optimize for different use cases.
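The snippet below is a minimal sketch of what such a factory can look like; the function name and the specific LangChain classes are illustrative choices, not the exact Tiledesk implementation.

# Illustrative sketch of an embedding-model factory; names are assumptions.
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

def get_embeddings(model_name: str):
    """Return a dense embedding object for the requested model."""
    if model_name.startswith("text-embedding"):
        # OpenAI dense embeddings (e.g. text-embedding-3-small)
        return OpenAIEmbeddings(model=model_name)
    if model_name == "bge-m3":
        # Open-source alternative served locally
        return HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
    raise ValueError(f"Unsupported embedding model: {model_name}")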
Sparse Vector Generation
To complement dense vectors, we generate sparse vectors that excel at keyword matching. Sparse vectors represent text by mapping tokens to dimensions in a high-dimensional space, effectively weighting the importance of each term. Our implementation supports cutting-edge sparse models like SPLADE and BGE-M3, which can generate both dense and sparse vectors in a single pass.
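To make this concrete, a sparse vector is typically carried around as a pair of parallel lists: token (vocabulary) indices and their weights. The example below is purely illustrative, with invented ids and weights; this indices/values layout is the same one Pinecone expects for sparse_values, as shown later in the upsert code.

# Purely illustrative sparse vector: token ids and weights are invented.
sparse_vector = {
    "indices": [1045, 2270, 30522],  # vocabulary positions of the salient tokens
    "values": [0.83, 1.42, 0.37],    # learned importance weight of each token
}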
To efficiently manage these models, which can be resource-intensive, we implemented the TiledeskSparseEncoders class. This class acts as a factory and a cache, ensuring that models are loaded only once and reused across requests, which significantly reduces latency and memory consumption. It uses a Least Recently Used (LRU) cache policy to manage the loaded models.
### sparse_encoders.py
import logging
from collections import OrderedDict
from threading import Lock
from typing import Union

class TiledeskSparseEncoders:
    # LRU cache with a maximum size of 2 models
    _encoder_cache = OrderedDict()
    _max_cache_size = 2
    _logger = logging.getLogger(__name__)
    _cache_lock = Lock()

    def __init__(self, model_name: str):
        self.model_name = model_name.lower()
        self.encoder = self._get_cached_encoder(self.model_name)
        # ...

    @classmethod
    def _get_cached_encoder(cls, model_name: str) -> Union[TiledeskSpladeEncoder, TiledeskBGEM3]:
        with cls._cache_lock:
            # Check if the model is already in the cache
            if model_name in cls._encoder_cache:
                cls._logger.info(f"Reusing cached instance of: {model_name}")
                # Move to the end (most recently used)
                encoder = cls._encoder_cache.pop(model_name)
                cls._encoder_cache[model_name] = encoder
                return encoder
            # Create a new encoder if not in the cache
            if model_name == "splade":
                cls._logger.info("Creating new SpladeEncoder instance")
                encoder = TiledeskSpladeEncoder()
            elif model_name == "bge-m3":
                cls._logger.info("Creating new BGEM3 instance")
                encoder = TiledeskBGEM3()
            else:
                raise ValueError(f"Unsupported model: {model_name}. Use 'splade' or 'bge-m3'.")
            # ... (LRU cache eviction)
            cls._encoder_cache[model_name] = encoder
            return encoder
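Using the class is then straightforward: instantiate it with the desired model name and call encode_documents, the same call the indexing code below relies on. The sample texts in this usage sketch are invented.

# Minimal usage sketch; the sample texts are invented.
sparse_encoder = TiledeskSparseEncoders("bge-m3")
doc_sparse_vectors = sparse_encoder.encode_documents(
    ["Order ABC-123 was shipped on Monday", "Contact: John Smith, +1 555 0100"],
    batch_size=32,
)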
Vector Store Abstraction
A key design feature is the ability to switch between different vector store backends without altering the core logic. We currently support Pinecone and Qdrant, two leading vector databases. This abstraction is achieved through a dependency injection mechanism using the @inject_repo decorator, which dynamically instantiates the correct repository class based on the user’s configuration.
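The real decorator is not reproduced here, but conceptually it resolves the repository class from the request configuration and passes an instance to the handler. The sketch below is purely illustrative: the placeholder repository classes and the item.engine.type lookup are assumptions, not the actual Tiledesk code.

# Purely illustrative sketch of the idea behind @inject_repo.
from functools import wraps

class PineconeRepository:   # placeholder for the real Pinecone repository
    async def add_item_hybrid(self, item): ...

class QdrantRepository:     # placeholder for the real Qdrant repository
    async def add_item_hybrid(self, item): ...

REPOSITORIES = {"pinecone": PineconeRepository, "qdrant": QdrantRepository}

def inject_repo(func):
    """Instantiate the repository selected in the configuration and inject it."""
    @wraps(func)
    async def wrapper(item, *args, **kwargs):
        repo = REPOSITORIES[item.engine.type]()  # resolved per request (attribute name assumed)
        return await func(item, *args, repo=repo, **kwargs)
    return wrapper

@inject_repo
async def index_document(item, repo=None):
    # The handler receives whichever repository matches the user's configuration
    return await repo.add_item_hybrid(item)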
The Indexing Process: A Tale of Two Vector Stores
The way we index data for hybrid search differs significantly depending on the vector store, as each has its own method for handling dense and sparse vectors.
Indexing with Pinecone
Pinecone supports hybrid search by storing dense and sparse vectors for the same data point and combining their scores at query time. Our indexing process for Pinecone is handled by the add_item_hybrid method.
The key steps are:
- Chunk the document: The input text is divided into smaller, manageable chunks.
- Generate Dense Embeddings: A dense vector is created for each chunk using the user-selected embedding model.
- Generate Sparse Vectors: The TiledeskSparseEncoders class is used to create a sparse vector for each chunk.
- Upsert to Pinecone: The dense vector, sparse vector, and metadata are combined and uploaded to the Pinecone index. The index metric must be set to dotproduct to support sparse values (see the index-creation sketch after this list).
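For completeness, here is a hedged sketch of creating a serverless Pinecone index with the dotproduct metric, which is the prerequisite for upserting sparse_values. The index name, dimension, cloud, and region are placeholders.

# Illustrative only: a Pinecone serverless index that can hold dense + sparse vectors.
# Name, dimension, cloud and region are placeholders.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
pc.create_index(
    name="tiledesk-hybrid-demo",
    dimension=1024,              # must match the dense embedding size (e.g. 1024 for bge-m3)
    metric="dotproduct",         # required to store and score sparse_values
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)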
The upsert_vector_store_hybrid function prepares each data point as a dictionary containing the dense vector (values) and the sparse vector (sparse_values).
### pinecone_repository_serverless.py
async def add_item_hybrid(self, item, embedding_obj=None, embedding_dimension=None):
    # ... (document loading and chunking) ...
    contents = [chunk.page_content for chunk in chunks]
    # Initialize the sparse encoder based on user's choice
    sparse_encoder = TiledeskSparseEncoders(item.sparse_encoder)
    doc_sparse_vectors = sparse_encoder.encode_documents(contents, batch_size=item.hybrid_batch_size)
    # Upsert data to the index
    async with vector_store.async_index as indice:
        await self.upsert_vector_store_hybrid(indice,
                                              contents,
                                              chunks,
                                              item.id,
                                              namespace=item.namespace,
                                              engine=item.engine,
                                              embeddings=embedding_obj,
                                              sparse_vectors=doc_sparse_vectors)
    # ...

@staticmethod
async def upsert_vector_store_hybrid(indice, contents, chunks, metadata_id, engine, namespace, embeddings, sparse_vectors):
    # ... (batching logic) ...
    for i in range(0, len(contents), embedding_chunk_size):
        # ... (prepare batches) ...
        embedding_values = await embeddings.aembed_documents(chunk_texts)
        sparse_values = sparse_vectors[i: i + embedding_chunk_size]
        vector_tuples = [
            {'id': idr,
             'values': embedding,
             'metadata': chunk,
             'sparse_values': sparse_value}
            for idr, embedding, chunk, sparse_value in
            zip(chunk_ids, embedding_values, chunk_metadatas, sparse_values)
        ]
        resp = await indice.upsert(vectors=vector_tuples, namespace=namespace)
Indexing with Qdrant
Qdrant has native support for multiple named vectors per point, which makes storing dense and sparse vectors for hybrid search very intuitive.
When creating a Qdrant collection, we define separate configurations for dense and sparse vectors, named text-dense and text-sparse respectively.
### qdrant_repository_local.py
@staticmethod
async def create_index(engine, embeddings, emb_dimension) -> QdrantVectorStore:
    # ... (client setup) ...
    client.create_collection(
        collection_name=collection_name,
        vectors_config={
            "text-dense": models.VectorParams(
                size=emb_dimension,
                distance=metric_distance
            )
        },
        sparse_vectors_config={
            "text-sparse": models.SparseVectorParams(
                index=models.SparseIndexParams(on_disk=True)
            )
        }
    )
    # ...
The upsert_vector_store_hybrid function for Qdrant then creates PointStruct objects where the vector field is a dictionary containing both the dense and sparse representations.
### qdrant_repository_local.py
@staticmethod
async def upsert_vector_store_hybrid(vector_store: QdrantVectorStore, contents, chunks, metadata_id, engine, namespace, embeddings, sparse_vectors):
    # ... (prepare batches and generate vectors) ...
    resp = vector_store.client.upsert(
        collection_name=engine.index_name,
        points=[
            models.PointStruct(
                id=idr,
                vector={
                    "text-dense": embedding,
                    "text-sparse": sparse_value
                },
                payload={"metadata": chunk, "page_content": page_content}
            )
            for idr, embedding, chunk, sparse_value, page_content in
            zip(chunk_ids, embedding_values, chunk_metadatas, sparse_values, chunk_texts)
        ]
    )
The Retrieval Process: Aggregating Scores
The retrieval stage is where the magic of hybrid search happens. After generating dense and sparse vectors for the user’s query, the system queries the vector store. The method for combining the scores from both search types is again specific to the chosen backend.
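Before hitting either backend, the query itself is encoded twice: once with the dense embedding model and once with the sparse encoder. The sketch below shows the general shape of that step; aembed_query comes from the LangChain embeddings interface, while encode_queries is an assumed counterpart of the encode_documents call shown earlier, not an excerpt from our codebase.

# Illustrative sketch of encoding a user query for hybrid retrieval.
# encode_queries is an assumed counterpart of encode_documents.
async def encode_query(question: str, embeddings, sparse_model: str = "bge-m3"):
    dense_vector = await embeddings.aembed_query(question)        # semantic representation
    sparse_encoder = TiledeskSparseEncoders(sparse_model)
    sparse_vector = sparse_encoder.encode_queries([question])[0]  # keyword representation
    return dense_vector, sparse_vector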
Retrieval with Pinecone: Linear Combination
With Pinecone, the weighting between dense and sparse results is applied on the client side: we scale the query vectors before sending them, so Pinecone's single dot-product score already reflects the combination. We use a weighted linear combination of the dense and sparse components, controlled by an alpha parameter:
- alpha = 1.0 for pure dense (semantic) search.
- alpha = 0.0 for pure sparse (keyword) search.
- 0 < alpha < 1 for a hybrid combination.
The hybrid_score_norm function implements this logic, which is then used in the perform_hybrid_search method before calling the Pinecone client.
### sparse_util.py
def hybrid_score_norm(dense, sparse, alpha: float):
    # Convex combination: alpha weights the dense part, (1 - alpha) the sparse part
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # Scale sparse vector values
    hs = {
        'indices': sparse['indices'],
        'values': [v * (1 - alpha) for v in sparse['values']]
    }
    # Scale dense vector values
    return [v * alpha for v in dense], hs
### pinecone_repository_serverless.py
async def perform_hybrid_search(self, question_answer, index, dense_vector, sparse_vector):
    # Normalize scores with the alpha parameter
    dense, sparse = hybrid_score_norm(dense_vector, sparse_vector, alpha=question_answer.alpha)
    results = await index.query(
        top_k=question_answer.top_k,
        vector=dense,
        sparse_vector=sparse,
        namespace=question_answer.namespace,
        include_metadata=True
    )
    return results
Retrieval with Qdrant: Reciprocal Rank Fusion (RRF)
Qdrant offers a more sophisticated, built-in method for combining search results: Reciprocal Rank Fusion (RRF). RRF is a powerful algorithm that combines results from multiple ranked lists without needing to normalize scores or tune weighting parameters like alpha. It evaluates results based on their rank in each list, giving higher importance to items that rank consistently high across different retrieval methods.
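As a quick illustration of the idea (this is not Qdrant's internal code), RRF gives each document a score of 1 / (k + rank) in every result list it appears in, where k is a smoothing constant, commonly 60, and then sums those contributions:

# Toy illustration of Reciprocal Rank Fusion; not Qdrant's internal implementation.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_A", "doc_B", "doc_C"]    # ranking from the dense (semantic) query
sparse_hits = ["doc_C", "doc_A", "doc_D"]   # ranking from the sparse (keyword) query
print(rrf([dense_hits, sparse_hits]))       # doc_A and doc_C rank highest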
Our Qdrant implementation leverages this by issuing a FusionQuery with the fusion mode set to RRF. We use prefetch to execute searches against both the text-dense and text-sparse indices in parallel, and Qdrant's engine handles the fusion automatically.
### qdrant_repository_local.py
async def perform_hybrid_search(self, question_answer, index, dense_vector, sparse_vector):
    # No manual score normalization is needed
    dense = dense_vector
    sparse = sparse_vector
    # ... (filter setup) ...
    search_result = index.query_points(
        collection_name=question_answer.engine.index_name,
        query=models.FusionQuery(
            fusion=models.Fusion.RRF  # Use Reciprocal Rank Fusion
        ),
        prefetch=[
            models.Prefetch(
                query=dense,
                using="text-dense"
            ),
            models.Prefetch(
                query=sparse,
                using="text-sparse"
            ),
        ],
        query_filter=filter_qdrant,
        limit=question_answer.top_k,
    ).points
    # ... (format results) ...
    return results
The Results
This hybrid search architecture provides a powerful and flexible foundation for our Knowledge base module. By abstracting the choice of AI models and vector databases, we empower users to build highly optimized and accurate retrieval pipelines tailored to their specific needs. The different strategies for indexing and retrieval in Pinecone and Qdrant highlight the importance of understanding the capabilities of the underlying vector store. While Pinecone’s linear combination offers simple, tunable control, Qdrant’s built-in Reciprocal Rank Fusion provides a sophisticated, parameter-free method for result aggregation. This modular design ensures Tiledesk’s RAG remains at the forefront of information retrieval technology.