Hybrid Search
Overview

In recent years, vector-based search has become incredibly popular. Fine-tuned Large Language Models have shown remarkable efficacy in encoding text into vectors that numerically express the semantics of a phrase. These vectors can be used to run a similarity search in the semantic space: a K-Nearest Neighbors search retrieves the documents/paragraphs closest to the query in an n-dimensional vector space.

Limitations

Nowadays, vector-based search is interesting but still has a few drawbacks:

  • It’s somewhat challenging to explain, for example, why document A is returned and why it appears at position K.
  • Since most vector embedding models are optimized for sentences of a specific length, they perform best on queries and documents of that length. For this reason, text chunking is highly advised.
  • Users still heavily rely on keyword searches, and vector search does not guarantee exact keyword matching.

Though it’s realistic to assume that these restrictions may be overcome in the future, one approach that is quickly gaining traction attempts to lessen them right now by combining traditional keyword-based search with vector-based search.
What does it mean, then, to mix these two methods together?
Initially, two sets of candidates are retrieved:

  • The first result set is derived from lexical matches on the query terms
  • The second result set is obtained from a K-Nearest Neighbors search using the query vector

After that, a ranking that maximizes the relevance for the user query must be created by combining these results.

To get the system ready for Hybrid search, Solr’s schema needs to be configured with traditional fields and solr.DenseVectorField.
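Although the exact definitions depend on your data and embedding model, a minimal sketch of the two kinds of fields might look like this (the field names, vectorDimension, and similarityFunction values are illustrative; the dimension must match the embedding model you use):

<!-- Traditional lexical field for keyword search -->
<field name="field_name" type="text_general" indexed="true" stored="true"/>

<!-- Dense vector field type and field for K-Nearest Neighbors search -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="384" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>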

Retrieval Step

The Boolean Query Parser, one of the important query parsers, can help retrieve the two sets of results.

This query parser combines queries through Boolean clauses, each of which has its own query parser and can represent a distinct query type.
We can use a variety of Boolean conditions with the Boolean Query Parser to create a flexible hybrid candidate result set.

Union

q = {!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery = {!type=edismax qf=field_name}search_keyword&
vectorQuery = {!knn f=vector topK=10}[0.081, -0.382, -0.190, …]

The hybrid candidate result set is built by combining the results of the two models: the matches from the lexical (keyword-based) search and the top-K results from the K-Nearest Neighbors search.
The total result set’s cardinality is <= (K + NumFound).
There will be no duplicates in the result set.
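For illustration, here is a minimal Python sketch that assembles these request parameters programmatically before sending them to Solr (the field names qf=field_name and f=vector mirror the placeholders in the example above; issuing the HTTP request to a live Solr instance is left out):

```python
# Assemble Solr request parameters for the hybrid "union" query.
# Field names ("field_name", "vector") are placeholders and must match
# your schema; the query vector would come from your embedding model.
def hybrid_union_params(keywords, query_vector, top_k=10):
    vector_literal = "[" + ", ".join(f"{v:.3f}" for v in query_vector) + "]"
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=field_name}" + keywords,
        "vectorQuery": f"{{!knn f=vector topK={top_k}}}{vector_literal}",
    }

params = hybrid_union_params("search_keyword", [0.081, -0.382, -0.190])
print(params["vectorQuery"])  # {!knn f=vector topK=10}[0.081, -0.382, -0.190]
```

Switching the should clauses to must in the q template yields the intersection variant described in the next section.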

Intersection

q = {!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery = {!type=edismax qf=field_name}search_keyword&
vectorQuery = {!knn f=vector topK=10}[0.081, -0.382, -0.190, …]

The hybrid candidate result set is formed by intersecting the results of the two models: only the top-K K-Nearest Neighbors results that also match the lexical query are returned.
The merged result set has cardinality <= K.
Let’s examine the implications of this, since it effectively post-filters the K-Nearest Neighbors results while influencing the score.

Bonus Point: The Pre-Filtering and Post-Filtering Process

When a filter query (fq) was added to a K-Nearest Neighbors vector search in Apache Solr versions earlier than 9.1, it was applied as a post-filter: only the subset of the K-Nearest Neighbors search results that matched the filter was returned (with a cardinality <= K).

As of Solr version 9.1, filter queries are pre-filters by default, which means the filtering condition is enforced by Apache Solr throughout the K-Nearest Neighbors retrieval process.
You will therefore receive K results if at least K documents meet the criteria.

You can still execute filter queries as post-filters by specifying the filter cost, in the same fashion you would do post-filtering with any other Solr query, using the cache local parameter.
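As a sketch (field names and the vector are placeholders, and the exact local parameters should be checked against the Solr Reference Guide for your version), a filter query can be demoted to a post-filter with the cache and cost local parameters, provided the wrapped query supports post-filtering:

q = {!knn f=vector topK=10}[0.081, -0.382, -0.190, …]&
fq = {!type=edismax qf=field_name cache=false cost=101}search_keyword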

We advise choosing between the two methods based on whether you want the lexical score to influence the vector-based score (and thus the ranking): post-filtering will prevent you from computing the lexical score for the filter clause.

Ranking Step

After the hybrid candidate set is obtained, we want to compute a score for every document that represents the most appropriate ordering for the user query.

Out of the box, the K-Nearest Neighbors search gives us a score ranging from -1 to 1, which would then be summed with an unbounded score from the lexical side (potentially well above that scale).

Truth be told, there is no simple way to combine the two scores that guarantees the best overall ranking, but let’s look at what Apache Solr has to offer.

Sum Normalised Scores

q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}sum(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field_name}search_keyword&
vectorQuery = {!knn f=vector topK=10}[0.081, -0.382, -0.190, …]

The filter part constructs the hybrid result set without taking into account any scoring. 
The score is assigned by the must clause using the relevant function query.
The final score is the sum of the K-Nearest Neighbors score and the lexical score, which is min-max normalised to fall between 0 and 1.
This straightforward linear combination of the scores can be a useful place to start.
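To make the arithmetic concrete, here is a small Python sketch of what this combination does (all scores are made up for illustration; in practice Solr’s scale() function performs the min-max normalisation server-side):

```python
# Min-max normalise unbounded lexical scores to [0, 1], then add the
# (already bounded) KNN similarity, mimicking the sum(...) function query.
def min_max_scale(scores, lo=0.0, hi=1.0):
    mn, mx = min(scores), max(scores)
    return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in scores]

lexical_scores = [2.1, 7.4, 12.9]   # unbounded BM25-style scores (made up)
knn_scores = [0.62, 0.55, 0.71]     # cosine similarities in [-1, 1] (made up)

combined = [l + k for l, k in zip(min_max_scale(lexical_scores), knn_scores)]
print(combined)  # each document now gets a blended score on a comparable scale
```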

Multiply Normalised Scores

q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}product(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0.1,1)&
lexicalQuery = {!type=edismax qf=field_name}search_keyword&
vectorQuery = {!knn f=vector topK=10}[0.081, -0.382, -0.190, …]

The filter part constructs the hybrid result set without taking into account any scoring.
The score is assigned by the must clause using the relevant function query.
The final score is the product of the K-Nearest Neighbors score and the lexical score, which is min-max normalised to fall between 0.1 and 1 (the 0.1 lower bound prevents the lowest lexical score from zeroing out the product entirely).

No evidence suggests that this should produce better results than the simple sum, and it’s always advisable to construct a prototype and validate your hypotheses using actual queries and datasets.

Learning To Rank

There is no simple way to determine the mathematical function to apply when combining features in hybrid search or any other search situation where you need to combine various elements (features) to generate the final ranking.

Sum? Normalised Sum? Product? Linear function? Non-linear Function?

One can opt to use machine learning as the solution to this issue.

Apache Solr has supported learning to rank since version 6.4, and since version 9.3 it includes basic support for vector similarity function queries, which can be used as features in a learning-to-rank model. The idea is to build a training set outside of Solr, defining for each query-document pair a feature vector and a rating that indicates the document’s relevance to the query.
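As a sketch, the two hybrid signals could be uploaded to the learning-to-rank feature store roughly as follows (the feature names and query templates are illustrative; the second feature assumes the vector similarity function query introduced in Solr 9.3, and its exact syntax should be checked against the reference guide):

[
  {
    "name": "lexicalScore",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!type=edismax qf=field_name}${user_query}" }
  },
  {
    "name": "vectorSimilarity",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!func}vectorSimilarity(FLOAT32, COSINE, vector, ${user_vector})" }
  }
]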

We will cover learning to rank in more detail in a future blog post. Stay tuned!
