Hybrid Search #
Hybrid search is a powerful way to combine different search methods to get the best results. This is commonly used to combine vector and text search results to get the best of both worlds.
Here are some very useful examples of hybrid search to overcome the limitations of vector search methods:
- terminology: Most domains have specific terms that are not present in the training data. By combining text search with vector search, you can find points that contain the specific terms.
- abbreviations: Similar to terminology, abbreviations are often not present in the training data. If a user searches for B2B, the neural network might not have seen this before. So combining text search with vector search can help a lot.
- outliers: Sometimes the vector search might not find the best results because the documents are all similar to each other. Text search can pick out differentiating keywords to provide more context to an AI agent.
Composite Query #
To create a hybrid query, we need to combine two or more ranking search results. Currently, in SemaDB these are vector or text search results. By ranking, we mean that the search results are ranked accordingly to their relevance to the query, i.e. distance for vector search and score for text search.
We create a hybrid query using the _or
or _and
special query type. By doing so, we might get overlapping results from the two search methods. The results are then combined and ranked using a hybrid score:
{
"query": {
"property": "_or",
"_or": [
{
"property": "productEmbedding",
"vectorVamana": {
"vector": [1, 2],
"operator": "near",
"searchSize": 75,
"limit": 10,
"weight": 0.2
}
},
{
"property": "description",
"text": {
"value": "summer floral",
"operator": "containsAny",
"limit": 10,
"weight": 0.5
}
},
{
"property": "title",
"text": {
"value": "maxi dress",
"operator": "containsAll",
"limit": 10,
"weight": 0.3
}
},
]
},
"select": ["title", "price"],
"limit": 10
}
Hybrid Score #
In the above query, we combine the results of a vector search on productEmbedding
with text searches on description
and title
. The weight
parameter is used to adjust the importance of each search method and defines the initial hybrid score value as:
hybridScore = weight * score
for score based indices such as text search.hybridScore = weight * distance * -1
for distance based indices such as vector search. The distance is negated to ensure lower distances yield higher scores.
The weight is optional and will be set to 1 if not provided. The above query might yield a result containing a point:
{
"points": [
{
"_distance": 8,
"_hybridScore": -1.6802747,
"_id": "9ce6678a-8cb8-4d7d-a367-a9e0a1bbec2e",
"_score": -0.10034334,
"title": "Maxi Dress",
"price": 49.99
}
// ...
]
}
where the hybridScore
is the sum of the weighted scores of the individual search methods if they yield overlapping documents. That is, multiple search methods return the same documents. If a document appears in a single search result, then its hybrid score is carried over. The distance
and score
fields are the raw values from the vector and text search results, respectively.
What happens if there are multiple score or distance results? In this case, the last search that yields the distance or score will be returned in the final result. In the above case, both title
and description
are text searches, so the score from title
will be set as _score
but this does not affect the hybridScore
calculation.