Elasticsearch version 7.13 introduced a new query, combined_fields, that brings better term-centric scoring to relevance engineers. Under the hood it uses the new Lucene query, CombinedFieldsQuery (formerly known as BM25FQuery), which implements BM25F, a widely accepted extension of BM25 for multi-field search with weighting. Before 7.13, the multi_match query with "type": "cross_fields" (which for the remainder of this post will be referenced as cross_fields) was the best option in Elasticsearch. This post discusses term-centric versus field-centric scoring and does a bake-off between the scoring of the old (cross_fields) and the new (combined_fields).
Term vs Field-centric is important for scoring
Term-centric and field-centric are two alternative strategies for token-based scoring for ranking. In term-centric scoring the entire document is treated as one large field. This means putting less importance on the sections within the document, the goal being better matching when tokens are spread out or repeated across multiple sections.
In field-centric scoring the original sections are scored independently, each section in its own index with its own term statistics. The goal here is to reflect the varying importance of different sections, but this can create unevenness as IDF can vary widely between fields.
The behaviour of the commonly used minimum_should_match setting illustrates the difference between the two approaches. With "minimum_should_match": "100%", a field-centric query will require all tokens to match within a single field, whereas a term-centric query is more relaxed, requiring only that all tokens appear somewhere in the document, possibly spread across different fields.
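As a sketch of the two behaviours (the index, field names and query string here are just illustrative, and most_fields stands in for a field-centric multi_match), the first request only matches documents where a single field contains every token, while the second matches as long as every token appears somewhere across the listed fields.
Field-centric, with minimum_should_match applied per field:
GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "type": "most_fields",
      "fields": ["title", "overview"],
      "minimum_should_match": "100%"
    }
  }
}
Term-centric, with minimum_should_match applied across fields:
GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": ["title", "overview"],
      "minimum_should_match": "100%"
    }
  }
}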
Old vs New in Elasticsearch (and Lucene)
In the old days (before v7.13) there was only one way to do term-centric scoring with field weighting: querying with multi_match and "type": "cross_fields", a.k.a. cross_fields. In Lucene the scoring for cross_fields was done by the BlendedTermQuery, which would blend the scores from individual fields based on user-supplied field weights.
As Elasticsearch expert Mark Harwood writes:
“Searching for Mark Harwood across firstname and lastname fields should certainly favour any firstname:Mark over a lastname:Mark. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.”
The cross_fields query would negate IDF for the most part, in order to ensure that scoring was similar across fields. Because this was originally conceived in the context of multi_match there was also a desire to reward the “correct” field. To achieve this, the scoring function would add 1 to the document frequency of the most frequent field. While this worked in practice, the scoring was confusing and not grounded in theory. Let’s consider some example queries:
cross_fields query
GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}
combined_fields query
GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ]
    }
  }
}
The syntax for combined_fields is similar, but the scoring is different: it is done by the new Lucene CombinedFieldsQuery, which implements BM25F. This is a variant of BM25 that adds the ability to weight individual fields. The field weights act by multiplying the raw term frequency of a field before the individual field statistics are combined into document-level statistics. This does two big things: it captures relative field importance and it establishes a more generalizable formula for ranking than the one used by the cross_fields query.
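For intuition, here is one common simplified presentation of BM25F (my notation, not taken from the Lucene source, and glossing over details such as per-field length normalization): the field weights $w_f$ fold the per-field term frequencies into a single pseudo-field frequency, and the familiar BM25 saturation and IDF are then applied once per term at the document level:

$$\widetilde{tf}(t,d) = \sum_{f} w_f \cdot tf(t,f,d)$$

$$score(q,d) = \sum_{t \in q} IDF(t) \cdot \frac{(k_1 + 1) \cdot \widetilde{tf}(t,d)}{\widetilde{tf}(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{\widetilde{dl}(d)}{\widetilde{avgdl}}\right)}$$

Here $\widetilde{dl}(d)$ and $\widetilde{avgdl}$ are the correspondingly weighted document length and average document length. The key contrast with cross_fields is that term frequencies are combined before saturation, and term statistics are merged as if everything had been indexed into one combined field.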
An example query
Using a version of The Movie Database (TMDB) that we have in this Elasticsearch sandbox on Github, I want to show the difference between combined_fields and cross_fields.
_explain API
First let’s look at what the explain API tells us about queries for “Captain Marvel” in each case:
cross_fields
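As a sketch of the request (the document ID is a placeholder; substitute the _id of whichever movie you want explained):
GET tmdb/_explain/<doc_id>
{
  "query": {
    "multi_match": {
      "query": "Captain Marvel",
      "fields": ["title^3", "overview^2", "tagline"],
      "type": "cross_fields"
    }
  }
}
The combined_fields version of the request is identical except that the query clause above is swapped for the combined_fields query shown earlier.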
In the _explain response of cross_fields, we can see that scoring is still done field by field for each term before it is rolled up per term. With combined_fields this doesn't happen, because each term is scored just once against the synthetic field representing a combination of title, tagline and overview. The single per-term scoring against the synthetic field homogenizes the term statistics that may have varied drastically between fields with cross_fields.
First page of results
Next, I compare the first page of results (size: 30) as tables. I added the Jaccard set similarity to show how much overlap there is between the two result sets. A Jaccard similarity of 1.0 is perfect overlap, the same 30 items in both result sets; a Jaccard similarity of 0.0 is no overlap, meaning 60 different items across the two queries. Remember that Jaccard similarity is set-based and does not factor in position.
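For reference, the Jaccard similarity of two result sets $A$ and $B$ is the size of their intersection divided by the size of their union:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

Counting the tables below, 22 titles appear in both 30-item lists, so $J = 22 / (30 + 30 - 22) = 22/38 \approx 0.579$.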
Jaccard similarity: 0.579
combined_fields

score | title | rank
---|---|---
20.483114 | Captain Marvel | 1 |
16.441910 | Green Lantern: First Flight | 2 |
15.406511 | Jimmy Vestvood: Amerikan Hero | 3 |
13.150019 | Hulk | 4 |
12.759342 | The Man Who Killed Don Quixote | 5 |
12.038399 | Justice League: War | 6 |
10.916338 | Maverick | 7 |
10.763498 | The Extra Man | 8 |
10.158279 | Green Lantern: Emerald Knights | 9 |
10.123980 | Rambo | 10 |
9.909670 | The Odd Life of Timothy Green | 11 |
9.797913 | The Green Inferno | 12 |
9.777215 | Green Lantern | 13 |
9.688647 | The Green Berets | 14 |
9.402530 | Revenge of the Green Dragons | 15 |
9.362038 | The Punisher | 16 |
9.341401 | Green Book | 17 |
9.081026 | Green Street Hooligans 2 | 18 |
8.764002 | Blinky Bill the Movie | 19 |
8.744556 | Chain Reaction | 20 |
8.648538 | Green Room | 21 |
8.553925 | How Green Was My Valley | 22 |
8.370777 | Fried Green Tomatoes | 23 |
8.282112 | Green Mansions | 24 |
8.211758 | Big Trouble in Little China | 25 |
8.195307 | The Green Mile | 26 |
8.099191 | Hardball | 27 |
7.975277 | Taxi | 28 |
7.816612 | Last Action Hero | 29 |
7.787214 | Green Zone | 30 |
cross_fields

title | score
---|---
Captain Marvel | 31.48880 |
Jimmy Vestvood: Amerikan Hero | 21.73690 |
Green Lantern: First Flight | 18.18846 |
Hulk | 17.29990 |
Green Mansions | 16.13631 |
The Green Berets | 16.13631 |
Green Zone | 16.13631 |
The Green Hornet | 16.13631 |
Green Room | 16.13631 |
Green Lantern | 16.13631 |
The Green Mile | 16.13631 |
Green Book | 16.13631 |
The Green Inferno | 16.13631 |
Heroes | 15.88856 |
Hero | 15.88856 |
Green Lantern: Emerald Knights | 15.71288 |
Justice League: War | 15.45730 |
Maverick | 15.06839 |
Chain Reaction | 14.37306 |
Blinky Bill the Movie | 13.89266 |
The Extra Man | 13.58447 |
The Punisher | 13.53983 |
Revenge of the Green Dragons | 13.48916 |
Fried Green Tomatoes | 13.48916 |
The Odd Life of Timothy Green | 13.11556 |
Hero Wanted | 12.77054 |
Heroes for Sale | 12.77054 |
Almost Heroes | 12.77054 |
Everyone’s Hero | 12.77054 |
Kelly’s Heroes | 12.77054 |
The Jaccard similarity of 0.579 highlights that a lot of different documents are being surfaced by the combined_fields query compared to cross_fields. In this example 22 titles appear in both result sets, while 16 appear in only one or the other. This doesn't mean the differences are bad (or good), but it does mean there is some major churn in rankings between the two queries.
Another view of that same data, with a scatter plot, better shows the changes in position and scores for individual movies. The x-axis is the score from the cross_fields query and the y-axis is the score from the combined_fields query. Each dot is a document and the dot color represents the positional shift switching from cross_fields to combined_fields. Some documents were not included in the results for both queries, so they are represented as a tick mark along the axis where they were retrieved.
The top several results are consistent and the golden result “Hulk” is returned in position #4 for both queries. Note the score plateau in cross_fields at a score of 16.13. All of those documents got identical scores, so their relative position in the final ranked list is decided by the order they were indexed. This arbitrary tie-breaking doesn’t happen in combined_fields because there isn’t the same plateau effect with a single large field.
Visualizing search data like this is a great way to glean insights you might miss in bigger tables. Tables are great for inspecting individual records or comparing a handful of items, but graphics are a better form of communication when many data points are involved. Search is a "medium" data problem, with lots of queries and lots of results, so getting a good graphical grip on how it is performing will always help.
To the future with term-centric scoring
If you were using cross_fields, switching to combined_fields will shake up your results. But the benefits (general acceptance and scoring interpretability) of BM25F might make it worth it.
Besides differences in scoring, introducing combined_fields clarifies the split between term-centric and field-centric scoring in the Elasticsearch API. Now we have multi_match for field-centric and combined_fields for term-centric. Having a clear API is a big reason why I think Elasticsearch has been so successful, so I'm really happy to see this trend continue.
I'm also pleased to see the effort Elastic is committing to keeping Elasticsearch (and Lucene) current with the best methods from academic publications. HNSW approximate nearest neighbor search (vector search) is right around the corner for Lucene, and Elastic is active in that effort too.
Do join us in Relevance Slack and let me know your comments or feedback – and if we can help you with these tricky scoring issues on your Elasticsearch cluster, get in touch.