
Better Term Centric Scoring In Elasticsearch

by Elasticsearch Support Services
Elasticsearch version 7.13 introduced a new query, combined_fields, that brings better term-centric scoring to relevance engineers. Under the hood it uses the new Lucene query CombinedFieldQuery (formerly known as BM25FQuery), which implements BM25F, a widely accepted extension of BM25 for weighted multi-field search. Before 7.13, the multi_match query with "type": "cross_fields" (referred to for the remainder of this post as cross_fields) was the best option in Elasticsearch. This post discusses term-centric versus field-centric scoring and does a bake-off between the scoring of the old (cross_fields) and the new (combined_fields).

Term-centric vs field-centric is important for scoring

Term-centric and field-centric are two alternative strategies for token-based scoring in ranking. In term-centric scoring the entire document is treated as one large field. This puts less importance on the sections within the document; the goal is better matching when tokens are spread out or repeated across multiple sections.

In field-centric scoring the original sections are scored independently, each section in its own field with its own term statistics. The goal here is to reflect the varying importance of different sections, but this can create unevenness because IDF can vary widely between fields.

The behaviour of the commonly used minimum_should_match setting illustrates the difference between the two approaches. With the setting "minimum_should_match": "100%" a field-centric query will require all tokens to match within a single field, whereas a term-centric query would be more relaxed, requiring only that all tokens appear in the document – and these tokens could be in different fields.
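As a rough sketch of that difference (the index and field names here are only illustrative), a field-centric multi_match of type best_fields would only match documents where every term occurs together in one of the listed fields, while the term-centric combined_fields query only requires every term to appear somewhere across those fields:

GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "captain marvel",
      "type": "best_fields",
      "fields": ["title", "overview"],
      "minimum_should_match": "100%"
    }
  }
}

GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "captain marvel",
      "fields": ["title", "overview"],
      "minimum_should_match": "100%"
    }
  }
}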

Old vs New in Elasticsearch (and Lucene)

In the old days (before v7.13) there was only one way to do term-centric scoring with field weighting: querying with multi_match { …, "type": "cross_fields" }, a.k.a. cross_fields. In Lucene the scoring for cross_fields was done by the BlendedTermQuery, which would mix the scores from individual fields based on user-supplied field weights.

As Elasticsearch expert Mark Harwood writes:

“Searching for Mark Harwood across firstname and lastname fields should certainly favour any firstname:Mark over a lastname:Mark. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.”
The cross_fields query would negate IDF for the most part, in order to ensure that scoring was similar across fields. Because this was originally conceived in the context of multi_match there was also a desire to reward the “correct” field. To achieve this, the scoring function would add 1 to the document frequency of the most frequent field. While this worked in practice, the scoring was confusing and not grounded in theory. Let’s consider some example queries:

cross_fields query

GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}

combined_fields query

GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ]
    }
  }
}

The syntax for combined_fields is similar, but the scoring is different and is done by the new Lucene CombinedFieldQuery, which implements BM25F. This is a variant of BM25 that adds the ability to weight individual fields. The field weights act by multiplying the raw term frequency of a field before the individual field statistics are combined into document-level statistics. This does two big things: it captures relative field importance and it establishes a more generalizable ranking formula than the one used by the cross_fields query.
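As a rough sketch (simplified, and glossing over details such as how Lucene merges per-field collection statistics), the BM25F idea can be written as:

\tilde{tf}(t, d) = \sum_{f} w_f \cdot tf(t, d_f), \qquad \tilde{dl}(d) = \sum_{f} w_f \cdot |d_f|

score(t, d) = IDF(t) \cdot \frac{\tilde{tf}(t, d)}{\tilde{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{\tilde{dl}(d)}{avgdl}\right)}

where w_f is the weight of field f (the ^3 and ^2 boosts in the queries above), tf(t, d_f) is the term frequency of term t in field f of document d, |d_f| is the length of that field, and k_1 and b are the usual BM25 parameters.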

An example query
Using a version of The Movie Database (TMDB) that we have in this Elasticsearch sandbox on Github, I want to show the difference between combined_fields and cross_fields.

_explain API

First let’s look at what the explain API tells us about queries for “Captain Marvel” in each case:

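For reference, a cross_fields explain request looks roughly like this (the document id is a placeholder); the combined_fields version simply swaps in the combined_fields query shown above:

GET tmdb/_explain/<doc_id>
{
  "query": {
    "multi_match": {
      "query": "Captain Marvel",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}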

In the _explain response of cross_fields, we can see that scoring is still done per field, per term, before it is rolled up by term. In combined_fields this doesn’t happen, because each term is scored just once against a synthetic field representing the combination of "title", "tagline" and "overview". This single per-term scoring against the synthetic field homogenizes the term statistics that may have varied drastically between fields with cross_fields.

First page of results

Next, I compare the first page of results (size: 30) as tables. I added the Jaccard set similarity to show how much overlap there is between the two result sets. A Jaccard similarity of 1.0 is perfect overlap, the same 30 items in both result sets. A Jaccard similarity of 0.0 is no overlap, so 60 different items between the two queries. Remember Jaccard similarity is set based and does not factor in position.
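For two result sets A and B, Jaccard similarity is defined as:

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}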

Jaccard similarity: 0.579

combined_fields

rank  score      title
1     20.483114  Captain Marvel
2     16.441910  Green Lantern: First Flight
3     15.406511  Jimmy Vestvood: Amerikan Hero
4     13.150019  Hulk
5     12.759342  The Man Who Killed Don Quixote
6     12.038399  Justice League: War
7     10.916338  Maverick
8     10.763498  The Extra Man
9     10.158279  Green Lantern: Emerald Knights
10    10.123980  Rambo
11    9.909670   The Odd Life of Timothy Green
12    9.797913   The Green Inferno
13    9.777215   Green Lantern
14    9.688647   The Green Berets
15    9.402530   Revenge of the Green Dragons
16    9.362038   The Punisher
17    9.341401   Green Book
18    9.081026   Green Street Hooligans 2
19    8.764002   Blinky Bill the Movie
20    8.744556   Chain Reaction
21    8.648538   Green Room
22    8.553925   How Green Was My Valley
23    8.370777   Fried Green Tomatoes
24    8.282112   Green Mansions
25    8.211758   Big Trouble in Little China
26    8.195307   The Green Mile
27    8.099191   Hardball
28    7.975277   Taxi
29    7.816612   Last Action Hero
30    7.787214   Green Zone
cross_fields

rank  score     title
1     31.48880  Captain Marvel
2     21.73690  Jimmy Vestvood: Amerikan Hero
3     18.18846  Green Lantern: First Flight
4     17.29990  Hulk
5     16.13631  Green Mansions
6     16.13631  The Green Berets
7     16.13631  Green Zone
8     16.13631  The Green Hornet
9     16.13631  Green Room
10    16.13631  Green Lantern
11    16.13631  The Green Mile
12    16.13631  Green Book
13    16.13631  The Green Inferno
14    15.88856  Heroes
15    15.88856  Hero
16    15.71288  Green Lantern: Emerald Knights
17    15.45730  Justice League: War
18    15.06839  Maverick
19    14.37306  Chain Reaction
20    13.89266  Blinky Bill the Movie
21    13.58447  The Extra Man
22    13.53983  The Punisher
23    13.48916  Revenge of the Green Dragons
24    13.48916  Fried Green Tomatoes
25    13.11556  The Odd Life of Timothy Green
26    12.77054  Hero Wanted
27    12.77054  Heroes for Sale
28    12.77054  Almost Heroes
29    12.77054  Everyone’s Hero
30    12.77054  Kelly’s Heroes

The Jaccard similarity of 0.579 highlights that a lot of different documents are being surfaced by the combined_fields query compared to cross_fields. In this example 22 results are shared between the two queries, while 16 (8 from each) are unique to one or the other. This doesn’t mean the differences are bad (or good), but it does mean there is some major churn in rankings between the two queries.
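As a quick check, those counts reproduce the reported similarity:

J = \frac{22}{30 + 30 - 22} = \frac{22}{38} \approx 0.579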

Another view of that same data, with a scatter plot, better shows the changes in position and score for individual movies. The x-axis is the score from the cross_fields query and the y-axis is the score from the combined_fields query. Each dot is a document, and the dot color represents the positional shift when switching from cross_fields to combined_fields. Documents that appear in only one of the two result sets are represented as a tick mark along the axis of the query that retrieved them.

The top several results are consistent, and the golden result “Hulk” is returned at position #4 in both queries. Note the score plateau in cross_fields at a score of 16.13631: all of those documents got identical scores, so their relative position in the final ranked list is decided by the order in which they were indexed. This arbitrary tie-breaking doesn’t happen in combined_fields because a single large field doesn’t produce the same plateau effect.

Visualizing search data like this is a great way to glean insights you might miss in bigger tables. Tables are great for inspecting individual records or comparing a handful of items, but graphics are a better form of communication when many data points are involved. Search is a “medium” data problem, with lots of queries and lots of results, so getting a good graphical grip on how it is performing will always help.

To the future with term-centric scoring

If you were using cross_fields, switching to combined_fields will shake up your results. But the benefits (general acceptance and scoring interpretability) of BM25F might make it worth it.

Besides the differences in scoring, introducing combined_fields clarifies the split between term-centric and field-centric in the Elasticsearch API. Now we have multi_match for field-centric and combined_fields for term-centric. Having a clear API is a big reason why I think Elasticsearch has been so successful, so I’m really happy to see this trend continue.

I’m also pleased to see the effort Elastic is committing to keeping Elasticsearch (and Lucene) current with the best methods from academic publications. HNSW approximate nearest neighbor search (vector search) is right around the corner for Lucene, and Elastic is active in that effort too.

Do join us in Relevance Slack and let me know your comments or feedback – and if we can help you with these tricky scoring issues on your Elasticsearch cluster, get in touch.
