Leveraging Text Tagger in Solr for Improved Search Accuracy

Text Tagger is a useful feature, available in Solr since 7.x, for improving the relevancy of results. It identifies tags in search queries, such as product name, brand, color, or location, which can then be used to refine the query and produce more relevant results. For example, if a user searches for “black shoes for men”, Text Tagger can identify ‘black’ as a color, ‘shoes’ as a product type, and ‘men’ as a gender. This information can be used to refine the search query by applying filters, a boost query, or other options, depending on business requirements, giving more relevant results and a better user experience.

Given a dictionary (a Solr index) with a name-like field, you can post text (the search term, in the case of a query) to this request handler, and it will return every occurrence of one of those names, along with offsets and any other document metadata you ask for. It is used for named entity recognition (NER).
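To make the idea of names and offsets concrete, here is a minimal pure-Python sketch of dictionary tagging. It is only an illustration, not how Solr implements the tagger, and the DICTIONARY contents and the naive_tag function are hypothetical.

```python
import re

# Hypothetical dictionary: lowercased name -> entity type.
DICTIONARY = {
    "black": "color",
    "shoes": "product_type",
    "men": "gender",
    "running shoes": "product_type",
}


def naive_tag(text, dictionary=DICTIONARY, max_words=3):
    """Return every dictionary name found in text, with character offsets."""
    # Each token keeps its character span so offsets can be reported.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    tags = []
    for i in range(len(tokens)):
        # Try phrases of 1..max_words tokens starting at token i.
        for j in range(i, min(i + max_words, len(tokens))):
            phrase = " ".join(tok for tok, _, _ in tokens[i:j + 1])
            if phrase in dictionary:
                tags.append({
                    "startOffset": tokens[i][1],
                    "endOffset": tokens[j][2],
                    "name": phrase,
                    "type": dictionary[phrase],
                })
    return tags


tags = naive_tag("black shoes for men")
overlapping = naive_tag("Black running shoes")
```

The second call returns both “running shoes” and the “shoes” contained inside it. In Solr, the overlaps parameter of the tagger decides which of such overlapping tags are kept; NO_SUB, for instance, drops a tag that lies completely within another tag.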

The tagger doesn’t do any natural language processing (NLP), so it’s considered a “naive tagger”. It is still useful as-is, and a more complete NER or ERD (entity recognition and disambiguation) system can be built with it as a key component. The SolrTextTagger can be applied to queries for query understanding, or to large documents, depending on the business use case.

Implementation

As a best practice, start by looking at the available data on what users are searching for in the system. A close analysis will reveal the taxonomies, such as color, brand, name, category, and model, that the majority of queries use. There are two options: either use an existing collection for querying with the text tagger, or create a separate index that holds all of these taxonomy entries along with their unique identifiers.
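For the separate-index option, the taxonomy lists need to be turned into one document per entry. The sketch below shows one way to do that. The build_taxonomy_docs function, the sample values, and the id scheme are all hypothetical; the field names id, name, and type match the schema used in the steps that follow.

```python
import json


def build_taxonomy_docs(taxonomies):
    """Flatten {type: [names]} into one Solr document per taxonomy entry."""
    docs = []
    for entry_type in sorted(taxonomies):
        for name in taxonomies[entry_type]:
            # Hypothetical id scheme "<type>-<slug>", so that reloading the
            # same taxonomy overwrites entries instead of duplicating them.
            slug = "-".join(name.lower().split())
            docs.append({
                "id": f"{entry_type}-{slug}",
                "name": name,
                "type": entry_type,
            })
    return docs


taxonomies = {"brand": ["Samsung", "Apple"], "color": ["Black", "Navy Blue"]}
docs = build_taxonomy_docs(taxonomies)
# JSON array that could be posted to the collection's update handler.
payload = json.dumps(docs)
```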

Step 1 – Configure fields

The critical part up front is to define the “tag” field type. There are many ways to configure text analysis, and we are not going to get into those choices here. One important piece is the ConcatenateGraphFilterFactory at the end of the index analyzer chain. Another, for performance, is postingsFormat=FST50, which results in a compact FST-based in-memory data structure that is especially beneficial for the text tagger.
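As a rough illustration of what the index analyzer chain does to a name, the following sketch approximates the filters in pure Python. It is a simplification and not Solr's implementation: the tokenizer regex, the accent stripping, and the separator character (assumed here to be U+001F) are assumptions made for the example, and approx_tag_term is a hypothetical helper.

```python
import re
import unicodedata

# Assumed separator placed between tokens when they are concatenated.
SEPARATOR = "\u001f"


def approx_tag_term(name):
    """Approximate the single indexed term produced for a dictionary name."""
    terms = []
    # Rough stand-in for the standard tokenizer: words, keeping apostrophes.
    for token in re.findall(r"[\w']+", name):
        # English possessive filter: drop a trailing 's.
        if token[-2:].lower() == "'s":
            token = token[:-2]
        # ASCII folding (approximation): strip combining accent marks.
        decomposed = unicodedata.normalize("NFKD", token)
        token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        # Lowercase filter.
        token = token.lower()
        if token:
            terms.append(token)
    # Concatenate step: all tokens of the name become one term.
    return SEPARATOR.join(terms)


term = approx_tag_term("Levi's Jeans")
```

The point of the final step is that a multi-word name is indexed as a single term, which is what lets the tagger match whole names against the incoming text.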

curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/taxonomies/schema -d '{
     "add-field-type":{
          "name":"tag",
          "class":"solr.TextField",
          "postingsFormat":"FST50",
          "omitNorms":true,
          "omitTermFreqAndPositions":true,
          "indexAnalyzer":{
               "tokenizer":{"class":"solr.StandardTokenizerFactory" },
               "filters":[
                    {"class":"solr.EnglishPossessiveFilterFactory"},
                    {"class":"solr.ASCIIFoldingFilterFactory"},
                    {"class":"solr.LowerCaseFilterFactory"},
                    {"class":"solr.ConcatenateGraphFilterFactory", "preservePositionIncrements":false }
               ]},
          "queryAnalyzer":{
               "tokenizer":{"class":"solr.StandardTokenizerFactory" },
               "filters":[
                    {"class":"solr.EnglishPossessiveFilterFactory"},
                    {"class":"solr.ASCIIFoldingFilterFactory"},
                    {"class":"solr.LowerCaseFilterFactory"}
               ]}
     },
     "add-field":{"name":"name", "type":"text_general"},
     "add-field":{"name":"name_tag", "type":"tag", "stored":false },
     "add-field":{"name":"type", "type":"string"},
     "add-copy-field":{"source":"name", "dest":["name_tag"]}
}'

Step 2 – Custom Request Handler

curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/taxonomies/config -d '{
     "add-requesthandler" : {
          "name": "/tag",
          "class":"solr.TaggerRequestHandler",
          "defaults":{"field":"name_tag"}
     }
}'

Example document (name_tag is not supplied here, because the copy-field rule from Step 1 populates it from name):

{
     "id": "123",
     "name": "Samsung",
     "type": "brand"
}

Sample query:

curl -X POST -H 'Content-Type:text/plain' 'http://localhost:8983/solr/taxonomies/tag?overlaps=NO_SUB&fl=id,name,type&wt=json&indent=on' -d 'Samsung mobile 5g'
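In an application, the same call would usually be made from code rather than curl. The following is a minimal sketch using only the Python standard library; it assumes the host, port, and taxonomies collection from the command above, and the function names build_tag_request and tag_text are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_tag_request(text, base_url="http://localhost:8983/solr",
                      collection="taxonomies"):
    """Build the POST request for the /tag handler without sending it."""
    params = urllib.parse.urlencode({
        "overlaps": "NO_SUB",
        "fl": "id,name,type",
        "wt": "json",
    })
    return urllib.request.Request(
        url=f"{base_url}/{collection}/tag?{params}",
        data=text.encode("utf-8"),  # the raw text to tag is the request body
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def tag_text(text, timeout=2.0):
    """Send the text to the tagger and return the decoded JSON response."""
    with urllib.request.urlopen(build_tag_request(text), timeout=timeout) as resp:
        return json.load(resp)


request = build_tag_request("Samsung mobile 5g")
```

Building the request separately from sending it keeps the URL and body easy to inspect and unit test without a running Solr instance. Only tag_text needs the server.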

Sample response:

"response":{"numFound":1,"start":0,"docs":[
     {
     "id":"123",
     "name":["Samsung"],
     "type":"brand"
     }]
}
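Once the response is parsed as JSON, the matched documents can be grouped by taxonomy type for later use. The sketch below reads only the response.docs part shown above; entities_by_type is a hypothetical helper, and it assumes each document carries the name and type fields requested through fl.

```python
from collections import defaultdict


def entities_by_type(tagger_response):
    """Group the names of matched taxonomy documents by their type."""
    grouped = defaultdict(list)
    for doc in tagger_response.get("response", {}).get("docs", []):
        names = doc.get("name", [])
        # name comes back as a list in the sample; accept a plain string too.
        if isinstance(names, str):
            names = [names]
        grouped[doc.get("type", "unknown")].extend(names)
    return dict(grouped)


sample = {
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "123", "name": ["Samsung"], "type": "brand"}],
    }
}
entities = entities_by_type(sample)
```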

Every search query passes through the text tagger in the query pipeline, which checks it for tags. If tags match entries in the collection, the query is then either filtered by them or boosted on them, based on guidance from the business team. This adds one extra call to Solr during the search flow, but the impact on performance should be negligible.
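The filter-or-boost decision can be sketched as a small function that turns the detected entities into Solr query parameters. This is an illustration under assumptions: it presumes the product collection has fields named after the taxonomy types (for example, brand), the boost value is arbitrary, boost mode assumes the edismax query parser, and refine_params is a hypothetical name.

```python
def refine_params(user_query, entities, mode="filter", boost=5):
    """Turn detected entities into Solr parameters that filter or boost."""
    clauses = []
    for field in sorted(entities):
        for value in entities[field]:
            # Escape backslashes and quotes so the value is a safe phrase.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            clauses.append(f'{field}:"{escaped}"')

    params = {"q": user_query}
    if mode == "filter":
        # fq restricts results to documents that match every clause.
        params["fq"] = clauses
    elif mode == "boost":
        # bq raises matching documents in the ranking without excluding others.
        params["defType"] = "edismax"
        params["bq"] = [f"{clause}^{boost}" for clause in clauses]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return params


filtered = refine_params("Samsung mobile 5g", {"brand": ["Samsung"]})
boosted = refine_params("Samsung mobile 5g", {"brand": ["Samsung"]}, mode="boost")
```

Filtering gives the tightest result set but returns nothing if the tag was wrong, while boosting is more forgiving, which is why the choice is best left to the business requirement.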

This improves the customers’ search experience by reducing noise in the results and providing more relevant outcomes, which should eventually improve the conversion rate. With the Solr text tagger, tagging is mainly a lookup, so new entries can be added to the taxonomy list whenever required, further improving the search results.
