If Solr returns no results for a query, the system should be able to suggest corrections in case the query was misspelled. Suggestions can be generated from terms in any of three sources: a field in the same Solr collection that the query runs against, an externally created text file, or fields in another Solr collection.
To configure the SpellCheck component in Solr, you must define the source of terms in solrconfig.xml. Three approaches to configuring spell checking are covered here:
- IndexBasedSpellChecker
- DirectSolrSpellChecker
- FileBasedSpellChecker
IndexBasedSpellChecker
This uses a Solr index as the basis for generating a parallel index used for spell checking. It requires defining a field as the basis for the index terms; a common practice is to copy terms from a combination of a few fields (such as title, body, etc.) into another field created for spell checking. Here is a simple example of configuring solrconfig.xml with the IndexBasedSpellChecker:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="field">content</str>
    <str name="buildOnCommit">true</str>
    <!-- optional elements with defaults
    <str name="distanceMeasure">org.apache.lucene.search.spell.LevenshteinDistance</str>
    <str name="accuracy">0.5</str>
    -->
  </lst>
</searchComponent>
The first element defines the searchComponent to use the solr.SpellCheckComponent. The classname is the specific implementation of the SpellCheckComponent, in this case solr.IndexBasedSpellChecker. Defining the classname is optional; if not defined, it will default to IndexBasedSpellChecker.
The spellcheckIndexDir defines the location of the directory that holds the spellcheck index, while the field defines the source field (defined in the Schema) for spell check terms. When choosing a field for the spellcheck index, it’s best to avoid a heavily processed field to get more accurate results. If the field has many word variations from processing synonyms and/or stemming, the dictionary will be created with those variations in addition to more valid spelling data.
Finally, buildOnCommit defines whether to build the spell check index at every commit (that is, every time new documents are added to the index). It is optional, and can be omitted if you would rather set it to false.
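The dedicated spell-check field mentioned above could be set up in the schema roughly as follows. This is a minimal sketch, assuming source fields named title and body already exist; the names spell and textSpell are hypothetical, and the analysis chain is deliberately limited to tokenizing and lowercasing so that no stemmed or synonym-expanded variants reach the dictionary.

```xml
<!-- Hypothetical field type: tokenization and lowercasing only, no stemming or synonyms -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical dedicated field; multiValued because several sources are copied into it -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>

<!-- Assumes title and body are defined elsewhere in the schema -->
<copyField source="title" dest="spell"/>
<copyField source="body" dest="spell"/>
```

With something like this in place, the field element of the spellchecker configuration would name spell instead of content.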
DirectSolrSpellChecker
This uses terms from the Solr index without building a parallel index like the IndexBasedSpellChecker. This spell checker has the benefit of not having to be built regularly, meaning that the terms are always up-to-date with terms in the index. Here is how this might be configured in solrconfig.xml.
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">name</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <int name="maxQueryLength">40</int>
    <float name="maxQueryFrequency">0.01</float>
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>
For the spellcheck field, consider choosing one that undergoes little analysis, and combine a few fields (such as title, body, etc.) into one dedicated field for spell checking. Most of the configuration parameters define conditions that are applied when querying for spellcheck suggestions.
The distanceMeasure defines the metric to use during the spell check query. The value “internal” uses the default Levenshtein metric. The accuracy setting defines the threshold for a valid suggestion, while maxEdits defines the number of changes to the term to allow; its value can only be 1 or 2. The minPrefix defines the minimum number of leading characters that a suggestion must share with the query term. Setting this to 1 means that the spelling suggestions will all start with the same letter as the query term.
The maxInspections parameter defines the maximum number of possible matches to review before returning results; the default is 5. The minQueryLength defines how many characters must be in the query before suggestions are provided; the default is 4. The maxQueryLength enables the spell checker to skip over very long query terms, which can avoid expensive operations or exceptions. There is no limit to term length by default.
First, the spellchecker analyzes the incoming query words by looking them up in the index. Only query words that are absent from the index or too rare (at or below maxQueryFrequency) are considered misspelled and used for finding suggestions. Words that are more frequent than maxQueryFrequency bypass the spellchecker unchanged. Once suggestions have been found for every misspelled word, they are filtered by frequency, with thresholdTokenFrequency as the boundary value. Both parameters (maxQueryFrequency and thresholdTokenFrequency) can be given as a percentage expressed as a decimal value below 1 (such as 0.01 for 1%) or as an absolute value (such as 4).
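To make the two thresholds concrete, here is a sketch of the same spellchecker with one absolute and one fractional value. The numbers are illustrative assumptions, not recommendations. In an index of 10,000 documents, a maxQueryFrequency of 0.01 would treat a query word as possibly misspelled only if it appears in 100 documents or fewer; the absolute value 4 used below lowers that bound to 4 documents.

```xml
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">name</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- Absolute value (assumed for illustration): only query words found in
       4 documents or fewer are treated as possibly misspelled -->
  <float name="maxQueryFrequency">4</float>
  <!-- Fraction: a suggestion must occur in roughly 1% of documents to be kept -->
  <float name="thresholdTokenFrequency">0.01</float>
</lst>
```

A stricter maxQueryFrequency like this means fewer query words trigger suggestions at all, which may suit indexes where rare but valid terms are common.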
FileBasedSpellChecker
The FileBasedSpellChecker uses an external text file as a spelling dictionary. This can be useful if using Solr as a spelling server, or if spelling suggestions don’t need to be based on actual terms in the index. In solrconfig.xml, you would define the searchComponent as below:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="name">file</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
    <!-- optional elements with defaults
    <str name="distanceMeasure">org.apache.lucene.search.spell.LevenshteinDistance</str>
    <str name="accuracy">0.5</str>
    -->
  </lst>
</searchComponent>
The differences from the IndexBasedSpellChecker are the use of sourceLocation to define the location of the terms file and the use of characterEncoding to define the encoding of that file.
Add SpellCheck to a Request Handler
Search requests go to a request handler. To generate suggestions, add the following to the requestHandler that is in use:
<str name="spellcheck">true</str>
One of the available parameters is spellcheck.dictionary, which names the dictionary to use; it can be specified more than once. With multiple dictionaries, all specified dictionaries are consulted and their results are interleaved. Collations are created from combinations of suggestions from the different spellcheckers, with care taken that multiple overlapping corrections do not occur in the same collation.
Here is an example with multiple dictionaries:
<requestHandler name="spellCheckWithWordbreak" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">file</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
Spell Check Parameters
The SpellCheck component accepts the parameters described below.
- spellcheck
The value of this parameter must be true to get SpellCheck suggestions for the request.
- spellcheck.q or q
This parameter specifies the query to spellcheck. If spellcheck.q is defined, then it is used; otherwise the original input query is used.
- spellcheck.build
If set to true, this parameter creates the dictionary to be used for spell checking.
- spellcheck.reload
If set to true, this parameter reloads the spellchecker.
- spellcheck.count
This parameter specifies the maximum number of suggestions that the spellchecker should return for a term.
- spellcheck.queryAnalyzerFieldType
A field type from Solr’s schema. The analyzer configured for the provided field type is used by the QueryConverter to tokenize the value of the q parameter.
- spellcheck.onlyMorePopular
If true, Solr will return only suggestions that result in more hits for the query than the existing query.
- spellcheck.maxResultsForSuggest
If, for example, this is set to 5 and the user’s query returns 5 or fewer results, the spellchecker will report “correctlySpelled=false” and also offer suggestions.
- spellcheck.alternativeTermCount
Defines the number of suggestions to return for each query term that exists in the index and/or dictionary.
- spellcheck.extendedResults
If true, this parameter causes Solr to return additional information about spellcheck results, such as the frequency of each original term in the index (origFreq) as well as the frequency of each suggestion in the index (frequency).
- spellcheck.collate
If true, this parameter directs Solr to take the best suggestion for each token (if one exists) and construct a new query from the suggestions.
- spellcheck.maxCollations
The maximum number of collations to return. This parameter is ignored if spellcheck.collate is false.
- spellcheck.maxCollationEvaluations
This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to deciding which collation candidates to test against the index.
- spellcheck.collateExtendedResults
If true, this parameter returns an expanded response format detailing the collations Solr found. This is ignored if spellcheck.collate is false.
- spellcheck.collateMaxCollectDocs
This parameter specifies the maximum number of documents that should be collected when testing potential collations against the index. A value of 0 indicates that all documents should be collected; otherwise an estimate is provided as a performance optimization. When spellcheck.collateExtendedResults is false, the optimization is always used as if 1 had been specified.
- spellcheck.collateParam.* prefix
This parameter prefix can be used to specify any additional parameters that you want the spellchecker to use when internally validating collation queries. For example, even if your regular search results allow for loose matching of one or more query terms via parameters like q.op=OR and mm=20%, you can specify override parameters such as spellcheck.collateParam.q.op=AND&spellcheck.collateParam.mm=100% to require that only collations consisting of words that are all found in at least one document may be returned.
- spellcheck.dictionary
This parameter causes Solr to use the dictionary named in the parameter’s argument. It can be used to invoke a specific spellchecker on a per-request basis.
- spellcheck.accuracy
Specifies an accuracy value to be used by the spell checking implementation to decide whether a result is worthwhile or not. The value is a float between 0 and 1.
- spellcheck.<DICT_NAME>.key
Specifies a key/value pair for the implementation handling a given dictionary. For example, given a dictionary called foo, spellcheck.foo.myKey=myValue would result in myKey=myValue being passed through to the implementation handling the dictionary foo.
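As an illustration of how several of these parameters combine, the sketch below sets them as defaults on a search handler. The handler name /spell and the chosen values are assumptions for the example, not recommendations, and it presumes a spellchecker named default is configured as shown earlier. Note that spellcheck.maxCollationTries is a Solr parameter not described in the list above; it is included on the assumption that collations should be tested against the index, which is when the collateParam overrides apply.

```xml
<!-- Hypothetical handler name; values are illustrative only -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.count">5</str>
    <!-- Include origFreq and frequency in the response -->
    <str name="spellcheck.extendedResults">true</str>
    <!-- Build up to 3 rewritten queries from the best suggestions -->
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollations">3</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <!-- Not covered in the list above: number of collations to test against the index -->
    <str name="spellcheck.maxCollationTries">5</str>
    <!-- Require every word of a tested collation to match -->
    <str name="spellcheck.collateParam.q.op">AND</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Because these are defaults, any of them can still be overridden per request, for example by passing spellcheck.count=10 in the query string.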