NextBrick SOLR FAQs

Q1. Is Solr Stable? Is it “Production Quality?”

Yes. Solr is currently being used to power search applications on several high-traffic, publicly accessible websites.


Q2. Is Solr vulnerable to any security exploits?

Every effort is made to ensure that Solr is not vulnerable to any known exploits. For specific information, see SolrVulnerabilities. Because Solr does not have any built-in security features, it should not be installed in a location that can be reached directly by anyone who cannot be trusted.


Q3. Is Solr Schema-less?

Yes. Solr does have a schema to define types, but it's a "free" schema in that you don't have to define all of your fields ahead of time. Using <dynamicField /> declarations, you can configure field types based on field naming convention, and each document you index can have a different set of fields. Also, Solr supports a schemaless mode in which previously unseen fields' types are detected based on field values, and the resulting typed fields are automatically added to the schema.
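
As a rough illustration (a sketch, not the only approach): assuming a dynamicField pattern such as *_s mapped to a string type, as in the Solr example schema, a SolrJ client can index fields that were never declared individually. The collection URL and field names below are placeholders.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DynamicFieldExample {
    public static void main(String[] args) throws Exception {
        // Collection URL is a placeholder; adjust to your own install.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "book-1");
            // "publisher_s" was never declared explicitly; it is picked up by a
            // dynamicField pattern such as *_s (string) in the example schema.
            doc.addField("publisher_s", "DC Comics");
            client.add(doc);
            client.commit();
        }
    }
}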


Q4. Can I reuse the same ZooKeeper cluster with other applications, or multiple SolrCloud clusters?

Yes. You can use SolrCloud with any existing ZooKeeper installation that is running other apps, or with a ZooKeeper installation already powering one or more existing SolrCloud clusters.

For simplicity, Solr creates nodes at the ZooKeeper root, but a distinct chroot option can be specified for each SolrCloud cluster to isolate them.
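
For example, a SolrJ CloudSolrClient can be pointed at a chrooted path inside a shared ensemble. This is a sketch: the hostnames and the /solr-cluster-one chroot are made-up values, and the two-argument Builder shown here assumes a reasonably recent SolrJ version.

import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class ChrootedCloudClient {
    public static void main(String[] args) throws Exception {
        // Shared ZooKeeper ensemble; this SolrCloud cluster keeps all of its
        // znodes under the chroot /solr-cluster-one instead of the ZK root.
        List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                zkHosts, Optional.of("/solr-cluster-one")).build()) {
            // ... index and query collections that live in this cluster ...
        }
    }
}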


Q5. Why doesn’t SolrCloud automatically create replicas when I add nodes?

There’s no way that Solr can guess what the user’s intentions are when adding a new node to a SolrCloud cluster.

If Solr automatically creates replicas on a cloud with billions of documents, it might take hours for that replication to complete. Users with very large indexes would be VERY irritated if this were to happen automatically.

The new nodes might be intended for an entirely new collection, not new replicas on existing collections. Users who have this intention would also be unhappy if Solr decided to add new replicas.

Even when the intent is to add new replicas, Solr has no way of knowing which collections should be replicated. On a very large cloud with hundreds of collections, choosing to add a replica to all of them could take a very long time and use up all the disk space on the new node.

Additionally, creating replicas uses a lot of disk and network I/O bandwidth. If a node is added during normal hours and replication starts automatically, it might drastically affect query performance.
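
When you do want replicas on a newly added node, you create them explicitly through the Collections API (ADDREPLICA). Below is a rough SolrJ sketch; the collection, shard, and node names are placeholders.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AddReplicaExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://solr-node1:8983/solr").build()) {
            // Explicitly place one new replica of shard1 of "mycollection"
            // on the node that was just added to the cluster.
            CollectionAdminRequest.AddReplica req =
                    CollectionAdminRequest.addReplicaToShard("mycollection", "shard1");
            req.setNode("solr-node5:8983_solr");
            req.process(client);
        }
    }
}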


Q6. How can indexing be accelerated?

A few ideas:

  • Include multiple documents in a single add operation (see the sketch after this list). Note: don’t put a huge number of documents in each add operation. With very large documents, you may only want to index them ten or twenty at a time; for small documents, between 100 and 1000 is more reasonable.
  • Ensure you are not performing commits until you need to see the updated index.
  • If you are reindexing every document in your index, completely removing the existing index first can substantially reduce the required time and disk space.
  • Solr can do some, but not all, parts of indexing in parallel. Indexing with multiple client threads can be a boon, particularly if you have multiple CPUs in your Solr server and your analysis requirements are considerable.
  • Experiment with different mergeFactor and maxBufferedDocs settings (see http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html).
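
A rough SolrJ sketch of the batching advice from the first bullet; the collection URL, field names, and the batch size of 500 are all placeholder choices.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 10000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title", "Document " + i);
                batch.add(doc);
                if (batch.size() == 500) {   // send several hundred small docs per add
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();                 // commit once at the end, not per batch
        }
    }
}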

Q7. How can I speed up facet counts?

Performance problems can arise when faceting on fields/queries with many unique values. If you are faceting on a tokenized field, consider making it untokenized (field class solr.StrField, or using solr.KeywordTokenizerFactory).

Also, keep in mind that Solr must construct a filter for every unique value on which you request faceting. This only has to be done once, and the results are stored in the filterCache. If you are experiencing slow faceting, check the cache statistics for the filterCache in the Solr admin. If there is a large number of cache misses and evictions, try increasing the capacity of the filterCache.
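
A minimal SolrJ faceting sketch, assuming a hypothetical untokenized string field named category_s:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);
            // Facet on an untokenized string field rather than a tokenized text field.
            query.addFacetField("category_s");
            query.setRows(0);                // only the facet counts are needed

            QueryResponse rsp = client.query(query);
            FacetField categories = rsp.getFacetField("category_s");
            for (FacetField.Count c : categories.getValues()) {
                System.out.println(c.getName() + " -> " + c.getCount());
            }
        }
    }
}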


Q8. Why don’t International Characters Work?

Solr can index any characters expressed in the UTF-8 charset (see SOLR-96). There are no known bugs with Solr’s character handling, but there have been some reported issues with the way different application servers (and different versions of the same application server) treat incoming and outgoing multibyte characters. In particular, people have reported better success with Tomcat than with Jetty…

  • “International Charsets in embedded XML” (Jetty 5.1)
  • “Problem with surrogate characters in utf-8” (Jetty 6)

If you notice a problem with multibyte characters, the first step toward confirming that it is not a true Solr bug is to write a unit test that bypasses the application server, using AbstractSolrTestCase directly.

The most important points are:

  • The document has to be indexed as UTF-8 encoded on the Solr server. For example, if you send an ISO-8859-1 encoded document, the special ISO characters end up double-encoded (an extra byte is added), ruining the final encoding; only reindexing with UTF-8 can fix this.
  • The client needs UTF-8 URL encoding when forwarding the search request to the Solr server.
  • The server needs to support UTF-8 query strings. See e.g. Solr with Apache Tomcat. If you just call

    String value = request.getParameter("q");

    to get the query string, it can happen that q was decoded as ISO-8859-1, and Solr will then not return a search result.

One possible solution is:

String encoding = request.getCharacterEncoding();
if (null == encoding) {
    // Set your default encoding here
    request.setCharacterEncoding("UTF-8");
} else {
    request.setCharacterEncoding(encoding);
}
...
String value = request.getParameter("q");

Another possibility is to use java.net.URLDecoder/URLEncoder to transform all parameter values to UTF-8.
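
A minimal sketch of that approach; the class and method names are hypothetical, and the parsing is simplified to a single, non-repeated parameter.

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public final class Utf8Params {
    // Extract one parameter from the raw query string and decode it as UTF-8
    // ourselves, instead of trusting the container's default encoding.
    public static String utf8Param(String rawQueryString, String name)
            throws UnsupportedEncodingException {
        if (rawQueryString == null) {
            return null;
        }
        for (String pair : rawQueryString.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && name.equals(kv[0])) {
                return URLDecoder.decode(kv[1], "UTF-8");
            }
        }
        return null;
    }
}

// Usage inside the servlet:
//   String q = Utf8Params.utf8Param(request.getQueryString(), "q");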


Q9. What does “exceeded limit of maxWarmingSearchers=X” mean?

Whenever a commit happens in Solr, a new “searcher” (with new caches) is opened, “warmed” up according to your SolrConfigXml settings, and then put in place. The previous searcher is not closed until the warming searcher is ready. If multiple commits happen in rapid succession, then a searcher may still be warming up when another commit comes in. This will cause multiple searchers to be started and warming at the same time, all competing for resources. Only one searcher is ever actively handling queries at a time – all old searchers are thrown away when the latest one finishes warming.

maxWarmingSearchers is a setting in SolrConfigXml that helps you put a safety valve on the number of overlapping warming searchers that can exist at one time. If you see this error it means Solr prevented a new searcher from being opened because there were already X searchers warming up.

The best way to prevent this log message is to reduce the frequency of commit operations. Enough time should pass between commits for each commit to completely finish.

If you encounter this error a lot, you can increase the number for maxWarmingSearchers, but this is usually the wrong thing to do. It requires significant system resources (RAM, CPU, etc…) to do it safely, and it can drastically affect Solr’s performance.

If you only encounter this error infrequently because of an unusually high load on the system, you’ll probably be OK just ignoring it.
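
One common way to reduce how often clients issue explicit commits is to rely on commitWithin, letting Solr fold many updates into fewer commits. A rough SolrJ sketch (the URL, document fields, and the 60-second window are placeholders):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");
            doc.addField("title", "Commit less often");
            // Ask Solr to make this visible within 60 seconds instead of
            // issuing an explicit commit after every add.
            client.add(doc, 60_000);
        }
    }
}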


Q10. Why doesn’t my index directory get smaller (immediately) when I delete documents? force a merge? optimize?

Because of the “inverted index” data structure, deleting documents only annotates them as deleted for the purpose of searching. The space used by those documents will be reclaimed when the segments they are in are merged.

When segments are merged (either because of the Merge Policy as documents are added, or explicitly because of a forced merge or optimize command), Solr attempts to delete the old segment files. On some filesystems (notably Microsoft Windows) it is not possible to delete a file while the file is open for reading, which is usually the case since Solr is still serving requests against the old segments until the new searcher is ready and has its caches warmed. When this happens, the older segment files are left on disk, and Solr will re-attempt to delete them later, the next time a merge happens.


Q11. Can I use Lucene to access the index generated by Solr?

Yes, although this is not recommended. Writing to the index is particularly tricky. However, if you do go down this route, there are a couple of things to keep in mind. Be careful that the analysis chain you use in Lucene matches the one used to index the data or you’ll get surprising results. Also, be aware that if you open a searcher, you won’t see changes that Solr makes to the index unless you reopen the underlying readers.
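
A hedged sketch of read-only access with Lucene; the index path is a placeholder, and the Lucene version on the classpath must be compatible with the one Solr used to write the index.

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.FSDirectory;

public class ReadSolrIndex {
    public static void main(String[] args) throws Exception {
        // Path to the core's data/index directory; a placeholder value.
        FSDirectory dir = FSDirectory.open(
                Paths.get("/var/solr/data/mycollection/data/index"));
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        System.out.println("Live docs: "
                + searcher.search(new MatchAllDocsQuery(), 1).totalHits);

        // Changes Solr makes later are invisible until the reader is reopened.
        DirectoryReader newer = DirectoryReader.openIfChanged(reader);
        if (newer != null) {
            reader.close();
            reader = newer;
        }
        reader.close();
        dir.close();
    }
}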


Q12. Is there a limit on the number of keywords for a Solr query?

No. If you make a GET request (through the Solr web interface, for example), you are limited by the maximum URL length supported by the browser and the servlet container; sending the query as a POST request avoids this limit.


Q13. How can I efficiently search for all documents that contain a value in fieldX?

If the number of unique terms in fieldX is bounded and relatively small (ie: a “category” or “state” field) or if fieldX is a “Trie” Numeric field with a small precision step then you will probably find it fast enough to do a simple range query on the field – ie: fieldX:[* TO *]. When possible, doing these in cached filter queries (ie: “fq”) will also improve performance.

A more efficient method is to also ensure that your index has an additional field which records whether or not each document has a value – ie: has_fieldX as a boolean field that can be queried with has_fieldX:true, or num_values_fieldX that can be queried with num_values_fieldX:[1 TO *]. This technique requires you to know in advance that you will want to query on this type of information, so that you can add this extra field to your index, but it can be significantly faster.

Adding a field like num_values_fieldX is easy to do automatically in Solr 4.0 and later by modifying your update request processor chain to include the CountFieldValuesUpdateProcessorFactory.
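
A rough SolrJ sketch of the two query styles described above, using the placeholder field names fieldX and has_fieldX from this answer:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class HasValueQueries {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            // Option 1: cached filter query on an open-ended range.
            SolrQuery rangeQuery = new SolrQuery("*:*");
            rangeQuery.addFilterQuery("fieldX:[* TO *]");
            System.out.println(client.query(rangeQuery).getResults().getNumFound());

            // Option 2: query a dedicated boolean field maintained at index time.
            SolrQuery flagQuery = new SolrQuery("*:*");
            flagQuery.addFilterQuery("has_fieldX:true");
            System.out.println(client.query(flagQuery).getResults().getNumFound());
        }
    }
}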


Q14. Should I use the standard or dismax Query Parser?

The standard Query Parser uses SolrQuerySyntax to specify the query via the q parameter, and it must be well formed or an error will be returned. It’s good for specifying exact, arbitrarily complex queries.

The DisMax Query Parser has a more forgiving query parser for the q parameter, useful for directly passing in a user-supplied query string. The other parameters make it easy to search across multiple fields using disjunctions and sloppy phrase queries to return highly relevant results.

For servicing user-entered queries, start by using dismax.
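
A short SolrJ sketch of a dismax request; the qf field names and the sample user input are placeholders.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DismaxExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            // Raw user input, complete with an unbalanced quote; dismax tolerates it.
            SolrQuery query = new SolrQuery("batman \"dark knight");
            query.set("defType", "dismax");
            query.set("qf", "title^2 subject body");
            System.out.println(client.query(query).getResults().getNumFound());
        }
    }
}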


Q15. How can I make “superman” in the title field score higher than in the subject field?

For the standard request handler, “boost” the clause on the title field:

q=title:superman^2 subject:superman

Using the dismax request handler, one can specify boosts on fields in parameters such as qf:

q=superman&qf=title^2 subject


Q16. Why are search results returned in the order they are? How can I see the relevancy scores for search results?

If no other sort order is specified, the default is by relevancy score. To see the scores, request that the pseudo-field named “score” be returned by adding it to the fl (field list) parameter. The “score” will then appear along with the stored fields in returned documents. For example:

q=Justice League&fl=*,score


Q17. Why doesn’t my query of “flash” match a field containing “Flash” (with a capital “F”)?

The fieldType for the field containing “Flash” must have an analyzer that lowercases terms. This will cause all searches on that field to be case insensitive.

See AnalyzersTokenizersTokenFilters for more.


Q18. How can I make exact-case matches score higher?

Example: a query of “Penguin” should score documents containing “Penguin” higher than docs containing “penguin”.

The general strategy is to index the content twice, using different fields with different fieldTypes (and different analyzers associated with those fieldTypes). One analyzer will contain a lowercase filter for case-insensitive matches, and one will preserve case for exact-case matches.

Use copyField commands in the schema to index a single input field multiple times.

Once the content is indexed into multiple fields that are analyzed differently, query across both fields.
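
For example (the field names here are hypothetical): if copyField sends the content into a lowercased field text_ci and a case-preserving field text_exact, a query such as

q=text_ci:penguin OR text_exact:Penguin^2

matches case-insensitively while scoring exact-case matches higher.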


Q19. How can I increase the score for specific documents?

Query Elevation Component

To raise certain documents to the top of the result list based on a certain query, one can use the QueryElevationComponent.

index-time boosts

To increase the scores for certain documents that match a query, regardless of what that query may be, one can use index-time boosts.

Index-time boosts can be specified per-field also, so only queries matching on that specific field will get the extra boost. An Index-time boost on a value of a multiValued field applies to all values for that field.

Under the covers, Solr uses this numeric boost value as a factor that contributes to the “norm” for the field (along with the length of the field in terms), so it is only valid if omitNorms="false" for the fields you use it on. It will also be encoded according to the rules of the Similarity class used, which means it may lose precision. (ie: in the DefaultSimilarity, the numeric norm value is encoded as a single byte using a 3-bit mantissa, so differences in boost values of less than 25% may end up being rounded out)

Index-time boosts may also be assigned with the optional “boost” attribute on the <doc> element of XML update messages, in which case it is equivalent to specifying the boost param on all of the individual fields. See UpdateXmlMessages for more information.

Using Field and/or Document boosts has been supported since the very early days of Lucene, but it is somewhat limiting and antiquated at this point; people should strongly consider indexing their boost values as numeric fields instead (see the next section).

Field Based Boosting

You can structure your queries to include “boosts” based on specific attributes of documents. This might be a simple matter of adding an optional query clause that “boosts” documents matching an “important:true” query, or of using a function query on the value of a numeric field.
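
For example (using the dismax parser and the important field mentioned above; the boost factor of 5 is arbitrary), a boost query pushes flagged documents up for any user query:

q=superman&defType=dismax&qf=title subject&bq=important:true^5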
