Dense Vector Search is a feature of Apache Solr that enables indexing and searching of dense numerical vectors. In this approach, deep learning models are used to produce a vector representation of both the query and the documents in a corpus of information. These neural network-based techniques are usually referred to as neural search, an industry derivation from the academic field of Neural Information Retrieval.
A dense vector representation distills approximate semantic meaning into a fixed (and limited) number of dimensions. The number of dimensions is generally much lower than in the sparse case (for example, term-based vectors with one dimension per vocabulary term), and the vector for any given document is dense, as most of its dimensions hold non-zero values. The task of generating vectors must be handled in application logic external to Apache Solr. In some cases it makes sense to search data that natively exists as a vector (e.g., scientific data). In a text search context, however, users will most likely use deep learning models such as BERT to encode textual information as dense vectors, and supply the resulting vectors to Apache Solr explicitly at index and query time.
The Dense Vector Field (solr.DenseVectorField) is the Apache Solr field type designed to support dense vector search. It supports indexing and searching dense vectors of float elements.
Here are the steps to use Dense Vector Search in Apache Solr:
- Configure the dense vector field in the schema as below:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
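The vectorDimension attribute specifies the number of dimensions in the vector; every vector supplied at index and query time must have exactly that many elements. The similarityFunction attribute selects the function used to compare vectors.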
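The same field type and field can also be expressed as a JSON body for Solr's Schema API instead of editing the schema file by hand. The following is a minimal sketch that only builds the request body and does not contact a server; the attribute names mirror the XML above, while the endpoint mentioned in the comment assumes a default local installation and a collection name you supply.

```python
import json

# Mirrors the XML definitions above as Schema API commands.
schema_commands = {
    "add-field-type": {
        "name": "knn_vector",
        "class": "solr.DenseVectorField",
        "vectorDimension": 4,
        "similarityFunction": "cosine",
    },
    "add-field": {
        "name": "vector",
        "type": "knn_vector",
        "indexed": True,
        "stored": True,
    },
}

schema_body = json.dumps(schema_commands, indent=2)
print(schema_body)

# Assumption: this body would be POSTed with Content-Type application/json to
# http://localhost:8983/solr/<collection>/schema on a default local install.
```

Building the body separately from sending it makes it easy to review the definition or keep it under version control before applying it.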
- Generate embeddings and ingest them into Solr:
Train a model, or pick a pre-trained one, outside Solr. To generate dense vectors for your documents you can use deep learning models such as BERT.
Create vector embeddings from the documents' fields with a custom script that calls the model, then index each vector into a field of the DenseVectorField type.
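As a rough illustration of the ingestion step, the sketch below turns documents into a JSON update body with a vector per document. The embed function is a hypothetical stand-in for a real model such as BERT (it hashes the text and carries no semantic meaning), and the "text" field name and the update URL in the comment are assumptions, not something the schema above defines.

```python
import hashlib
import json
import math

VECTOR_DIMENSION = 4  # must match vectorDimension in the schema


def embed(text, dim=VECTOR_DIMENSION):
    """Hypothetical placeholder for a real embedding model.

    Derives a deterministic unit-length vector from a hash of the text so the
    sketch runs without any model installed.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = [digest[i] - 127.5 for i in range(dim)]  # centred, never all zero
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def build_update_payload(docs, text_field="text", vector_field="vector"):
    """Return a JSON array of documents, each extended with its vector."""
    payload = []
    for doc in docs:
        vector = embed(doc[text_field])
        if len(vector) != VECTOR_DIMENSION:
            raise ValueError("vector length does not match the schema")
        payload.append({**doc, vector_field: vector})
    return json.dumps(payload)


docs = [
    {"id": "1", "text": "Apache Solr supports dense vector search."},
    {"id": "2", "text": "Embeddings are generated outside Solr."},
]
body = build_update_payload(docs)
print(body)

# Assumption: the body would be POSTed as application/json to
# http://localhost:8983/solr/<collection>/update?commit=true
```

In a real pipeline, embed would call the chosen model, and the length check guards against a mismatch between the model's output size and the vectorDimension configured in the schema.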
- Search using dense vectors:
You can then search for documents by encoding the query with the same model and passing the resulting query vector to Solr, using the knn query parser.
Solr will return the documents whose vectors are most similar to the query vector.
The similarity between vectors is computed with Euclidean distance (euclidean), dot product (dot_product) or cosine similarity (cosine), depending on the similarityFunction configured in the schema.
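A query for the nearest neighbours of a vector can be sketched as follows. The snippet only builds the request parameters for Solr's knn query parser; the field name "vector" comes from the schema above, while the topK value and the "fl" field list are illustrative choices.

```python
from urllib.parse import urlencode


def build_knn_params(query_vector, field="vector", top_k=10):
    """Build request parameters for a knn query against a dense vector field."""
    vector_literal = "[" + ", ".join(str(float(x)) for x in query_vector) + "]"
    return {
        "q": "{!knn f=%s topK=%d}%s" % (field, top_k, vector_literal),
        "fl": "id,score",  # illustrative: return only the id and the score
    }


params = build_knn_params([1.0, 2.0, 3.0, 4.0], top_k=3)
print(params["q"])        # {!knn f=vector topK=3}[1.0, 2.0, 3.0, 4.0]
print(urlencode(params))  # URL-encoded form for a /select request
```

The query vector must come from the same model that produced the document vectors and must have the number of dimensions configured in the schema.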
Common Use Cases
Dense vectors in Solr are often used in scenarios involving similarity searches, clustering, machine learning applications and anomaly detection.
Similarity Searches: You can use dense vectors to find documents with similar content or characteristics. This is useful in recommendation systems, content similarity analysis, or any application where similarity between documents is relevant.
Clustering: Dense vectors enable clustering documents based on their content or features. This helps in organizing large datasets into meaningful groups, facilitating analysis and understanding of patterns within the data.
Machine Learning Integration: Dense vectors can be used as features for machine learning models within Solr. This integration allows you to perform machine learning tasks like classification or regression directly on the indexed data.
Anomaly Detection: By leveraging dense vectors, you can identify outliers or anomalies in your dataset. This is valuable in various domains, such as fraud detection, where abnormal patterns can be indicative of suspicious activities.
When using dense vectors in Solr, it's crucial to define the vector field properly in the schema, taking into account the dimensionality of your embeddings and the nature of your data. Selecting a similarity function that suits your use case, and that matches how your model was trained, is equally important for accurate results in similarity search and clustering.
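To see why the choice of similarity function matters, the following sketch computes the three raw metrics in plain Python on toy vectors. It illustrates the mathematics only; it is not Solr's internal scoring code, and the example vectors are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


query = [1.0, 0.0, 0.0, 0.0]
same_direction = [3.0, 0.0, 0.0, 0.0]  # same direction, larger magnitude
nearby = [1.0, 1.0, 0.0, 0.0]          # closer in space, different direction

# Cosine ignores magnitude, so it prefers same_direction.
print(cosine_similarity(query, same_direction), cosine_similarity(query, nearby))

# Euclidean distance is sensitive to magnitude, so it prefers nearby.
print(euclidean_distance(query, same_direction), euclidean_distance(query, nearby))

# On unit-length vectors, the dot product equals the cosine similarity.
print(dot(normalize(query), normalize(nearby)))
```

The two metrics rank the same candidates differently here, which is why the function should follow what the embedding model was trained for. If dot_product is configured, vectors are expected to be normalized to unit length before indexing and querying.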
Current Limitation:
The maximum number of dimensions of a vector is currently limited to 1024. The limit exists for no particular reason other than to be performance-conscious, and it may be increased in the future. For now, if you want to use larger vectors, you need to customize the Lucene build and then use that build in Solr.