Thursday, January 3, 2013

Grokking Solr Trie Fields

I've been trying to wrap my head around Solr's trie field types for the past week, and finally made a break through.

Trie


The first thing you need to understand is the idea of a trie data type. Here's a basic outline of the data structure. Let's say you want to index the following list of words.

bad
bag
bar
bin
bit

We can arrange these words in a tree structure as follows.

b--a--d
|  |
|  |--g
|  |
|  \--r
|
\--i--n
   |
   \--t

To reconstruct one of the words you start at the root node of the tree, and work your way towards a leaf node, keeping track of the letters you encounter along the way. Depending on what you're using the trie data structure for, you may store some piece of data in one of the leaf nodes. If you want more information about the trie data structure, I will refer you to the Wikipedia page, which is fairly complete and easy to understand.

Solr's version of the trie


Another term for a trie is a "prefix tree". Solr uses this idea of prefixes to index numbers so that it can perform range queries efficiently. Just like we can organize words into tries, we can also organize numbers into tries.

Let's say I want to index the integer 3735928559. For clarity, let's rewrite that in hexadecimal, 0xDEADBEEF. When we index this using a TrieIntField, Solr stores the integer four times at different levels of precision.

0xDE000000
0xDEAD0000
0xDEADBE00
0xDEADBEEF

What Solr is doing here is constructing numbers with different length prefixes. This would be equivalent to a trie with this structure.

DE--AD--BE--EF

The reason that this allows for fast range queries is because of what the prefixes represent. The prefixes represent a range of values. It might be better to think of them indexed like this, instead.

0xDExxxxxx
0xDEADxxxx
0xDEADBExx
0xDEADBEEF

Each "x" represents an unset digit. That means the entry 0xDEADxxxx represents every number from 0xDEAD0000 to 0xDEADFFFF. You can get a better feel for this if you play around in the analysis section of the Solr admin console.

Precision Step


The option to set the precision step was the part that I understood the least. The available documentation is rather dense and unhelpful. The precision step lets you tune your index, trading range query speed for index size. A smaller precision step will result in a larger index and faster range queries. A larger precision step will result in a smaller index and slower range queries.

In the example above, I was using a precision step of 8, the default. What the precision step means is how many bits get pruned off the end of the number. Let's see what would happen if we indexed 0xDEADBEEF with a precision step of 12.

0xDExxxxxx
0xDEADBxxx
0xDEADBEEF

And here with a precision step of 4.

0xDxxxxxxx
0xDExxxxxx
0xDEAxxxxx
0xDEADxxxx
0xDEADBxxx
0xDEADBExx
0xDEADBEEx
0xDEADBEEF

As you can see, compared to the default precision step of 8, a precision step of 4 doubled the number of entries in the index. The way it speeds up range searches is by allowing better granularity. If I wanted to search for the documents matching the range 0xDEADBEE0 to 0xDEADBEEF with the default precision step, I would have to check all 16 records in the index and merge the results. With the precision step of 4, I can check the one record for 0xDEADBEEx and get the results I want.

That's a bit of a cherry picked example, but arbitrary range queries will be faster. How that works is left as an exercise for the reader.

May 5, 2015 - I originally thought that the trie data type was not supported directly by Lucene, but this is incorrect. Tokenization of numeric types into the various prefixes is the default behavior of Lucene when indexing numbers.