Thursday, March 21, 2013

Compiling a custom dictionary for Kuromoji and Solr

The user dictionary functionality of Solr's Kuromoji tokenizer is extremely useful and easy to use, but it isn't always the right tool for the job. In my case, I'm migrating our system off of the MeCab tokenizer. MeCab also allows you to customize the tokenization, but the two models are completely different. In Kuromoji's user dictionary, you take an untokenized phrase and provide the custom tokenization. In MeCab, you just supply a list of words to augment its dictionary. The only way to migrate MeCab's custom dictionary to Kuromoji's user dictionary is by hand.

Fortunately, Kuromoji uses the same base data set as MeCab to build its statistical model for tokenization, and all the data in the base data set is in the same format as the custom MeCab dictionary. Getting Kuromoji to use the custom MeCab dictionary just requires recompiling Kuromoji's dictionary. Figuring out how to do this was surprisingly painless.

In order to compile your new dictionary, you will need...
  1. A copy of the MeCab-IPADIC data
  2. A copy of the Solr source code
  3. A Solr distribution of the same version as your source code
  4. A servlet container in which to run Solr

Download MeCab-IPADIC

A tarball of the MeCab dictionary can be downloaded from SourceForge at the following link.

Unpack the tarball to a directory of your choice. From now on I will be referring to this directory as the "dictionary source" directory, or as $DICTSRC in code examples. If you look inside the dictionary source directory, you will see several CSV files. These files use the EUC-JP character encoding scheme. Any custom dictionary will need to be in the same format.
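If your custom dictionary was written in an editor that saves UTF-8, it has to be re-encoded before it goes in with the other CSV files. Here is a minimal sketch of that conversion in Python. The file names are hypothetical, and it assumes the source file really is UTF-8.

```python
import tempfile
from pathlib import Path


def convert_to_euc_jp(src: Path, dest: Path) -> None:
    """Re-encode a UTF-8 text file as EUC-JP."""
    text = src.read_text(encoding="utf-8")
    # The default error mode is strict, so a character with no EUC-JP
    # representation raises UnicodeEncodeError instead of being mangled.
    dest.write_text(text, encoding="euc_jp")


if __name__ == "__main__":
    # Hypothetical file names, written to a scratch directory for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "custom-utf8.csv"
        dest = Path(tmp) / "custom.csv"
        src.write_text("東京\n", encoding="utf-8")
        convert_to_euc_jp(src, dest)
        print(dest.read_text(encoding="euc_jp"), end="")
```

Failing loudly on an unencodable character seemed safer to me than substituting a replacement character, since a silently damaged word would just never match anything.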

If you open up one of the CSV files, you will see something like this.


The data in each field is roughly as follows.

Field 1 - A word to be used for tokenization
Field 2 - Left context ID
Field 3 - Right context ID
Field 4 - Word cost
Fields 5-10 - Part of speech
Field 11 - Base form
Field 12 - Reading
Field 13 - Pronunciation

Fields 2, 3, and 4 have to do with the statistical model for tokenization. For the purposes of constructing your custom dictionary, treat fields 2 and 3 as magic numbers mapping to part of speech. Field 4 is the "cost" of the word itself. The lower the cost of the word, the more likely it is to be used in a tokenization. Fields 5-10 should be copied from the appropriate MeCab CSV files. I don't know enough about Japanese to know the differences between fields 11, 12, and 13.
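To make the field layout concrete, here is a small Python sketch that assembles one 13-field line. The function name and every value in the example are placeholders of mine, not real dictionary data. In practice you would copy fields 2, 3, and 5-10 from a comparable row in the MeCab CSV files and pick your own word cost.

```python
def format_entry(surface, left_id, right_id, cost, pos,
                 base_form, reading, pronunciation):
    """Build one 13-field dictionary line (fields 1-13 as listed above)."""
    if len(pos) != 6:
        raise ValueError("fields 5-10 need exactly six part-of-speech values")
    fields = [surface, str(left_id), str(right_id), str(cost), *pos,
              base_form, reading, pronunciation]
    # This sketch does no quoting, so a comma inside a field would shift
    # every later column. Reject it instead of writing a broken row.
    for field in fields:
        if "," in field:
            raise ValueError(f"field contains a comma: {field!r}")
    return ",".join(fields)


if __name__ == "__main__":
    # Placeholder values only: the IDs, the cost, and the part-of-speech
    # fields should come from a similar existing row.
    line = format_entry("クロモジ", 1285, 1285, 5000,
                        ["名詞", "一般", "*", "*", "*", "*"],
                        "クロモジ", "クロモジ", "クロモジ")
    print(line)
    # The dictionary files are EUC-JP, so encode accordingly when saving.
    encoded = (line + "\n").encode("euc_jp")
    print(len(encoded), "bytes")
```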

Once you have your custom dictionary ready, drop it into the dictionary source directory along with the rest of the MeCab CSV files.

Set up Solr

Set up your servlet container and deploy the Solr WAR file. Make sure that your servlet container expands the WAR file so that you can access its contents. The expanded Solr webapp directory will be referred to as $WEBAPP. If the directory $WEBAPP/WEB-INF/classes does not exist, create it.

Open a terminal and find the Solr source code you downloaded. This directory will be referred to as $SOLRSRC. Run the following commands.

> cd $SOLRSRC/lucene/analysis/kuromoji
> ant compile-tools

Compile your dictionary

At this point, we should have everything necessary to compile the custom dictionary. Run the following commands.

> cd $SOLRSRC/lucene/build/analysis/kuromoji/classes/tools/
> java -cp ".:$WEBAPP/WEB-INF/lib/*" \
org.apache.lucene.analysis.ja.util.DictionaryBuilder \
ipadic $DICTSRC $WEBAPP/WEB-INF/classes euc-jp false

Once this completes, you can inspect the files that it created in $WEBAPP/WEB-INF/classes. There will be a deep hierarchy of directories, and then nine binary files that make up your dictionary. One of the JAR files in the lib directory contains a set of files with the same names as these, but the Java servlet spec says that the servlet container should first look in the classes directory, then look in the lib directory. Having your dictionary in the classes directory will override the dictionary packaged with your Solr distribution.
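If you would rather not click through the directory hierarchy by hand, a short Python sketch like the following lists whatever ended up under the classes directory. It assumes a WEBAPP environment variable pointing at the expanded webapp, which is my convention for the demo and not something Solr sets.

```python
import os
from pathlib import Path


def list_generated_files(classes_dir):
    """Return the relative paths of all files under classes_dir, sorted."""
    root = Path(classes_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # WEBAPP is assumed to be exported in the shell; falls back to ".".
    webapp = Path(os.environ.get("WEBAPP", "."))
    for name in list_generated_files(webapp / "WEB-INF" / "classes"):
        print(name)
```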

You should now have a Solr instance with a custom Japanese dictionary for tokenization. Start up your servlet container and test it out.

Tuesday, March 19, 2013

Custom Japanese tokenization in Solr 4.0

Solr 4.0 (really, it's been there since 3.6) has a new analysis module for handling Japanese, called Kuromoji. Kuromoji was developed by Atilika, Inc., who donated it to Solr. I don't speak Japanese myself, but I've been doing some preliminary tests with a Japanese coworker, and it seems to work fairly well.

However, it's not perfect. It will miss an occasional phrase, and is especially problematic with domain-specific phrases. To get around this, the Japanese tokenizer accepts a user dictionary, where you can list custom tokenizations. Unfortunately, I couldn't find any documentation on the format for the user dictionary. Fortunately, there is a sample user dictionary in the unit tests for the Japanese tokenizer.

# Custom segmentation for long entries
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞

# Custom reading for sumo wrestler

# Silly entry:
abcd,a b cd,foo1 foo2 foo3,bar
abcdefg,ab cd efg,foo1 foo2 foo4,bar

It's a fairly straightforward CSV format. A hash character (#) starts a comment that continues to the end of a line. Empty lines are ignored. Each non-empty line has four fields separated by commas. After some testing, I was able to figure out what each field in the CSV was.

  1. Untokenized phrase
  2. Tokenized phrase
  3. Reading, or pronunciation
  4. Part of speech
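The format as described above can be sketched as a tiny parser in Python. This is my own reading of the rules, not Kuromoji's actual parser: it treats any # as the start of a comment, splits on bare commas, and does no quote handling. The class and function names are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserEntry:
    phrase: str            # field 1: untokenized phrase
    tokens: List[str]      # field 2: tokenized phrase, split on spaces
    readings: List[str]    # field 3: one reading per token
    part_of_speech: str    # field 4


def parse_user_dictionary(text: str) -> List[UserEntry]:
    """Parse user dictionary text into entries, skipping comments and blanks."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]   # drop everything after a hash
        if not line.strip():
            continue                  # empty or comment-only line
        fields = line.split(",")
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields: {raw!r}")
        phrase, tokenized, reading, pos = fields
        entries.append(
            UserEntry(phrase, tokenized.split(" "), reading.split(" "), pos)
        )
    return entries


if __name__ == "__main__":
    sample = (
        "# Custom segmentation for long entries\n"
        "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞\n"
        "\n"
        "abcd,a b cd,foo1 foo2 foo3,bar\n"
    )
    for entry in parse_user_dictionary(sample):
        print(entry.phrase, entry.tokens, entry.readings, entry.part_of_speech)
```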

There are a couple of particulars you need to be aware of when putting together your user dictionary.

Every field is required - If you do not have all four fields your core will not load properly.

Tokenized phrase and reading - These fields are lists of words delimited by spaces. It is important that both the tokenized phrase and the reading have the same number of words. If you don't have this, your core will not load properly.

Spaces around commas - The CSV parser is very picky about format. You should never have any spaces surrounding the commas separating fields. If you do, your core may or may not load, and even when it does load you can get strange errors during tokenization.
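Since a bad line can keep a core from loading, it can be worth checking the file before deploying it. The following Python sketch checks the three rules above. It is a standalone lint based on my observations, not a reimplementation of Solr's loader, and the function name is hypothetical.

```python
def check_user_dictionary(text):
    """Return a list of (line_number, message) for lines that break the rules."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0]   # a hash starts a comment
        if not line.strip():
            continue                  # blank or comment-only line
        fields = line.split(",")
        # Rule 1: every line needs all four fields, none of them empty.
        if len(fields) != 4:
            problems.append((number, f"expected 4 fields, found {len(fields)}"))
            continue
        if any(not field.strip() for field in fields):
            problems.append((number, "empty field"))
        # Rule 3: no spaces next to the separating commas.
        if any(field != field.strip() for field in fields):
            problems.append((number, "whitespace next to a comma"))
        # Rule 2: tokenized phrase and reading need the same word count.
        tokens, readings = fields[1].split(), fields[2].split()
        if len(tokens) != len(readings):
            problems.append(
                (number, f"{len(tokens)} tokens but {len(readings)} readings")
            )
    return problems


if __name__ == "__main__":
    sample = (
        "abcd,a b cd,foo1 foo2 foo3,bar\n"
        "abcd, a b cd,foo1 foo2 foo3,bar\n"
        "abcd,a b,foo1,bar\n"
    )
    for number, message in check_user_dictionary(sample):
        print(f"line {number}: {message}")
```

Note that a trailing comment after an entry, separated by a space, is reported as whitespace by this sketch. I chose to flag it because I have not verified how the tokenizer treats that case.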

I haven't tested this extensively, but I don't believe there is any way to escape a comma or a hash character. The CSV parser will accept fields surrounded by quote marks, but putting a comma or hash inside a quoted string does not seem to change how it is interpreted. Fortunately, this seems like it would be a very rare use case.