Thursday, March 21, 2013

Compiling a custom dictionary for Kuromoji and Solr

The user dictionary functionality of Solr's Kuromoji tokenizer is extremely useful and easy to use, but it isn't always the right tool for the job. In my case, I'm migrating our system off the MeCab tokenizer. MeCab also allows you to customize the tokenization, but the two models are completely different. In Kuromoji's user dictionary, you take an untokenized phrase and provide the custom tokenization. In MeCab, you just supply a list of words to augment its dictionary. The only way to migrate MeCab's custom dictionary to Kuromoji's user dictionary is by hand.

Fortunately, Kuromoji uses the same base data set as MeCab to build its statistical model for tokenization, and all the data in the base data set is in the same format as the custom MeCab dictionary. Getting Kuromoji to use the custom MeCab dictionary just requires recompiling Kuromoji's dictionary. Figuring out how to do this was surprisingly painless.

In order to compile your new dictionary, you will need...
  1. A copy of the MeCab-IPADIC data
  2. A copy of the Solr source code
  3. A Solr distribution of the same version as your source code
  4. A servlet container in which to run Solr

Download MeCab-IPADIC

A tarball of the MeCab dictionary can be downloaded from SourceForge at the following link.

Unpack the tarball to a directory of your choice. From now on I will be referring to this directory as the "dictionary source" directory, or as $DICTSRC in code examples. If you look inside the dictionary source directory, you will see several CSV files. These files use the EUC-JP character encoding scheme. Any custom dictionary will need to be in the same format.
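Because the files are EUC-JP, they will look like garbage in a UTF-8 terminal. Here is a minimal sketch of a helper for peeking at one of them. The function name peek_dict is my own invention, and it assumes the iconv command is installed on your system.

```shell
# Hypothetical helper: print the first few lines of an EUC-JP
# dictionary CSV, converted to UTF-8 for display.
# Usage: peek_dict FILE [LINES]   (LINES defaults to 5)
peek_dict() {
    iconv -f EUC-JP -t UTF-8 "$1" | head -n "${2:-5}"
}
```

For example, peek_dict "$DICTSRC/Noun.csv" 3 should print the first three noun entries in readable form.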

If you open up one of the CSV files, you will see something like this.


The data in each field is roughly as follows.

Field 1 - A word to be used for tokenization
Field 2 - Left context ID
Field 3 - Right context ID
Field 4 - Word cost
Fields 5-10 - Part of speech
Field 11 - Base form
Field 12 - Reading
Field 13 - Pronunciation

Fields 2, 3, and 4 have to do with the statistical model for tokenization. For the purposes of constructing your custom dictionary, treat fields 2 and 3 as magic numbers that map to the part of speech, and copy them from an existing entry with the same part of speech. Field 4 is the "cost" of the word itself. The lower the cost of the word, the more likely it is to be used in a tokenization. Fields 5-10 should be copied from the appropriate MeCab CSV files. I don't know enough about Japanese to know the differences between fields 11, 12, and 13.
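If you would rather edit your custom dictionary in UTF-8, something like the following sketch can check the field count and produce the EUC-JP copy. The function name convert_dict is hypothetical, it assumes iconv and awk are available, and it assumes none of your fields contain a quoted comma (the check simply counts commas).

```shell
# Hypothetical helper: verify that every line of a UTF-8 CSV has
# exactly 13 comma-separated fields, then write an EUC-JP copy.
# Assumes no field contains a quoted comma.
# Usage: convert_dict SRC_UTF8_CSV DEST_EUCJP_CSV
convert_dict() {
    src=$1
    dest=$2
    # Report every line with the wrong field count on stderr and
    # return non-zero without writing the destination file.
    awk -F, 'NF != 13 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
             END { exit bad }' "$src" >&2 || return 1
    iconv -f UTF-8 -t EUC-JP "$src" > "$dest"
}
```

A call might look like convert_dict my_words.utf8.csv "$DICTSRC/custom.csv", where both file names are placeholders.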

Once you have your custom dictionary ready, drop it into the dictionary source directory along with the rest of the MeCab CSV files.

Set up Solr

Set up your servlet container and deploy the Solr WAR file. Make sure that your servlet container expands the WAR file so that you can access its contents. The expanded Solr webapp directory will be referred to as $WEBAPP. If the directory $WEBAPP/WEB-INF/classes does not exist, create it.

Open a terminal and find the Solr source code you downloaded. This directory will be referred to as $SOLRSRC. Run the following commands.

> cd $SOLRSRC/lucene/analysis/kuromoji
> ant compile-tools

Compile your dictionary

At this point, we should have everything necessary to compile the custom dictionary. Run the following commands.

> cd $SOLRSRC/lucene/build/analysis/kuromoji/classes/tools/
> java -cp ".:$WEBAPP/WEB-INF/lib/*" \
org.apache.lucene.analysis.ja.util.DictionaryBuilder \
ipadic $DICTSRC $WEBAPP/WEB-INF/classes euc-jp false

Once this completes, you can inspect the files that it created in $WEBAPP/WEB-INF/classes. There will be a deep hierarchy of directories, and then nine binary files that make up your dictionary. One of the JAR files in the lib directory contains a set of files with the same names as these, but the Java servlet spec says that the servlet container should first look in the classes directory, then look in the lib directory. Having your dictionary in the classes directory will override the dictionary packaged with your Solr distribution.
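As a quick sanity check, you can count the files that ended up under the classes directory. This is only a sketch: count_dict_files is a name I made up, and it counts every regular file under WEB-INF/classes, so the number will be higher than nine if you already keep other files there.

```shell
# Hypothetical helper: count regular files under WEB-INF/classes
# of an expanded webapp directory.
# Usage: count_dict_files WEBAPP_DIR
count_dict_files() {
    # tr strips the padding that some wc implementations add.
    find "$1/WEB-INF/classes" -type f | wc -l | tr -d ' '
}
```

Running count_dict_files "$WEBAPP" right after the build should report the nine dictionary files if the classes directory was empty beforehand.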

You should now have a Solr instance with a custom Japanese dictionary for tokenization. Start up your servlet container and test it out.
