Tuesday, March 19, 2013

Custom Japanese tokenization in Solr 4.0

Solr 4.0 (really, it's been there since 3.6) has a new analysis module for handling Japanese, called Kuromoji. Kuromoji was developed by Atilika, Inc., who donated it to Solr. I don't speak Japanese myself, but I've been doing some preliminary tests with a Japanese coworker, and it seems to work fairly well.

However, it's not perfect. It will miss an occasional phrase, and it is especially problematic with domain-specific phrases. To get around this, the Japanese tokenizer accepts a user dictionary, where you can list custom tokenizations. Unfortunately, I couldn't find any documentation on the format for the user dictionary. Fortunately, there is a sample user dictionary in the unit tests for the Japanese tokenizer.

# Custom segmentation for long entries
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞

# Custom reading for sumo wrestler

# Silly entry:
abcd,a b cd,foo1 foo2 foo3,bar
abcdefg,ab cd efg,foo1 foo2 foo4,bar

It's a fairly straightforward CSV format. A hash character (#) starts a comment that continues to the end of the line. Empty lines are ignored. Each non-empty line has four fields separated by commas. After some testing, I was able to figure out what each field in the CSV is.

  1. Untokenized phrase
  2. Tokenized phrase
  3. Reading, or pronunciation
  4. Part of speech
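
To make the field layout concrete, here is a minimal Python sketch that reads one line of the format as I've described it. This is not the parser Solr uses, and the names UserDictEntry and parse_line are made up for illustration. It only assumes the rules above: a hash starts a comment, empty lines are skipped, and each entry has four comma-separated fields.

```python
from typing import List, NamedTuple, Optional


class UserDictEntry(NamedTuple):
    surface: str          # field 1: untokenized phrase
    segments: List[str]   # field 2: tokenized phrase, split on spaces
    readings: List[str]   # field 3: reading, split on spaces
    part_of_speech: str   # field 4: part of speech


def parse_line(line: str) -> Optional[UserDictEntry]:
    """Parse one dictionary line; return None for comments and empty lines."""
    # Everything after a hash is a comment (hypothetical handling, per the post).
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    fields = content.split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}: {line!r}")
    surface, segmentation, reading, pos = fields
    return UserDictEntry(surface, segmentation.split(" "), reading.split(" "), pos)


if __name__ == "__main__":
    print(parse_line("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞"))
```
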

There are a couple of particulars you need to be aware of when putting together your user dictionary.

Every field is required - If you do not have all four fields, your core will not load properly.

Tokenized phrase and reading - These fields are lists of words delimited by spaces. It is important that both the tokenized phrase and the reading have the same number of words. If you don't have this, your core will not load properly.

Spaces around commas - The CSV parser is very picky about format. You should never have any spaces surrounding the commas that separate fields. If you do, your core may or may not load, and you can get other strange errors during tokenization.
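
Since a bad dictionary can keep a core from loading, it may be worth checking the file before deploying it. The following is a rough sketch of such a check in Python, based only on the three rules above. It is not part of Solr, and check_line and check_dictionary are hypothetical names. It also treats an empty field as an error, which is my reading of "every field is required".

```python
from typing import List, Tuple


def check_line(line: str) -> List[str]:
    """Return the problems found in one user dictionary line."""
    # Drop any comment and trailing whitespace; skip comment-only and empty lines.
    content = line.split("#", 1)[0].rstrip()
    if not content.strip():
        return []
    fields = content.split(",")
    if len(fields) != 4:
        return [f"expected 4 fields, found {len(fields)}"]
    problems = []
    if any(not f.strip() for f in fields):
        problems.append("empty field")
    if any(f != f.strip() for f in fields):
        problems.append("whitespace at the start or end of a field")
    # Tokenized phrase and reading must have the same number of words.
    segments, readings = fields[1].split(), fields[2].split()
    if len(segments) != len(readings):
        problems.append(
            f"{len(segments)} tokens but {len(readings)} readings"
        )
    return problems


def check_dictionary(text: str) -> List[Tuple[int, str]]:
    """Check a whole dictionary; return (line number, problem) pairs."""
    return [
        (number, problem)
        for number, line in enumerate(text.splitlines(), 1)
        for problem in check_line(line)
    ]


if __name__ == "__main__":
    sample = "abcd,a b cd,foo1 foo2 foo3,bar\nabcd, a b cd,foo1 foo2,bar\n"
    for number, problem in check_dictionary(sample):
        print(f"line {number}: {problem}")
```

An empty result from check_dictionary only means the file passes these three checks. It does not guarantee that Solr will accept the file.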

I haven't tested this extensively, but I don't believe there is any way to escape a comma or a hash character. The CSV parser will accept fields surrounded by quote marks, but putting a comma or hash inside a quoted string does not seem to change how it is interpreted. Fortunately, this seems like it would be a very rare use case.
