Monday, August 11, 2014

Hollywood doesn't know how to computer

I thought I would share a funny clip from "Law and Order: SVU" I found.

In this clip, computer tech Reuben Morales shows detectives Benson and Stabler how he discovered a hidden message on a flash drive belonging to a pedophile. He finds "computer code hidden in a pixel" of a picture of a rainbow. He "cracks" the computer code to find a "hidden file" with thousands of pornographic images. In other pixels, he finds PDFs listing the names of the people in the pictures.

What's so disappointing about this scene is that the writers were a hair's breadth away from real methods for hiding information. There is a form of steganography where you use the low order bit of each pixel to hide a message. In an uncompressed image using 24-bit color, each pixel is encoded with three bytes, one byte each for red, green and blue. The human eye is not sensitive enough to distinguish colors that are adjacent to each other in this color space. That means an image with a message encoded into the lowest bit of each byte should not look strange when viewing it in a normal image viewer.

If the writers had consulted an actual engineer, they could have made some very minor tweaks to make that scene actually make sense. First, the hidden message should have been spread across multiple pixels. Steganography works by breaking up the secret message into tiny parts and sprinkling it throughout whatever message you're hiding your payload in.

The next mistake they made was to say that the thousands of images were actually stored within the one rainbow picture. One large image could probably hold a fair amount of data, but you are not going to hide a thousand images inside one image (let alone one pixel). It would have made more sense to hide something like a cryptographic key. Cryptographic keys are on the order of a couple hundred bytes and could easily be hidden, even in very small images. The cryptographic key could then be used to decrypt a hidden volume containing all the images and PDFs.

There you have it. Two small changes that could have made the world of difference. The writers could have even thrown in the word steganography to make themselves look extra smart. Instead, we have this.

Friday, February 7, 2014

Fixing a bonehead mistake in Solr

I was poking around in one of our Solr cores at work when I got this output from a query.
{
  "responseHeader": {
    "status": 0,
    "QTime": 248,
    "params": {
      "indent": "true",
      "q": "*:*",
      "_": "1391802519673",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 36529751,
    "start": 0,
    "docs": [
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3304997"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645477"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645478"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645479"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645480"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3496486"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645481"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645482"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645484"
      },
      {
        "userId": "ERROR:SCHEMA-INDEX-MISMATCH,stringValue=3645485"
      }
    ]
  }
}
The reason for the error is that I reworked the schema of the core. I changed this field from a string to a long, and I forgot to delete all the existing records before reindexing.

Oops.

Okay, so how to fix it? I was able to determine that searching on the userId field with a numerical value would not return any of the corrupted records. A query like userId:[0 TO *] would select all valid records and exclude all corrupted records. I could invert that by doing *:* -userId:[0 TO *] to select all the corrupted records. A quick delete by query, and all the corrupted records disappeared.

Monday, November 18, 2013

Orange-ginger cranberry sauce

I think a lot of people don't realize how easy it is to make cranberry sauce from scratch. Here is my paraphrase of the cranberry sauce recipe that you will find on most packages of fresh cranberries.

  • 1 pound fresh cranberries
  • 1 cup sugar
  • 1 cup water
  1. Combine all ingredients in a sauce pan.
  2. Simmer until it looks like cranberry sauce.
This recipe is foolproof and will result in a solid, tasty cranberry sauce. However, it can be improved upon. I've been experimenting for the past few years, and I've finally hit on a really wonderful cranberry sauce recipe. It is a bit more involved, but I think it is well worth it.
  • 1 pound fresh cranberries
  • 3/4 cup sugar
  • 1 orange
  • 1/2 cup boiling water
  • 1 tablespoon grated fresh ginger
  1. Steep the grated ginger in the boiling water for 5 minutes. Strain the ginger and discard. Reserve the ginger water.
  2. Zest and juice the orange. Combine the orange juice with the ginger water. If the combined liquids are less than 1 cup in total, add water to bring it to 1 cup.
  3. Combine the orange zest, orange juice mixture, sugar and cranberries in a sauce pan.
  4. Simmer until it looks like cranberry sauce.
The addition of the orange and the ginger make a subtle, but tangible difference in the flavor of the sauce. The orange-ginger sauce has a brighter and tangier flavor. In a side by side comparison, most people should be able to tell that there is something different about the sauce, but I think it would take a discerning palate to be able to identify the specific flavors.

In my recipe, I've cut the sugar a bit, because I enjoy a tart cranberry sauce. This can be adjusted to taste. Steeping the ginger is a trick I learned from making ginger ale at home. I really wanted the fresh ginger flavor, but didn't want fibrous bits of ginger mixed into the sauce itself. The extra step of making a sort of ginger tea accomplishes this with minimal extra effort.

Saturday, April 13, 2013

Losing is Fun: A minimal tutorial for maximum Fun

Dwarf Fortress is an incredible game, but the menu system presents a significant hurdle for newbies. I recently got started and had to rely on the wealth of information available at the Dwarf Fortress Wiki, but in the end I think I regret using it too much.

The motto of Dwarf Fortress is "Losing is Fun." You cannot "win" this game. Your fortress will fall eventually. Seeing how your fortress falls is all part of the Fun of Dwarf Fortress. This way, you learn from your mistakes, so hopefully each fortress will be less disastrous more spectacular than the last. If you read too much about game mechanics before you start playing, you miss a lot of this Fun. This tutorial is meant to give you a minimal boost, to get you over the initial non-Fun of learning how to navigate the arcane set of menus, without giving away any Fun ruining spoilers.

Just to give you a taste of some of the problems with the menu system, scrolling through menu options is very inconsistent and haphazard. Sometimes you use arrow keys. Sometimes you use the number pad. Sometimes you use (+) and (-). Sometimes you use (u, h, k, m). You will have to pay attention to the on screen instructions for using each menu.

Worldgen


The first step in Dwarf Fortress is creating the world your dwarves are going to inhabit. Select the Create New World! option. The world generation menu is pretty straight forward. To begin with, use the default settings. World generation takes a few minutes to complete, so go look at cat pictures or something while your computer does its thing.

World generation is a one time thing. When your first fortress falls, don't generate a new world. Just set up a new fortress somewhere else in the same world. One of the coolest aspects of Dwarf Fortress is that the world is persistent, and incredibly detailed. That means that you can go back later to reclaim the ruins of an old fortress.

Embark!


Embarking just means choosing where to stake your claim. Select the Start Playing menu, and the Dwarf Fortress option. These menus are pretty straight forward, too. Use the commands across the bottom of your screen to select a location.

There are a couple keys to finding a good site. You want a site rich in raw materials, like wood, rock, metals and animals. Aquifers are bad. Flux is good. Once you've found a place that looks good you can Embark (e). Learning where to embark is Fun.

Once you have embarked, you will see a screen full of incomprehensible characters. You can get information about what each symbol means by using the Look (k) command. When you place the cursor over a square of the map, you will see information about what is occupying that square on the right hand side of your screen. Take some time to get a basic idea of what your surroundings look like.

If you ever get lost somewhere in the menu system, just keep hitting Esc repeatedly until you get back to the main screen. If you hit Esc too many times, it will just toggle between the main screen and the options menu.

To avoid ambiguity, all menu options will be given as a series of keystrokes as if you were navigating there from the main screen.

Strike the Earth!


One of the most important activities is mining. To start mining, open the Designation menu and select Mine (d, d). Use the arrow keys to mark sections of the map you would like to excavate. Parts of the map that you can mine are going to be solid black. If you don't see any solid black areas of the map, skip ahead to the section on Z-Levels, then come back here.

If you accidentally mark an area for excavation that you didn't mean to, you can use the Remove Designation (d, x) option to unmark these areas. When you unpause the game (space bar), your dwarves will get to work carrying out your wishes.

You should notice that while marking places for excavation, you never issued an order to a dwarf. In Dwarf Fortress, you only have indirect control over your dwarves. All you can do is issue a general order that this or that should be done. If a dwarf with that particular skill is available, then he or she will do the job. If no dwarf has the skill you need, then the job doesn't get done. This puts Dwarf Fortress about half way between a game like SimCity and The Sims. SimCity does not simulate the actions of individual people. The Sims does, but you have to micromanage every action that your Sims take. The game play of Dwarf Fortress presents some interesting challenges, because your dwarves will sometimes act in Fun and unpredictable ways.

Buildings


One of the more confusing things about Dwarf Fortress is what does and does not count as a "building". Things like tables and chairs count as "buildings", and are in the same menu as wells and bridges. Even more confusing is the error message you get when you try to build some of these "buildings". If you open up the Building menu and try to "build" a table (b, t), you will get the error message "Needs table". WTF does that mean?

For an object like a table, "building" the table really means placing a fully constructed table somewhere for use. Before you can place a table somewhere for use, you must "construct" the table in the first place.

Some of the most important buildings in the game are Workshops (b, w). Workshops are where your dwarves turn raw materials into useful objects. You will need workshops, but figuring out which workshops you need, and for what can lead to a lot of Fun.

Most buildings don't do anything on their own. In order for them to be useful, you have to designate tasks to be carried out by your dwarves at the buildings. You can interact with a building using the Set Building Tasks/Prefs command (q). In this mode, when you move your cursor near a building, you will get a menu of what you can do with that building. This will be very important with your workshops early in the game.

Z-Levels


The world of Dwarf Fortress is a three dimensional one. The map that the game presents to you is really just a horizontal cross section of this three dimensional world. One single cross section is called a Z-level. That's because typically a flat plane is represented with two coordinates, X and Y. Three dimensions are represented by X, Y and Z, where Z denotes the vertical dimension. I like to think of Z-levels as the pictures that an MRI or CT scan produces. Horizontal cross sections that you have to visualize stacked on top of each other.

To navigate down to lower Z-levels use the (>) key. To navigate to higher Z-levels use the (<) key. In order to dig between Z-levels, you will need to construct stairs or ramps. The various flavors of stairs and ramps can be found in the Designations menu along side the Mine command. These will need to be placed correctly in order for your dwarves to gain access to other Z-levels. You should experiment and have some Fun with this.

Miscellaneous Tips and Hints


Here are a couple more menus you should familiarize yourself with early on. Zones (i) and Stockpiles (p) are important. Play around with them to see what they do. Create a stockpile and see how your dwarves react to it. If it didn't seem to do anything, try creating a different type of stockpile. Do the same with zones.

The View Units (v) menu is also very important. In this mode, when you move your cursor near a creature, you will see some of the creatures stats and characteristics. Of particular importance is the Labor menu (v, p, l). It lets you designate which types of tasks a particular dwarf will perform.

My last advice is to not get too frustrated. Remember, "Losing is Fun." If your fortress falls, just start a new one. In my own process of learning the game, I try to focus on learning one thing at a time. For example, focus on how to acquire a particular raw material. Once you've acquired the raw material, see if you can figure out what that raw material is useful for. Nearly everything in this game has a use of some type.

On to the Fun


That should be enough information to at least get you started. I've deliberately left out a lot of key information that will hopefully result in a lot of Fun in your early fortresses. Once you've learned where to embark and familiarized yourself with the menus in this tutorial, you should be able to put together a fairly successful fortress. The key is to learn from your mistakes so you have a completely new and unexpected type of Fun in your next fortress.

To recap, these are the key menus you need to be familiar with to get your first fortress off (in?) the ground.

  • Look (k) - See what stuff is on a particular tile.
  • Designations (d) - Important commands like Mine can be found here
  • Building (b) - Used to erect buildings and place furniture around your fortress
  • Set Building Tasks/Prefs (q) - Interact with finished buildings
  • Zones (i) - Designate areas for certain uses
  • Stockpiles (p) - Designate areas for certain other uses
  • View (v) - View and set your dwarves' characteristics
And finally, access the third dimension by navigating Z-levels using (<) and (>).

Have Fun. :-)

Thursday, March 21, 2013

Compiling a custom dictionary for Kuromoji and Solr

The user dictionary functionality of Solr's Kuromoji tokenizer is extremely useful and easy to use, but it isn't always the right tool for the job. In my case, I'm migrating our system off of the MeCab tokenizer. MeCab also allows you customize the tokenization, but the two models are completely different. In Kuromoji's user dictionary, you take an untokenized phrase and provide the custom tokenization. In MeCab, you just supply a list of words to augment its dictionary. The only way to migrate MeCab's custom dictionary to Kuromoji's user dictionary is by hand.

Fortunately, Kuromoji uses the same base data set as MeCab to build its statistical model for tokenization, and all the data in the base data set is in the same format as the custom MeCab dictionary. Getting Kuromoji to use the custom MeCab dictionary just requires recompiling Kuromoji's dictionary. Figuring out how to do this was surprisingly painless.

In order to compile your new dictionary, you will need...
  1. A copy of the MeCab-IPADIC data
  2. A copy of the Solr source code
  3. A Solr distribution of the same version as your source code
  4. A servlet container in which to run Solr

Download MeCab-IPADIC


A tarball of the MeCab dictionary can be downloaded from SourceForge at the following link.


Unpack the tarball to a directory of your choice. From now on I will be referring to this directory as the "dictionary source" directory, or as $DICTSRC in code examples. If you look inside the dictionary source directory, you will see several CSV files. These files use the EUC-JP character encoding scheme. Any custom dictionary will need to be in the same format.

If you open up one of the CSV files, you will see something like this.

いっぽう,555,555,5224,接続詞,*,*,*,*,*,いっぽう,イッポウ,イッポー
そもそも,555,555,4784,接続詞,*,*,*,*,*,そもそも,ソモソモ,ソモソモ
では,555,555,5262,接続詞,*,*,*,*,*,では,デハ,デワ
そういや,555,555,5420,接続詞,*,*,*,*,*,そういや,ソウイヤ,ソーイヤ
かたや,555,555,5368,接続詞,*,*,*,*,*,かたや,カタヤ,カタヤ

The data in each field is roughly as follows.

Field 1 - A word to be used for tokenization
Field 2 - Left cost
Field 3 - Right cost
Field 4 - Word cost
Fields 5-10 - Part of speech
Field 11 - Base form
Field 12 - Reading
Field 13 - Pronunciation

Fields 2, 3, and 4 have to do with the statistical model for tokenization. For the purposes of constructing your custom dictionary, treat fields 2 and 3 as magic numbers mapping to part of speech. Column 4 is the "cost" of the word itself. The lower the cost of the word, the more likely it is to be used in a tokenization. Fields 5-10 should be copied from the appropriate MeCab CSV files. I don't know enough about Japanese to know the differences between fields 11, 12, and 13.

Once you have your custom dictionary ready, drop it into the dictionary source directory along with the rest of the MeCab CSV files.

Set up Solr


Set up your servlet container and deploy the Solr WAR file. Make sure that your servlet container expands the war file so that you can access its contents. The expanded Solr webapp directory will be referred to as $WEBAPP. If the directory $WEBAPP/WEB-INF/classes does not exist, create it.

Open a terminal and find the Solr source code you downloaded. This directory will be referred to as $SOLRSRC. Run the following commands.

> cd $SOLRSRC/lucene/analysis/kuromoji
> ant compile-tools

Compile your dictionary


At this point, we should have everything necessary to compile the custom dictionary. Run the following commands.

> cd $SOLRSRC/lucene/build/analysis/kuromoji/classes/tools/
> java -cp ".:$WEBAPP/WEB-INF/lib/*" \
org.apache.lucene.analysis.ja.util.DictionaryBuilder \
ipadic $DICTSRC $WEBAPP/WEB-INF/classes euc-jp false

Once this completes, you can inspect the files that it created in $WEBAPP/WEB-INF/classes. There will be a deep hierarchy of directories, and then nine binary files that make up your dictionary. One of the JAR files in the lib directory contains a set of files with the same names as these, but the Java servlet spec says that the servlet container should first look in the classes directory, then look in the lib directory. Having your dictionary in the classes directory will override the dictionary packaged with your Solr distribution.

You should now have a Solr instance with a custom Japanese dictionary for tokenization. Start up your servlet container and test it out.

Tuesday, March 19, 2013

Custom Japanese tokenization in Solr 4.0

Solr 4.0 (really, it's been there since 3.6) has a new analysis module for handling Japanese, called Kuromoji. Kuromoji was developed by Atilika, Inc., who donated it to Solr. I don't speak Japanese myself, but I've been doing some preliminary tests with a Japanese coworker, and it seems to work fairly well.

However, it's not perfect. It will miss an occasional phrase, and is especially problematic with domain specific phrases. To get around this, the Japanese tokenizer accepts a user dictionary, where you can list custom tokenizations. Unfortunately, I couldn't find any documentation on the format for the user dictionary. Fortunately, there is a sample user dictionary in the unit tests for the Japanese tokenizer.


# Custom segmentation for long entries
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞

# Custom reading for sumo wrestler
朝青龍,朝青龍,アサショウリュウ,カスタム人名

# Silly entry:
abcd,a b cd,foo1 foo2 foo3,bar
abcdefg,ab cd efg,foo1 foo2 foo4,bar


It's a fairly straight forward CSV format. A hash character (#) starts a comment that continues to the end of a line. Empty lines are ignored. Each non-empty line has four fields separated by commas. After some testing, I was able to figure out what each field in the CSV was.

  1. Untokenized phrase
  2. Tokenized phrase
  3. Reading, or pronunciation
  4. Part of speech

There are a couple particulars you need to be aware of when putting together your user dictionary.

Every field is required - If you do not have all four fields your core will not load properly.

Tokenized phrase and reading - These fields are lists of words delimited by spaces. It is important that both the tokenized phrase and the reading have the same number of words. If you don't have this, your core will not load properly.

Spaces around commas - The CSV parser is very picky about format. You should never have any spaces surrounding the commas separating fields. Your core may or may not load, but can get other strange errors during tokenization.

I haven't tested this extensively, but I don't believe there is any way to escape a comma or a hash character. The CSV parser will accept fields surrounded by quote marks, but putting a comma or hash inside a quoted string does not seem to change how it is interpreted. Fortunately, this seems like it would be a very rare use case.

Thursday, January 3, 2013

Grokking Solr Trie Fields

I've been trying to wrap my head around Solr's trie field types for the past week, and finally made a break through.

Trie


The first thing you need to understand is the idea of a trie data type. Here's a basic outline of the data structure. Let's say you want to index the following list of words.

bad
bag
bar
bin
bit

We can arrange these words in a tree structure as follows.

b--a--d
|  |
|  |--g
|  |
|  \--r
|
\--i--n
   |
   \--t

To reconstruct one of the words you start at the root node of the tree, and work your way towards a leaf node, keeping track of the letters you encounter along the way. Depending on what you're using the trie data structure for, you may store some piece of data in one of the leaf nodes. If you want more information about the trie data structure, I will refer you to the Wikipedia page, which is fairly complete and easy to understand.

Solr's version of the trie


Another term for a trie is a "prefix tree". Solr uses this idea of prefixes to index numbers so that it can perform range queries efficiently. Just like we can organize words into tries, we can also organize numbers into tries. Unfortunately, the Lucene index that lies at the heart of Solr has no concept of a trie, so Solr has to do a bit of a hack to represent a trie. It does this by storing each number multiple times at different levels of precision.

Let's say I want to index the integer 3735928559. For clarity, let's rewrite that in hexadecimal, 0xDEADBEEF. When we index this using a TrieIntField, Solr stores the integer four times at different levels of precision.

0xDE000000
0xDEAD0000
0xDEADBE00
0xDEADBEEF

What Solr is doing here is constructing numbers with different length prefixes. This would be equivalent to a trie with this structure.

DE--AD--BE--EF

The reason that this allows for fast range queries is because of what the prefixes represent. The prefixes represent a range of values. It might be better to think of them indexed like this, instead.

0xDExxxxxx
0xDEADxxxx
0xDEADBExx
0xDEADBEEF

Each "x" represents an unset digit. That means the entry 0xDEADxxxx represents every number from 0xDEAD0000 to 0xDEADFFFF. You can get a better feel for this if you play around in the analysis section of the Solr admin console.

Precision Step


The option to set the precision step was the part that I understood the least. The available documentation is rather dense and unhelpful. The precision step lets you tune your index, trading range query speed for index size. A smaller precision step will result in a larger index and faster range queries. A larger precision step will result in a smaller index and slower range queries.

In the example above, I was using a precision step of 8, the default. What the precision step means is how many bits get pruned off the end of the number. Let's see what would happen if we indexed 0xDEADBEEF with a precision step of 12.

0xDExxxxxx
0xDEADBxxx
0xDEADBEEF

And here with a precision step of 4.

0xDxxxxxxx
0xDExxxxxx
0xDEAxxxxx
0xDEADxxxx
0xDEADBxxx
0xDEADBExx
0xDEADBEEx
0xDEADBEEF

As you can see, compared to the default precision step of 8, a precision step of 4 doubled the number of entries in the index. The way it speeds up range searches is by allowing better granularity. If I wanted to search for the documents matching the range 0xDEADBEE0 to 0xDEADBEEF with the default precision step, I would have to check all 16 records in the index and merge the results. With the precision step of 4, I can check the one record for 0xDEADBEEx and get the results I want.

That's a bit of a cherry picked example, but arbitrary range queries will be faster. How that works is left as an exercise for the reader.