Saturday, April 13, 2013

Losing is Fun: A minimal tutorial for maximum Fun

Dwarf Fortress is an incredible game, but the menu system presents a significant hurdle for newbies. I recently got started and had to rely on the wealth of information available at the Dwarf Fortress Wiki, but in the end I think I regret using it too much.

The motto of Dwarf Fortress is "Losing is Fun." You cannot "win" this game. Your fortress will fall eventually. Seeing how your fortress falls is all part of the Fun of Dwarf Fortress. This way, you learn from your mistakes, so hopefully each fortress will be less disastrous, or at least more spectacular, than the last. If you read too much about game mechanics before you start playing, you miss a lot of this Fun. This tutorial is meant to give you a minimal boost, to get you over the initial non-Fun of learning how to navigate the arcane set of menus, without giving away any Fun-ruining spoilers.

Just to give you a taste of some of the problems with the menu system, scrolling through menu options is inconsistent and haphazard. Sometimes you use arrow keys. Sometimes you use the number pad. Sometimes you use (+) and (-). Sometimes you use (u, h, k, m). You will have to pay attention to the on-screen instructions for using each menu.

Worldgen


The first step in Dwarf Fortress is creating the world your dwarves are going to inhabit. Select the Create New World! option. The world generation menu is pretty straightforward. To begin with, use the default settings. World generation takes a few minutes to complete, so go look at cat pictures or something while your computer does its thing.

World generation is a one-time thing. When your first fortress falls, don't generate a new world. Just set up a new fortress somewhere else in the same world. One of the coolest aspects of Dwarf Fortress is that the world is persistent, and incredibly detailed. That means that you can go back later to reclaim the ruins of an old fortress.

Embark!


Embarking just means choosing where to stake your claim. Select the Start Playing menu, and the Dwarf Fortress option. These menus are pretty straightforward, too. Use the commands across the bottom of your screen to select a location.

There are a couple keys to finding a good site. You want a site rich in raw materials, like wood, rock, metals and animals. Aquifers are bad. Flux is good. Once you've found a place that looks good you can Embark (e). Learning where to embark is Fun.

Once you have embarked, you will see a screen full of incomprehensible characters. You can get information about what each symbol means by using the Look (k) command. When you place the cursor over a square of the map, you will see information about what is occupying that square on the right-hand side of your screen. Take some time to get a basic idea of what your surroundings look like.

If you ever get lost somewhere in the menu system, just keep hitting Esc repeatedly until you get back to the main screen. If you hit Esc too many times, it will just toggle between the main screen and the options menu.

To avoid ambiguity, all menu options will be given as a series of keystrokes as if you were navigating there from the main screen.

Strike the Earth!


One of the most important activities is mining. To start mining, open the Designation menu and select Mine (d, d). Use the arrow keys to mark sections of the map you would like to excavate. Parts of the map that you can mine are going to be solid black. If you don't see any solid black areas of the map, skip ahead to the section on Z-Levels, then come back here.

If you accidentally mark an area for excavation that you didn't mean to, you can use the Remove Designation (d, x) option to unmark these areas. When you unpause the game (space bar), your dwarves will get to work carrying out your wishes.

You should notice that while marking places for excavation, you never issued an order to a dwarf. In Dwarf Fortress, you only have indirect control over your dwarves. All you can do is issue a general order that this or that should be done. If a dwarf with that particular skill is available, then he or she will do the job. If no dwarf has the skill you need, then the job doesn't get done. This puts Dwarf Fortress about halfway between SimCity and The Sims. SimCity does not simulate the actions of individual people. The Sims does, but you have to micromanage every action that your Sims take. The gameplay of Dwarf Fortress presents some interesting challenges, because your dwarves will sometimes act in Fun and unpredictable ways.

Buildings


One of the more confusing things about Dwarf Fortress is what does and does not count as a "building". Things like tables and chairs count as "buildings", and are in the same menu as wells and bridges. Even more confusing is the error message you get when you try to build some of these "buildings". If you open up the Building menu and try to "build" a table (b, t), you will get the error message "Needs table". WTF does that mean?

For an object like a table, "building" the table really means placing a fully constructed table somewhere for use. Before you can place a table somewhere for use, you must "construct" the table in the first place.

Some of the most important buildings in the game are Workshops (b, w). Workshops are where your dwarves turn raw materials into useful objects. You will need workshops, but figuring out which workshops you need, and for what, can lead to a lot of Fun.

Most buildings don't do anything on their own. In order for them to be useful, you have to designate tasks to be carried out by your dwarves at the buildings. You can interact with a building using the Set Building Tasks/Prefs command (q). In this mode, when you move your cursor near a building, you will get a menu of what you can do with that building. This will be very important with your workshops early in the game.

Z-Levels


The world of Dwarf Fortress is a three-dimensional one. The map that the game presents to you is really just a horizontal cross section of this three-dimensional world. One single cross section is called a Z-level. That's because typically a flat plane is represented with two coordinates, X and Y. Three dimensions are represented by X, Y and Z, where Z denotes the vertical dimension. I like to think of Z-levels as the pictures that an MRI or CT scan produces: horizontal cross sections that you have to visualize stacked on top of each other.

To navigate down to lower Z-levels use the (>) key. To navigate to higher Z-levels use the (<) key. In order to dig between Z-levels, you will need to construct stairs or ramps. The various flavors of stairs and ramps can be found in the Designations menu alongside the Mine command. These will need to be placed correctly in order for your dwarves to gain access to other Z-levels. You should experiment and have some Fun with this.

Miscellaneous Tips and Hints


Here are a couple more menus you should familiarize yourself with early on. Zones (i) and Stockpiles (p) are important. Play around with them to see what they do. Create a stockpile and see how your dwarves react to it. If it didn't seem to do anything, try creating a different type of stockpile. Do the same with zones.

The View Units (v) menu is also very important. In this mode, when you move your cursor near a creature, you will see some of the creature's stats and characteristics. Of particular importance is the Labor menu (v, p, l). It lets you designate which types of tasks a particular dwarf will perform.

My last advice is to not get too frustrated. Remember, "Losing is Fun." If your fortress falls, just start a new one. In my own process of learning the game, I try to focus on learning one thing at a time. For example, focus on how to acquire a particular raw material. Once you've acquired the raw material, see if you can figure out what that raw material is useful for. Nearly everything in this game has a use of some type.

On to the Fun


That should be enough information to at least get you started. I've deliberately left out a lot of key information that will hopefully result in a lot of Fun in your early fortresses. Once you've learned where to embark and familiarized yourself with the menus in this tutorial, you should be able to put together a fairly successful fortress. The key is to learn from your mistakes so you have a completely new and unexpected type of Fun in your next fortress.

To recap, these are the key menus you need to be familiar with to get your first fortress off (in?) the ground.

  • Look (k) - See what stuff is on a particular tile.
  • Designations (d) - Important commands like Mine can be found here
  • Building (b) - Used to erect buildings and place furniture around your fortress
  • Set Building Tasks/Prefs (q) - Interact with finished buildings
  • Zones (i) - Designate areas for certain uses
  • Stockpiles (p) - Designate areas for certain other uses
  • View (v) - View and set your dwarves' characteristics
And finally, access the third dimension by navigating Z-levels using (<) and (>).

Have Fun. :-)

Thursday, March 21, 2013

Compiling a custom dictionary for Kuromoji and Solr

The user dictionary functionality of Solr's Kuromoji tokenizer is extremely useful and easy to use, but it isn't always the right tool for the job. In my case, I'm migrating our system off of the MeCab tokenizer. MeCab also allows you to customize the tokenization, but the two models are completely different. In Kuromoji's user dictionary, you take an untokenized phrase and provide the custom tokenization. In MeCab, you just supply a list of words to augment its dictionary. The only way to migrate MeCab's custom dictionary to Kuromoji's user dictionary is by hand.

Fortunately, Kuromoji uses the same base data set as MeCab to build its statistical model for tokenization, and all the data in the base data set is in the same format as the custom MeCab dictionary. Getting Kuromoji to use the custom MeCab dictionary just requires recompiling Kuromoji's dictionary. Figuring out how to do this was surprisingly painless.

In order to compile your new dictionary, you will need...
  1. A copy of the MeCab-IPADIC data
  2. A copy of the Solr source code
  3. A Solr distribution of the same version as your source code
  4. A servlet container in which to run Solr

Download MeCab-IPADIC


A tarball of the MeCab dictionary can be downloaded from SourceForge at the following link.


Unpack the tarball to a directory of your choice. From now on I will be referring to this directory as the "dictionary source" directory, or as $DICTSRC in code examples. If you look inside the dictionary source directory, you will see several CSV files. These files use the EUC-JP character encoding scheme. Any custom dictionary will need to be in the same format.

If you open up one of the CSV files, you will see something like this.

いっぽう,555,555,5224,接続詞,*,*,*,*,*,いっぽう,イッポウ,イッポー
そもそも,555,555,4784,接続詞,*,*,*,*,*,そもそも,ソモソモ,ソモソモ
では,555,555,5262,接続詞,*,*,*,*,*,では,デハ,デワ
そういや,555,555,5420,接続詞,*,*,*,*,*,そういや,ソウイヤ,ソーイヤ
かたや,555,555,5368,接続詞,*,*,*,*,*,かたや,カタヤ,カタヤ

The data in each field is roughly as follows.

Field 1 - A word to be used for tokenization
Field 2 - Left cost
Field 3 - Right cost
Field 4 - Word cost
Fields 5-10 - Part of speech
Field 11 - Base form
Field 12 - Reading
Field 13 - Pronunciation

Fields 2, 3, and 4 have to do with the statistical model for tokenization. For the purposes of constructing your custom dictionary, treat fields 2 and 3 as magic numbers mapping to part of speech. Field 4 is the "cost" of the word itself. The lower the cost of the word, the more likely it is to be used in a tokenization. Fields 5-10 should be copied from the appropriate MeCab CSV files. I don't know enough about Japanese to know the differences between fields 11, 12, and 13.
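To make the field layout concrete, here is a minimal Python sketch that assembles one 13-field line and encodes it as EUC-JP. The function name format_entry is my own invention for illustration and is not part of MeCab or Kuromoji. The values are copied from the first sample line above, and the sketch assumes no field contains a comma.

```python
def format_entry(surface, left_cost, right_cost, word_cost, pos,
                 base_form, reading, pronunciation):
    """Return one dictionary line in the 13-field layout described above.

    format_entry is a hypothetical helper for illustration only.
    """
    if len(pos) != 6:
        raise ValueError("fields 5-10 need exactly six part-of-speech values")
    fields = [surface, left_cost, right_cost, word_cost, *pos,
              base_form, reading, pronunciation]
    return ",".join(str(field) for field in fields)


# Values taken from the first sample line shown earlier in this post.
line = format_entry("いっぽう", 555, 555, 5224,
                    ["接続詞", "*", "*", "*", "*", "*"],
                    "いっぽう", "イッポウ", "イッポー")

# The CSV files are EUC-JP, so encode before writing to disk.
encoded = (line + "\n").encode("euc_jp")
```

Writing the encoded bytes to a file opened in binary mode should give you something you can drop next to the other CSV files.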

Once you have your custom dictionary ready, drop it into the dictionary source directory along with the rest of the MeCab CSV files.

Set up Solr


Set up your servlet container and deploy the Solr WAR file. Make sure that your servlet container expands the war file so that you can access its contents. The expanded Solr webapp directory will be referred to as $WEBAPP. If the directory $WEBAPP/WEB-INF/classes does not exist, create it.

Open a terminal and find the Solr source code you downloaded. This directory will be referred to as $SOLRSRC. Run the following commands.

> cd $SOLRSRC/lucene/analysis/kuromoji
> ant compile-tools

Compile your dictionary


At this point, we should have everything necessary to compile the custom dictionary. Run the following commands.

> cd $SOLRSRC/lucene/build/analysis/kuromoji/classes/tools/
> java -cp ".:$WEBAPP/WEB-INF/lib/*" \
org.apache.lucene.analysis.ja.util.DictionaryBuilder \
ipadic $DICTSRC $WEBAPP/WEB-INF/classes euc-jp false

Once this completes, you can inspect the files that it created in $WEBAPP/WEB-INF/classes. There will be a deep hierarchy of directories, and then nine binary files that make up your dictionary. One of the JAR files in the lib directory contains a set of files with the same names as these, but the Java servlet spec says that the servlet container should first look in the classes directory, then look in the lib directory. Having your dictionary in the classes directory will override the dictionary packaged with your Solr distribution.

You should now have a Solr instance with a custom Japanese dictionary for tokenization. Start up your servlet container and test it out.

Tuesday, March 19, 2013

Custom Japanese tokenization in Solr 4.0

Solr 4.0 (really, it's been there since 3.6) has a new analysis module for handling Japanese, called Kuromoji. Kuromoji was developed by Atilika, Inc., who donated it to Solr. I don't speak Japanese myself, but I've been doing some preliminary tests with a Japanese coworker, and it seems to work fairly well.

However, it's not perfect. It will miss an occasional phrase, and is especially problematic with domain specific phrases. To get around this, the Japanese tokenizer accepts a user dictionary, where you can list custom tokenizations. Unfortunately, I couldn't find any documentation on the format for the user dictionary. Fortunately, there is a sample user dictionary in the unit tests for the Japanese tokenizer.


# Custom segmentation for long entries
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,テスト名詞

# Custom reading for sumo wrestler
朝青龍,朝青龍,アサショウリュウ,カスタム人名

# Silly entry:
abcd,a b cd,foo1 foo2 foo3,bar
abcdefg,ab cd efg,foo1 foo2 foo4,bar


It's a fairly straightforward CSV format. A hash character (#) starts a comment that continues to the end of a line. Empty lines are ignored. Each non-empty line has four fields separated by commas. After some testing, I was able to figure out what each field in the CSV was.

  1. Untokenized phrase
  2. Tokenized phrase
  3. Reading, or pronunciation
  4. Part of speech

There are a couple particulars you need to be aware of when putting together your user dictionary.

Every field is required - If you do not have all four fields your core will not load properly.

Tokenized phrase and reading - These fields are lists of words delimited by spaces. It is important that both the tokenized phrase and the reading have the same number of words. If you don't have this, your core will not load properly.

Spaces around commas - The CSV parser is very picky about format. You should never have any spaces surrounding the commas separating fields. Your core may or may not load, but you can get other strange errors during tokenization.

I haven't tested this extensively, but I don't believe there is any way to escape a comma or a hash character. The CSV parser will accept fields surrounded by quote marks, but putting a comma or hash inside a quoted string does not seem to change how it is interpreted. Fortunately, this seems like it would be a very rare use case.
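Since a bad line can keep a core from loading, it can be handy to check a user dictionary before deploying it. The following Python sketch only encodes the rules listed above as I observed them. It is not Kuromoji's actual parser, and check_user_dict_line is a hypothetical helper name.

```python
def check_user_dict_line(line):
    """Return a list of problems found in one user dictionary line.

    Hypothetical helper: it mirrors the rules described in this post,
    not the real Kuromoji parser.
    """
    # A hash starts a comment that runs to the end of the line.
    text = line.split("#", 1)[0].rstrip("\n")
    if not text.strip():
        return []  # blank and comment-only lines are ignored

    fields = text.split(",")
    if len(fields) != 4:
        return ["expected 4 fields, found %d" % len(fields)]

    problems = []
    if any(not field.strip() for field in fields):
        problems.append("empty field")
    if any(field != field.strip() for field in fields):
        problems.append("spaces around a comma")

    tokens, readings = fields[1].split(), fields[2].split()
    if len(tokens) != len(readings):
        problems.append("tokenized phrase and reading differ in word count")
    return problems


print(check_user_dict_line("abcd,a b cd,foo1 foo2,bar"))
```

An empty list means the line passed every check in the sketch, which is not a guarantee that Solr will accept it.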

Thursday, January 3, 2013

Grokking Solr Trie Fields

I've been trying to wrap my head around Solr's trie field types for the past week, and finally made a breakthrough.

Trie


The first thing you need to understand is the idea of a trie data type. Here's a basic outline of the data structure. Let's say you want to index the following list of words.

bad
bag
bar
bin
bit

We can arrange these words in a tree structure as follows.

b--a--d
|  |
|  |--g
|  |
|  \--r
|
\--i--n
   |
   \--t

To reconstruct one of the words you start at the root node of the tree, and work your way towards a leaf node, keeping track of the letters you encounter along the way. Depending on what you're using the trie data structure for, you may store some piece of data in one of the leaf nodes. If you want more information about the trie data structure, I will refer you to the Wikipedia page, which is fairly complete and easy to understand.
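If it helps to see the structure in code, here is a small Python sketch of a trie built from nested dictionaries. The end-of-word marker and the function names are my own choices for illustration. Real implementations vary quite a bit.

```python
END = "$"  # arbitrary marker meaning "a word ends at this node"


def build_trie(words):
    """Build a trie as nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True
    return root


def words_with_prefix(trie, prefix):
    """Walk down the prefix, then collect every word below that node."""
    node = trie
    for letter in prefix:
        if letter not in node:
            return []
        node = node[letter]

    found = []

    def walk(current, path):
        for key in sorted(current):
            if key == END:
                found.append(path)
            else:
                walk(current[key], path + key)

    walk(node, prefix)
    return found


trie = build_trie(["bad", "bag", "bar", "bin", "bit"])
print(words_with_prefix(trie, "ba"))
```

Looking up the prefix "ba" only ever touches the "a" branch, which is the property that makes prefix lookups cheap.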

Solr's version of the trie


Another term for a trie is a "prefix tree". Solr uses this idea of prefixes to index numbers so that it can perform range queries efficiently. Just like we can organize words into tries, we can also organize numbers into tries. Unfortunately, the Lucene index that lies at the heart of Solr has no concept of a trie, so Solr has to do a bit of a hack to represent a trie. It does this by storing each number multiple times at different levels of precision.

Let's say I want to index the integer 3735928559. For clarity, let's rewrite that in hexadecimal, 0xDEADBEEF. When we index this using a TrieIntField, Solr stores the integer four times at different levels of precision.

0xDE000000
0xDEAD0000
0xDEADBE00
0xDEADBEEF

What Solr is doing here is constructing numbers with different length prefixes. This would be equivalent to a trie with this structure.

DE--AD--BE--EF

The reason that this allows for fast range queries is because of what the prefixes represent. The prefixes represent a range of values. It might be better to think of them indexed like this, instead.

0xDExxxxxx
0xDEADxxxx
0xDEADBExx
0xDEADBEEF

Each "x" represents an unset digit. That means the entry 0xDEADxxxx represents every number from 0xDEAD0000 to 0xDEADFFFF. You can get a better feel for this if you play around in the analysis section of the Solr admin console.

Precision Step


The option to set the precision step was the part that I understood the least. The available documentation is rather dense and unhelpful. The precision step lets you tune your index, trading range query speed for index size. A smaller precision step will result in a larger index and faster range queries. A larger precision step will result in a smaller index and slower range queries.

In the example above, I was using a precision step of 8, the default. What the precision step means is how many bits get pruned off the end of the number. Let's see what would happen if we indexed 0xDEADBEEF with a precision step of 12.

0xDExxxxxx
0xDEADBxxx
0xDEADBEEF

And here with a precision step of 4.

0xDxxxxxxx
0xDExxxxxx
0xDEAxxxxx
0xDEADxxxx
0xDEADBxxx
0xDEADBExx
0xDEADBEEx
0xDEADBEEF

As you can see, compared to the default precision step of 8, a precision step of 4 doubled the number of entries in the index. The way it speeds up range searches is by allowing better granularity. If I wanted to search for the documents matching the range 0xDEADBEE0 to 0xDEADBEEF with the default precision step, I would have to check all 16 records in the index and merge the results. With the precision step of 4, I can check the one record for 0xDEADBEEx and get the results I want.
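The lists above can be reproduced with a few lines of Python. This is only a sketch of the idea, not how Solr encodes terms internally. The function name trie_terms is made up, and to keep the hexadecimal output readable it assumes the precision step is a multiple of 4.

```python
def trie_terms(value, precision_step, bits=32):
    """List the prefixes stored for value, coarsest first.

    Illustrative only: Solr's real term encoding is different.
    """
    if precision_step <= 0 or precision_step % 4 != 0:
        raise ValueError("this sketch needs a positive step divisible by 4")
    digits = bits // 4
    hex_digits = f"{value:0{digits}X}"
    terms = []
    for shift in range(0, bits, precision_step):
        dropped = shift // 4  # hex digits pruned off the end
        terms.append("0x" + hex_digits[:digits - dropped] + "x" * dropped)
    return terms[::-1]


for step in (8, 12, 4):
    print(step, trie_terms(0xDEADBEEF, step))
```

Halving the step doubles the number of entries per value, which is the index size cost described above.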

That's a bit of a cherry picked example, but arbitrary range queries will be faster. How that works is left as an exercise for the reader.
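For readers who would rather skip the exercise, here is one way to think about it, as a Python sketch. It greedily covers a range with the largest aligned prefix blocks that fit. This is my own simplification for illustration and not Lucene's actual range-splitting code. The name cover_range is hypothetical.

```python
def cover_range(lo, hi, precision_step, bits=32):
    """Cover [lo, hi] with aligned prefix blocks, using as few as possible.

    Returns (start, shift) pairs. Each pair stands for the block of
    2**shift consecutive values beginning at start. Simplified sketch,
    not Lucene's algorithm.
    """
    blocks = []
    while lo <= hi:
        shift = 0
        while True:
            wider = shift + precision_step
            if wider >= bits:
                break  # no coarser level is indexed
            size = 1 << wider
            if lo % size != 0 or lo + size - 1 > hi:
                break  # coarser block is misaligned or overshoots
            shift = wider
        blocks.append((lo, shift))
        lo += 1 << shift
    return blocks


for step in (8, 4):
    print(step, len(cover_range(0xDEADBEE0, 0xDEADBEEF, step)))
```

For the range from the example, a precision step of 8 needs 16 blocks while a step of 4 needs just one. Ranges that don't line up so neatly still benefit, because the middle of the range gets covered by a few coarse blocks and only the ragged edges need fine-grained ones.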

Sunday, November 11, 2012

Custom Solr token filter factories with arguments

Many of the token filter factories, tokenizer factories, and char filter factories that come bundled with Solr accept parameters from a schema.xml. The documentation for writing your own filters and tokenizers doesn't include any details for how to access these parameters, but it's pretty easy to figure out by inspecting the source code for one of the included factories.

The MappingCharFilterFactory takes a path to a file as a parameter. The Javadoc for the MappingCharFilterFactory shows the declaration to put in schema.xml.
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
In the source code for the MappingCharFilterFactory we can find the following line.
mapping = args.get("mapping"); 
Looking through the class hierarchy, all token filter factories, tokenizer factories, and char filter factories are descendants of AbstractAnalysisFactory, where args is declared as a protected variable. All you have to do to access the parameters passed from schema.xml is read the args map. args can also be accessed via the getArgs() function.

Wednesday, October 31, 2012

Solr 4.0 and the BaseTokenFilterFactory

At work, we're upgrading from an ancient version of Lucene to the shiny new Solr 4.0. Unfortunately, the documentation on the Lucene wiki hasn't quite caught up with the most recent version of the software. I would fix omissions like this as I found them, but the wiki does not seem to accept public edits.

There were three classes in Solr 3.6 for creating custom analyzers. Those classes were the BaseCharFilterFactory, BaseTokenizerFactory, and the BaseTokenFilterFactory. These three classes were in the documented package, org.apache.solr.analysis, through Solr 4.0 ALPHA, but as of the BETA release, they have been moved and renamed.

In Solr 4.0, the new classes are the CharFilterFactory, TokenizerFactory, and TokenFilterFactory. They can be found in the org.apache.lucene.analysis.util package, which is part of Lucene's analyzers-common project.

Handy links:
Thanks to comment 7 for pointing this out.

Sunday, August 5, 2012

Close, but no cigar

Anita Sarkeesian is the author of Feminist Frequency, a blog where she writes and makes videos about the portrayal of women in popular culture. She has made a couple of posts analyzing movies using the Bechdel Test. The Bechdel Test first appeared in the comic Dykes to Watch Out For by Alison Bechdel. The test lays out three simple rules as follows. To pass the test, a movie must...
  1. Have two female characters...
  2. Who talk to each other...
  3. About something other than a man.

The good parts

In Sarkeesian's latest post about the Bechdel test, The Oscars and the Bechdel Test, she uses the test on the 2011 Oscar nominees. On the whole, Sarkeesian does a good job of analyzing the movies and applying the test in a sensible manner. For example, she notes that the Bechdel Test was not originally conceived of as a serious metric.
Let’s remember that this was made as a bit of a joke to make fun of the fact that there are so few movies with significant female characters in them. The reason the test has become so important in recent years is because it actually does highlight a serious and ongoing problem within the entertainment industry.
I also agree with her analysis of how application of the Bechdel Test can be useful.
Again, to be clear this test does not gauge the quality of a film, it doesn’t determine whether a film is feminist or not, and it doesn’t even determine whether a film is woman centered. Some pretty awful movies including ones that have stereotypical and/or sexist representations of women might pass the test with flying colours. Where really well made films that I would highly recommend might not.
She goes on to note that the Bechdel Test is most informative used in aggregate when applied to a group of films. Her choice to use the 2011 Oscar nominees is also good, as it lessens the chance of selection bias.

The Rest

Unfortunately, with all the good things she has to say, her post has one major flaw that undermines any conclusions that can be drawn from her analysis.
In response to the Bechdel Test, I’m often asked, well, what about the reverse? “Why isn’t there also a test to determine if two men talk to each other about something other then a woman”. The answer to that is simple, the test is meant to indicate a problem, and there isn’t a problem with a lack of men interacting with one another.  The Bechdel test is useful because it can point out an institutional pattern and since there’s no problem with men and men’s stories being underrepresented in films, the reverse test is not useful or relevant.
Her dismissal of a Reverse Bechdel Test is very misguided. In fact, not only is the Reverse Bechdel Test relevant, I'll go even further and say that the Bechdel Test is useless without it. To demonstrate this, let me give you a similarly flawed analysis of labor statistics.
Historically, women have been under represented in technical positions, such as engineers and medical doctors. Unfortunately, these problems persist even today. According to the United States Bureau of Labor Statistics, in 2011 there were only 198,000 female software developers in the United States. There are similarly paltry numbers in psychology, with only 140,000 female psychologists during 2011.
Anyone should be able to see the obvious flaw in this paragraph. I've omitted the number of men working in these disciplines. Let's apply the same logic here that Sarkeesian used to dismiss the Reverse Bechdel Test.
What about the number of men in these disciplines? The answer to that is simple, these statistics are meant to indicate a problem, and there isn't a problem with lack of men in these fields. These statistics are useful because they can point out an institutional pattern, and since there's no problem with men being underrepresented in these fields, the number of men in these fields is not useful or relevant.
Unfortunately, this reasoning fails to support its conclusions when you look at all the relevant data. It is true that women are underrepresented among software developers. Compared to the 198,000 women working as software developers, there are over 840,000 men working as software developers. However, it is not the case at all that women are underrepresented among psychologists. While there are 140,000 female psychologists, there are only 56,000 male psychologists. That comes to women making up 19% and 71% of these fields respectively.
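The percentages are easy to check. Here is the arithmetic as a short Python sketch, using the rounded figures quoted above. The count of male software developers is given only as "over 840,000", so the first result is approximate.

```python
def share_of_women(women, men):
    """Percentage of a field's workers who are women."""
    return 100.0 * women / (women + men)


# Rounded 2011 figures quoted in this post.
developers = share_of_women(198_000, 840_000)
psychologists = share_of_women(140_000, 56_000)

print(f"software developers: {developers:.0f}%")
print(f"psychologists: {psychologists:.0f}%")
```

Note that both calculations need the number of men in the denominator, which is exactly the figure the flawed paragraph left out.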

"Ah, ha!" you might say, "But Sarkeesian has already accounted for this. She notes that 2 out of 9 movies clearly pass the Bechdel test. That's only 22%."

The problem here is that she is comparing the wrong things. That 22% is a ratio between those movies that pass the test and those that do not. This would be equivalent to comparing the number of women who are psychologists and the number that are not. Of course, this is silly, which is why we compare the number of psychologists that are women to the number of psychologists that are men. Similarly, to make sense of the Bechdel Test, we need to compare the number of movies that pass the test against the number of movies that pass the same test with the genders reversed.

If you are still not convinced that the Reverse Bechdel Test is relevant, then I have one simple question for you. What percentage of movies should pass the Bechdel Test, and how do you arrive at that number?

Postscript

For the sake of clarity, I'd like to follow up with a couple points.

First, the two employment statistics I selected were clearly cherry picked. I selected one where women were in the clear minority, and another where they were in the clear majority. This was so I could demonstrate that while a flawed analysis can't affirm a position, it can't disprove it either. My use of these statistics should not be misconstrued to say anything about representation of women in the work force in general. If you look at all the statistics, it's clear that women are still underrepresented in STEM fields. It certainly took a while for me to find a suitable statistic with women in the majority.

Second, this post is only to point out that Sarkeesian's conclusion is unsupported by her methods, not that her conclusion is necessarily wrong. In fact, I expect that her conclusion is entirely correct. However, without a proper analysis we have no way to accurately assess whether progress towards equality is being made, and if so, how much. We also won't have a good way of determining when the problem has been fixed.

Update: It seems I'm not the first to notice this problem. Ryan over at Mad Art Lab already covered this several months ago.