Mental Detritus: 2012

Sunday, November 11, 2012

Custom Solr token filter factories with arguments

Many of the token filter factories, tokenizer factories, and char filter factories that come bundled with Solr accept parameters from schema.xml. The documentation for writing your own filters and tokenizers doesn't include any details on how to access these parameters, but it's pretty easy to figure out by inspecting the source code for one of the included factories.

The MappingCharFilterFactory takes a path to a file as a parameter. The Javadoc for the MappingCharFilterFactory shows the declaration to put in schema.xml.

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>

In the source code for the MappingCharFilterFactory we can find the following line.

mapping = args.get("mapping");

A look through the class hierarchy shows that all token filter factories, tokenizer factories, and char filter factories are descendants of AbstractAnalysisFactory, where args is declared as a protected variable. All you have to do to access the parameters passed from schema.xml is read them from the args map. The map can also be retrieved via the getArgs() method.

Wednesday, October 31, 2012

Solr 4.0 and the BaseTokenFilterFactory

At work, we're upgrading from an ancient version of Lucene to the shiny new Solr 4.0. Unfortunately, the documentation on the Lucene wiki hasn't quite caught up with the most recent version of the software. I would fix omissions like this as I found them, but the wiki does not seem to accept public edits.

There were three classes in Solr 3.6 for creating custom analyzers. Those classes were the BaseCharFilterFactory, BaseTokenizerFactory, and the BaseTokenFilterFactory. These three classes were in the documented package, org.apache.solr.analysis, through Solr 4.0 ALPHA, but as of the BETA release, they have been moved and renamed.

In Solr 4.0, the new classes are the CharFilterFactory, TokenizerFactory, and TokenFilterFactory. They can be found in the org.apache.lucene.analysis.util package, which is part of Lucene's analyzers-common project.

Thanks to comment 7 for pointing this out.

Sunday, August 5, 2012

Close, but no cigar

Anita Sarkeesian is the author of Feminist Frequency, a blog where she writes and makes videos about the portrayal of women in popular culture. She has made a couple of posts analyzing movies using the Bechdel Test. The Bechdel Test first appeared in the comic Dykes to Watch Out For by Alison Bechdel. The test lays out three simple rules as follows. To pass the test, a movie must...

Have two female characters...
Who talk to each other...
About something other than a man.

The good parts

In Sarkeesian's latest post about the Bechdel test, The Oscars and the Bechdel Test, she uses the test on the 2011 Oscar nominees. On the whole, Sarkeesian does a good job of analyzing the movies and applying the test in a sensible manner. For example, she notes that the Bechdel Test was not originally conceived of as a serious metric.

Let’s remember that this was made as a bit of a joke to make fun of the fact that there are so few movies with significant female characters in them. The reason the test has become so important in recent years is because it actually does highlight a serious and ongoing problem within the entertainment industry.

I also agree with her analysis of how application of the Bechdel Test can be useful.

Again, to be clear this test does not gauge the quality of a film, it doesn’t determine whether a film is feminist or not, and it doesn’t even determine whether a film is woman centered. Some pretty awful movies including ones that have stereotypical and/or sexist representations of women might pass the test with flying colours. Where really well made films that I would highly recommend might not.

She goes on to note that the Bechdel Test is most informative when applied in aggregate to a group of films. Her choice to use the 2011 Oscar nominees is also good, as it lessens the chance of selection bias.

The Rest

Unfortunately, with all the good things she has to say, her post has one major flaw that undermines any conclusions that can be drawn from her analysis.

In response to the Bechdel Test, I’m often asked, well, what about the reverse? “Why isn’t there also a test to determine if two men talk to each other about something other then a woman”. The answer to that is simple, the test is meant to indicate a problem, and there isn’t a problem with a lack of men interacting with one another. The Bechdel test is useful because it can point out an institutional pattern and since there’s no problem with men and men’s stories being underrepresented in films, the reverse test is not useful or relevant.

Her dismissal of a Reverse Bechdel Test is very misguided. In fact, not only is the Reverse Bechdel Test relevant, I'll go even further and say that the Bechdel Test is useless without it. To demonstrate this, let me give you a similarly flawed analysis of labor statistics.

Historically, women have been underrepresented in technical positions, such as engineers and medical doctors. Unfortunately, these problems persist even today. According to the United States Bureau of Labor Statistics, in 2011 there were only 198,000 female software developers in the United States. There are similarly paltry numbers in psychology, with only 140,000 female psychologists during 2011.

Anyone should be able to see the obvious flaw in this paragraph. I've omitted the number of men working in these disciplines. Let's apply the same logic here that Sarkeesian used to dismiss the Reverse Bechdel Test.

What about the number of men in these disciplines? The answer to that is simple, these statistics are meant to indicate a problem, and there isn't a problem with lack of men in these fields. These statistics are useful because they can point out an institutional pattern, and since there's no problem with men being underrepresented in these fields, the number of men in these fields is not useful or relevant.

Unfortunately, this reasoning fails to support its conclusions when you look at all the relevant data. It is true that women are underrepresented among software developers. Compared to the 198,000 women working as software developers, there are over 840,000 men working as software developers. However, it is not the case at all that women are underrepresented among psychologists. While there are 140,000 female psychologists, there are only 56,000 male psychologists. That comes to women making up 19% and 71% of these fields respectively.
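The two percentages can be checked with a few lines of shell. This is only a sketch: the function name percent_women is made up for illustration, the figures are the ones quoted above (in thousands), and "over 840,000" is treated as exactly 840,000.

```bash
#!/bin/bash

# Share of women in a field as a whole-number percentage, rounded to nearest.
#   $1: number of women   $2: number of men
# (percent_women is a hypothetical helper, not part of any tool.)
percent_women() {
    local women=$1
    local men=$2
    local total=$((women + men))
    # Integer arithmetic only: adding half the divisor rounds to the nearest percent.
    echo $(( (women * 100 + total / 2) / total ))
}

# Figures quoted in the post, in thousands of workers.
developers=$(percent_women 198 840)
psychologists=$(percent_women 140 56)

echo "Software developers: ${developers}% women"
echo "Psychologists: ${psychologists}% women"
```

The point of the calculation is the denominator: each percentage only exists because the number of men is included in the total.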

"Ah, ha!" you might say, "But Sarkeesian has already accounted for this. She notes that 2 out of 9 movies clearly pass the Bechdel test. That's only 22%."

The problem here is that she is comparing the wrong things. That 22% is a ratio between those movies that pass the test and those that do not. This would be equivalent to comparing the number of women who are psychologists and the number that are not. Of course, this is silly, which is why we compare the number of psychologists that are women to the number of psychologists that are men. Similarly, to make sense of the Bechdel Test, we need to compare the number of movies that pass the test against the number of movies that pass the same test with the genders reversed.

If you are still not convinced that the Reverse Bechdel Test is relevant, then I have one simple question for you. What percentage of movies should pass the Bechdel Test, and how do you arrive at that number?

Postscript

For the sake of clarity, I'd like to follow up with a couple points.

First, the two employment statistics I selected were clearly cherry-picked. I selected one where women were in the clear minority, and another where they were in the clear majority. This was so I could demonstrate that while a flawed analysis can't affirm a position, it can't disprove it either. My use of these statistics should not be misconstrued to say anything about representation of women in the work force in general. If you look at all the statistics, it's clear that women are still underrepresented in STEM fields. It certainly took a while for me to find a suitable statistic with women in the majority.

Second, this post is only to point out that Sarkeesian's conclusion is unsupported by her methods, not that her conclusion is necessarily wrong. In fact, I expect that her conclusion is entirely correct. However, without a proper analysis we have no way to accurately assess whether progress towards equality is being made, and if so, how much. We also won't have a good way of determining when the problem has been fixed.

Update: It seems I'm not the first to notice this problem. Ryan over at Mad Art Lab already covered this several months ago.

Friday, June 29, 2012

Another shell script to share

At home I get a lot of use out of the tree command line utility. It gives me a quick and easy way to look at a nested directory structure without having to leave the command line. The computers at work don't have the tree utility, and every so often I really miss it.

So, I wrote my own.

Here's a bash script that is a simplified version of the tree utility.

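#!/bin/bash

# Print the contents of a directory as a tree, recursing into subdirectories.
#   $1: prefix printed before each entry (the accumulated indentation)
#   $2: directory to list
tree() {
    local prefix=$1
    local dir=$2

    # Collect the entries by globbing. nullglob makes an empty directory
    # produce an empty list instead of the literal pattern "$dir/*".
    shopt -s nullglob
    local entries=("$dir"/*)
    shopt -u nullglob
    local count=${#entries[@]}

    if [[ $count -eq 0 ]]
    then
        return
    fi

    local i=0
    local bar="|"
    local nextPrefix="|   "
    local file filename

    for file in "${entries[@]}"
    do
        i=$((i + 1))
        # The last entry gets a corner, and its children get no vertical bar.
        if [[ $i -eq $count ]]
        then
            bar="\`"
            nextPrefix="    "
        fi
        filename=$(basename "$file")
        echo "$prefix$bar---$filename"
        if [[ -d $file ]]
        then
            tree "$prefix$nextPrefix" "$file"
        fi
    done
}

# Default to the current directory when no argument is given.
if [[ -z $1 ]]
then
    dir=$(pwd)
else
    dir=$1
fi

if [[ ! -e $dir ]]
then
    echo "Directory $dir does not exist" >&2
    exit 1
elif [[ ! -d $dir ]]
then
    echo "$dir is not a directory" >&2
    exit 2
fi

basename "$dir"
tree "" "$dir"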
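One edge case worth knowing about in a script like this is the empty directory, since the listing depends on the "$dir"/* glob. The sketch below is not part of the script; it only shows how bash expands that glob with and without the nullglob option, using a throwaway directory from mktemp. The variable names are made up for illustration.

```bash
#!/bin/bash

# Scratch directory with nothing in it (for illustration only).
empty=$(mktemp -d)

# Default behavior: a glob that matches nothing is left as the literal pattern.
shopt -u nullglob
plain=("$empty"/*)
echo "without nullglob: ${#plain[@]} entry (${plain[0]##*/})"

# With nullglob, a glob that matches nothing expands to an empty list.
shopt -s nullglob
safe=("$empty"/*)
shopt -u nullglob
echo "with nullglob: ${#safe[@]} entries"

rmdir "$empty"
```

Without nullglob, a loop over the glob runs once with a bogus entry named *, so an empty directory would be printed as if it contained a file.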

Wednesday, March 28, 2012

Back to school

I've started taking classes as part of the Embedded Systems Engineering certificate offered through UC Irvine Extension. Just last week I finished up my first course in the certificate program. It was the requisite software engineering course that I imagine every such program has. Overall, the course was fairly dull, and I can't say I got much out of it.

The deliverables for the course consisted of a five question multiple choice quiz every week, and four larger assignments. Two of these assignments were research papers, which were little more than book reports based on our assigned reading. The other two assignments were demonstrations of the Hatley-Pirbhai Methodology (HPM). HPM is mostly just a repackaged waterfall workflow, but with a requirements document following a specific structure. None of these were particularly interesting, challenging or fun.

Although the course was rather disappointing on the whole, I can't say I got nothing out of it. Much of the required reading had very little to do with software engineering and instead was more a survey of common hardware found in embedded systems. Not having an extensive background in hardware, I found a lot of new information in it. I now have a much better appreciation for things like system bus protocols, and the details of how DMA works. This was also my first exposure to a digital signal processor architecture; previously I was only familiar with MIPS and x86 type instruction sets.

Unfortunately, I only got to read about all this, and not actually get my hands dirty, but that's going to change. My second class* starts up this week, which is an intro to embedded programming, and my dev board just arrived in the mail. I haven't had a chance to play with it very much, but I can already tell this class is going to be much better than the last. The AVR board we are using has fun blinky LEDs, but sadly no piezo buzzer to annoy the roommates with.

* You may think it's odd for the first class in a curriculum to be software engineering, and you would be right. My "first" and "second" classes should have been reversed in order. I ended up taking them out of order because I started the program at an odd time with regard to class schedules. In the end, I think this will turn out for the best. I have accidentally eaten my vegetables first, which means now there's nothing left for me but dessert.