Solr DisMax gotchas: fun with stopwords and punctuation

September 17, 2010 by Ben

I have been using Apache Solr on a recent project for my employer and ran into some gotchas. Notably, a DisMax query did not work the way I expected because stopwords were configured inconsistently across the fields being searched.

I also ran into a problem with words being tokenized together with their surrounding punctuation rather than as the bare word: searching for “something” was definitely NOT the same thing as searching for “something.” (note the trailing period in the second search).
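To make the punctuation problem concrete, here is a minimal sketch in plain Python. It is not Solr code: `whitespace_tokens` and `strip_edge_punctuation` are hypothetical stand-ins for a whitespace-style tokenizer and a pattern-replace step, and the regular expression is an assumption chosen for illustration.

```python
import re

# Illustrative stand-in for a whitespace-style tokenizer (not Solr's code).
def whitespace_tokens(text):
    return text.lower().split()

# Assumed pattern for this sketch: non-word characters at either edge of a token.
EDGE_PUNCT = re.compile(r"^\W+|\W+$")

# Mimics a per-token pattern-replace step, then drops tokens that end up empty.
def strip_edge_punctuation(tokens):
    cleaned = (EDGE_PUNCT.sub("", t) for t in tokens)
    return [t for t in cleaned if t]

doc = "I was looking for something."
raw = whitespace_tokens(doc)         # last token keeps its trailing period
clean = strip_edge_punctuation(raw)  # last token is the bare word

print("something" in raw)    # False: the indexed token is "something."
print("something" in clean)  # True
```

Only the edges of each token are touched, so a word with interior punctuation such as "don't" passes through unchanged.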

The bottom line:

  • make sure stopwords are configured the same way for all the fields you are searching (or at least be sure you understand the ramifications)
  • don’t be afraid to experiment with the default “minimum-match” (mm) configuration
  • when it makes sense, strip surrounding punctuation with PatternReplaceFilterFactory in the index-time analyzer chain
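The first two points interact, and a toy model helps show why. The sketch below is a simplified approximation in plain Python, not Solr's query parser. It assumes that each query word becomes one clause spanning all searched fields, that a clause survives as long as at least one field's analyzer keeps the word, and that minimum-match counts how many clauses must match. The field names `title` and `sku`, the stopword list, and all function names are hypothetical.

```python
STOPWORDS = {"the", "of", "a"}  # hypothetical stopword list

def analyze(text, remove_stopwords):
    tokens = text.lower().split()
    return [t for t in tokens if not (remove_stopwords and t in STOPWORDS)]

def index(doc, fields):
    # fields maps field name -> whether that field removes stopwords
    return {f: set(analyze(doc.get(f, ""), rm)) for f, rm in fields.items()}

def build_clauses(query, fields):
    # One clause per query word; a clause lists the fields that kept the word.
    clauses = []
    for word in query.split():
        per_field = {}
        for f, rm in fields.items():
            tokens = analyze(word, rm)
            if tokens:
                per_field[f] = tokens[0]
        if per_field:  # a word dropped by every field produces no clause
            clauses.append(per_field)
    return clauses

def matches(indexed, clauses, mm=None):
    # mm = number of clauses that must match; None means all of them
    required = len(clauses) if mm is None else mm
    hits = sum(any(term in indexed[f] for f, term in c.items()) for c in clauses)
    return hits >= required

doc = {"title": "The Art of War", "sku": "AOW-123"}
MIXED = {"title": True, "sku": False}      # stopwords removed from title only
CONSISTENT = {"title": True, "sku": True}  # stopwords removed from both fields

mixed_clauses = build_clauses("the art", MIXED)
consistent_clauses = build_clauses("the art", CONSISTENT)

print(matches(index(doc, MIXED), mixed_clauses))            # False
print(matches(index(doc, MIXED), mixed_clauses, mm=1))      # True
print(matches(index(doc, CONSISTENT), consistent_clauses))  # True
```

In the mixed setup the word "the" still yields a clause, because `sku` keeps it, but no document contains "the" in `sku`, so requiring every clause to match returns nothing. Making the stopword configuration consistent removes that clause, and relaxing mm tolerates the miss.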

For the full, detailed write-up, head over to OpenSky’s engineering blog and read the original blog post.