Solr DisMax gotchas: fun with stopwords and punctuation
I have been using Apache Solr on a recent project for my employer, and I ran into some gotchas. Notably, I had issues with a DisMax query not working the way I expected because stopwords were configured inconsistently across the fields being searched.
I also ran into a problem with words being tokenized together with their surrounding punctuation, rather than as the bare word: searching for “something” was definitely NOT the same thing as searching for “something.” (note the trailing period in the second search).
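To make the punctuation problem concrete, here is a minimal Python sketch of the idea, not Solr itself. It assumes a whitespace-style tokenizer (the tokenizer is my assumption for illustration) and a regular expression that removes leading and trailing non-word characters, roughly what a PatternReplaceFilterFactory with an empty replacement would do. The pattern and the function names are hypothetical.

```python
import re

# Assumed pattern: runs of non-word characters at the start or end of a token.
SURROUNDING_PUNCT = re.compile(r"^\W+|\W+$")


def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()


def strip_surrounding_punctuation(token):
    """Drop leading/trailing punctuation; punctuation inside the word is kept."""
    return SURROUNDING_PUNCT.sub("", token)


raw_tokens = whitespace_tokenize('He said "something." twice')
clean_tokens = [strip_surrounding_punctuation(t) for t in raw_tokens]

print(raw_tokens)    # the third token still carries its quotes and period
print(clean_tokens)  # the third token is now the bare word
```

With only the whitespace split, a search for the bare word cannot match the indexed token that still carries its quotes and period; after stripping, both forms reduce to the same token.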
The bottom line:
- make sure stopwords are configured the same way for all the fields you are searching (or at least be sure you understand the ramifications)
- don’t be afraid to play with the default “minimum should match” (mm) configuration
- when it makes sense, strip surrounding punctuation with PatternReplaceFilterFactory in the field’s index-time analyzer
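The first two points are easier to see with a toy model. The Python sketch below is a deliberately simplified simulation of how a DisMax-style query behaves, not Solr's actual implementation: each query term becomes one clause that can match in any searched field, a field that removes stopwords contributes nothing to a stopword's clause, and mm decides how many clauses must match. The field names, the stopword list, and the sample document are all made up for illustration.

```python
STOPWORDS = {"the", "a", "an", "of"}


def analyze(text, removes_stopwords):
    """Lowercase, split on whitespace, optionally drop stopwords."""
    tokens = text.lower().split()
    if removes_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def dismax_matches(query, doc, fields, mm_percent=100):
    """fields maps field name -> True if that field's analyzer removes stopwords."""
    clauses = []
    for term in query.lower().split():
        # Fields whose analyzer would discard this term drop out of its clause.
        searchable = [f for f, strips in fields.items()
                      if not (strips and term in STOPWORDS)]
        if searchable:
            clauses.append((term, searchable))
    if not clauses:
        return False
    matched = sum(
        any(term in analyze(doc.get(f, ""), fields[f]) for f in searchable)
        for term, searchable in clauses
    )
    # Percentage mm rounds down; at least one clause must always match.
    required = max(1, len(clauses) * mm_percent // 100)
    return matched >= required


doc = {"title": "The Office", "author": "Ricky Gervais"}
inconsistent = {"title": True, "author": False}  # only title removes stopwords
consistent = {"title": True, "author": True}     # both remove stopwords

print(dismax_matches("the office", doc, inconsistent, mm_percent=100))
print(dismax_matches("the office", doc, consistent, mm_percent=100))
print(dismax_matches("the office", doc, inconsistent, mm_percent=50))
```

In the inconsistent setup, “the” survives only in the author field, so it becomes a required clause that nothing in the document can satisfy, and the query fails at mm=100% even though the title is an exact match. Making the stopword handling consistent removes that clause entirely, and lowering mm tolerates it.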
For the full, detailed write-up, head over to OpenSky’s engineering blog and read the original blog post.