Picky Stemming


This is a quick post about a new feature in Picky 4.6.6+: stemming.

Stemming

Stemming is a technique from information retrieval. Its purpose is “finding the thing” in an index, even if the thing appeared in a different form in the original text.

In other words: if we had saved the word “arguing” in the index, then when somebody searches for “argued”, the saved document should still show up, even though “arguing” and “argued” are not exactly the same word. Both, however, are about the fact that somebody argued (a point, with somebody, with themselves or with others). The words “argued” and “arguing” both resolve to the stem “argu”, which is not a word itself. This stem is what ends up in the index.
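To make the idea concrete, here is a minimal sketch in plain Ruby, independent of Picky. The `toy_stem` method is a hypothetical suffix stripper that only knows “ing” and “ed”; a real stemmer applies many more rules. The index and the `lookup` helper are likewise made up for illustration.

```ruby
# Hypothetical toy stemmer: strips a trailing "ing" or "ed".
# Real stemmers apply many more rules than this.
def toy_stem(word)
  word.downcase.sub(/(ing|ed)\z/, '')
end

# A tiny inverted index: stem => ids of the documents containing that stem.
index = Hash.new { |hash, stem| hash[stem] = [] }

documents = { 1 => "arguing", 2 => "walked" }
documents.each do |id, text|
  index[toy_stem(text)] << id
end

# Looks up the stem of the query rather than the query itself.
def lookup(index, query)
  index.fetch(toy_stem(query), [])
end

p lookup(index, "argued")  # finds document 1, which was indexed from "arguing"
p lookup(index, "walking") # finds document 2, which was indexed from "walked"
```

Neither “arguing” nor “argued” is stored anywhere in the index; only the shared stem is.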

Until now, this was not possible in Picky.

And surprisingly, it did not seem urgent, as nobody complained.

Until, of course, somebody did.

Usage

Let’s make this simple: how do you use this in Picky?

(Look up the current spec, if that is most convenient to you.)

It is very easy. Both Index#indexing and Search#searching methods offer the option stems_with.

You give it an object that responds to stem(word): it receives a tokenized word and returns the stemmed word. One such stemmer is Lingua::Stemmer. Stemming is the last step executed in the tokenization pipeline.
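Since the only requirement is an object that responds to stem(word), you are not tied to one particular stemmer. The following is a minimal sketch of such an object; the class name `SuffixStemmer` and its suffix list are hypothetical and only serve to show the shape of the interface, not to replace a real stemming algorithm.

```ruby
# Hypothetical stand-in for a real stemmer such as Lingua::Stemmer.
# The only contract assumed here is the one described above:
# the object responds to stem(word) and returns the stemmed word.
class SuffixStemmer
  DEFAULT_SUFFIXES = %w[ing ed s].freeze

  def initialize(suffixes = DEFAULT_SUFFIXES)
    # Longest suffix first, so that "ing" is tried before "s".
    @suffixes = suffixes.sort_by { |suffix| -suffix.length }
  end

  def stem(word)
    # Only strip a suffix if something is left over afterwards.
    suffix = @suffixes.find do |candidate|
      word.length > candidate.length && word.end_with?(candidate)
    end
    suffix ? word[0...-suffix.length] : word
  end
end

stemmer = SuffixStemmer.new
puts stemmer.stem("arguing") # argu
puts stemmer.stem("argued")  # argu
puts stemmer.stem("picky")   # picky (no known suffix, returned unchanged)
```

Assuming Picky calls nothing but stem(word) on the object, an instance of this class could be handed to stems_with in place of Lingua::Stemmer.new.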

Therefore, if you want stemmed words in the index, use this:

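require 'lingua/stemmer' # provided by the ruby-stemmer gem

index = Picky::Index.new :stemming do
  # Stem every token before it is stored in the index.
  indexing stems_with: Lingua::Stemmer.new
  category :some_text_that_needs_to_be_stemmed
end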

Usually, if you use stemming, you also want search terms to be stemmed when searching (otherwise a search for “arguing” will not find a document containing “argued”, since the index only holds the stem “argu”).

search = Picky::Search.new index do
  # Stem search terms the same way the indexed tokens were stemmed.
  searching stems_with: Lingua::Stemmer.new
end

But as usual, Picky is flexible and leaves that decision up to you. It could be that you are writing a stem search, where you don’t stem the search terms. Or the data you index already consists of stems, so no stemming is needed (or even allowed) when indexing, and you only need to stem the user’s input.

A word of caution

If somebody searches for e.g. “Arguing!”, and you don’t remove the “!” (either by declaring it illegal in the tokenizer, or by splitting on it), then Picky won’t stem it, since the stemmer doesn’t know what to do with “Arguing!”. It would, however, be perfectly able to stem “Arguing”. Consider yourself warned so we don’t have to argue later on.
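The effect is easy to reproduce outside of Picky. In this sketch, both the toy stemmer and the `remove_illegal_characters` method are hypothetical stand-ins: the first for a real stemmer, the second for whatever cleaning the tokenizer does before stemming.

```ruby
# Hypothetical toy stemmer, standing in for a real one.
toy_stemmer = Object.new
def toy_stemmer.stem(word)
  word.sub(/(ing|ed)\z/, '')
end

# Hypothetical cleaning step: drop everything except letters, digits and whitespace.
def remove_illegal_characters(text)
  text.gsub(/[^a-z0-9\s]/i, '')
end

raw = "Arguing!"

# Without cleaning, the "!" hides the suffix from the stemmer.
puts toy_stemmer.stem(raw.downcase) # arguing!

# With cleaning first, the stemmer sees a plain word.
puts toy_stemmer.stem(remove_illegal_characters(raw).downcase) # argu
```

Because stemming runs last, anything the earlier steps leave on the word is what the stemmer has to deal with.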

Why anybody would search for “Arguing!”, I don’t know. I could for example see Paul Ryan search for: “Arguing and debating, how does it work?”
