Normalizing Indexed Data

ruby / picky / indexing

A quick blog post on a Picky tokenizer option.

Intro / Problem

On mobile devices it can be a bit annoying to enter special symbols like + or &. It would be easier to just type plus or and.

Or maybe there are a lot of abbreviations, like abbrev, or e.g., but you’d still like to find the item when searching for abbreviation, or example.

Or maybe you’d like number 1 to be findable with one.

In the search engine domain, this is one part of text normalization. The examples above are cases of expanding abbreviations and converting numbers to words.

In Picky, this is done using the tokenizer option normalizes_words.

Tokenizer option “normalizes_words”

This option makes the tokenizer normalize words before indexing them.

The usage is very simple. Just pass an array of [regexp, replacement] pairs into the normalizes_words option, like so:

index = Picky::Index.new :normalized do
  indexing normalizes_words: [
    [/\+/, 'plus'], # + -> plus
    [/\&/, 'and'], # & -> and
    [/\bw\//, 'with'], # w/ -> with
    [/\babbr(ev)?\b/, 'abbreviation'], # abbr, abbrev -> abbreviation (but leaves "abbreviation" alone)
    [/e\.g\./, 'example given'] # e.g. -> example given (note that the dots have to survive any character removal)
  ]
end

Note that some other kinds of normalization are specifically handled by their own tokenizer options and should be handled there.
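To get a feel for what such regexp/replacement pairs do to a piece of text, here is a small plain-Ruby sketch. It is not Picky's tokenizer code: the `normalize` helper and the `NORMALIZATIONS` constant are made up for illustration, and the sketch simply assumes that each pair is applied with `gsub`, in order. The last pair covers the number example from the intro.

```ruby
# Plain-Ruby illustration only; this is NOT how Picky implements the option.
# NORMALIZATIONS and normalize are hypothetical names.
NORMALIZATIONS = [
  [/\+/, 'plus'],                      # + -> plus
  [/&/, 'and'],                        # & -> and
  [/\bw\//, 'with'],                   # w/ -> with
  [/\babbr(ev)?\b/, 'abbreviation'],   # abbr, abbrev -> abbreviation
  [/\be\.g\./, 'example given'],       # e.g. -> example given
  [/\b1\b/, 'one']                     # a standalone 1 -> one
].freeze

# Applies every pair, in order, and returns a new string.
def normalize(text, pairs = NORMALIZATIONS)
  pairs.reduce(text) do |result, (pattern, replacement)|
    result.gsub(pattern, replacement)
  end
end

puts normalize('fish + chips')   # prints "fish plus chips"
puts normalize('w/ cheese')      # prints "with cheese"
puts normalize('abbrev list')    # prints "abbreviation list"
puts normalize('number 1')       # prints "number one"
```

The word boundaries (`\b`) are there so that a pattern like abbr does not fire inside a word that is already spelled out, and so that 1 is not replaced inside 10.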


What if this doesn’t work for you?

No problemo! Picky is all Ruby, so feel free to either monkey patch or, probably better, preprocess the data to your heart’s content.
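As a rough idea of what preprocessing could look like, here is a minimal sketch in plain Ruby that spells out standalone digits before the data ever reaches the indexer. Everything in it is an assumption for illustration: the `Item` struct stands in for whatever your data source yields, and `NUMBER_WORDS` and `preprocess` are hypothetical names, not part of Picky.

```ruby
# Hypothetical record type; stands in for whatever your data source yields.
Item = Struct.new(:id, :title)

# Hypothetical lookup table for spelling out single digits.
NUMBER_WORDS = { '1' => 'one', '2' => 'two', '3' => 'three' }.freeze

# Returns a new Item whose title has standalone single digits spelled out.
# Digits without an entry in NUMBER_WORDS are left untouched.
def preprocess(item)
  title = item.title.gsub(/\b\d\b/) { |digit| NUMBER_WORDS.fetch(digit, digit) }
  Item.new(item.id, title)
end

items    = [Item.new(1, 'number 1'), Item.new(2, 'route 66')]
prepared = items.map { |item| preprocess(item) }

prepared.each { |item| puts item.title }
# prints "number one", then "route 66"
```

The prepared items can then be handed to the index as usual, so the tokenizer never has to know about the conversion.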

Have fun!
