Normalizing Indexed Data
A quick blog post on a Picky tokenizer option.
Intro / Problem
On mobile devices it can be a bit annoying to enter special symbols like + or &; it would be easier to just type plus or and.
Or maybe the data contains a lot of abbreviations, like abbrev or e.g., but you’d still like the item to be found when searching for abbreviation or example.
Or maybe you’d like number 1 to be findable with one.
In the search engine domain, this is one part of text normalization; expanding abbreviations and converting numbers are two examples of it.
In Picky, this is done using the tokenizer option normalizes_words.
Tokenizer option “normalizes_words”
This option makes the tokenizer normalize words before indexing them.
The usage is very simple. Just pass an array of [regexp, replacement] pairs into the normalizes_words option, like so:
index = Picky::Index.new :normalized do
  indexing normalizes_words: [
    [/\+/, 'plus'],                # + -> plus
    [/\&/, 'and'],                 # & -> and
    [/w\//, 'with'],               # w/ -> with
    [/abbr(ev)?/, 'abbreviation'], # abbr, abbrev -> abbreviation
    [/e\.g\./, 'example given']    # e.g. -> example given (note that the dots have to survive)
  ]
end
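To make the behavior concrete: conceptually, each [regexp, replacement] pair is applied to a word, in order, before the word is indexed. Here is a minimal pure-Ruby sketch of that idea (not Picky’s actual implementation; the names are my own):

```ruby
# Hypothetical stand-in for what normalizes_words does per word:
# apply each [regexp, replacement] pair, in order, via gsub.
NORMALIZATIONS = [
  [/\+/, 'plus'],
  [/&/, 'and'],
  [/w\//, 'with'],
  [/abbr(ev)?/, 'abbreviation'],
  [/e\.g\./, 'example given']
].freeze

def normalize(word)
  NORMALIZATIONS.reduce(word) do |result, (pattern, replacement)|
    result.gsub(pattern, replacement)
  end
end

normalize('+')      # => "plus"
normalize('abbrev') # => "abbreviation"
normalize('e.g.')   # => "example given"
```

Because the pairs are applied in order, later patterns see the output of earlier ones, so keep the patterns specific enough not to interfere with each other.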
Note that the following concerns are specifically handled by their own options, and should be handled there:
- stopwords: stopwords: /\b(word1|word2|...)\b/
- case: case_sensitive: true/false
- character removal: remove_characters: /[characters]/
- character replacement: substitutes_characters_with: Picky::CharacterSubstituters::WestEuropean.new
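For contrast, here is a hedged sketch of how those options might sit alongside normalizes_words in an indexing block. The option values are illustrative, not taken from the original post:

```ruby
# Illustrative values only -- substitute your own patterns.
index = Picky::Index.new :normalized do
  indexing stopwords: /\b(and|the|of|it)\b/,
           case_sensitive: false,
           remove_characters: /['"]/, # note: + and . must survive removal
           substitutes_characters_with: Picky::CharacterSubstituters::WestEuropean.new,
           normalizes_words: [
             [/\+/, 'plus'] # + -> plus
           ]
end
```

Keeping each concern in its dedicated option keeps normalizes_words short and focused on word expansion.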
Alternatives
What if this doesn’t work for you?
No problemo! Picky is all Ruby, so feel free to either monkey patch or, probably better, preprocess the data to your heart’s content.
Have fun!