Normalizing Indexed Data
A quick blog post on a Picky tokenizer option.
Intro / Problem
On mobile devices it can be a bit annoying to enter special symbols, like + or &, and it would be easier to just enter plus or and. Or maybe there are a lot of abbreviations, like abbrev or e.g., but you'd still like to find the item when searching for abbreviation or example. Or maybe you'd like number 1 to be findable with one.
In the search engine domain, this is one part of text normalization; the examples above cover expanding abbreviations and converting symbols and numbers to words.
In Picky, this is done using the tokenizer option normalizes_words.
Tokenizer option “normalizes_words”
This option makes the tokenizer normalize words before indexing them.
The usage is very simple. Just pass an array of [regexp, replacement] pairs into the normalizes_words option, like so:
index = Picky::Index.new :normalized do
indexing normalizes_words: [
[/\+/, 'plus'], # + -> plus
[/\&/, 'and'], # & -> and
[/w\//, 'with'], # w/ -> with
[/abbr(ev)?/, 'abbreviation'], # abbr, abbrev -> abbreviation
[/e\.g\./, 'example given'] # e.g. -> example given (note that the dots have to survive any remove_characters option)
]
end
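The effect of each pair is essentially that of a chained gsub over every token. Here is a rough, stdlib-only sketch of what that means (illustrative only, not Picky's internal implementation):

```ruby
# Each [regexp, replacement] pair is applied to a token in order.
# (Illustrative sketch -- not Picky's actual internals.)
RULES = [
  [/\+/, 'plus'],
  [/&/, 'and'],
  [/w\//, 'with'],
  [/abbr(ev)?/, 'abbreviation'],
  [/e\.g\./, 'example given']
]

def normalize(token)
  RULES.reduce(token) { |t, (regexp, replacement)| t.gsub(regexp, replacement) }
end

normalize('+')      # => "plus"
normalize('w/')     # => "with"
normalize('abbrev') # => "abbreviation"
normalize('e.g.')   # => "example given"
```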
Note that
- stopwords
- case
- character removal
- character replacement
are specifically handled by their own options, and should be handled there:
stopwords: /\b(word1|word2|...)\b/
case_sensitive: true/false
remove_characters: /[characters]/
substitutes_characters_with: Picky::CharacterSubstituters::WestEuropean.new
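Each of those options is its own pass over the text. As a rough illustration of why they are kept separate from word normalization, here is a stdlib-only sketch of such a cleanup pipeline (the ordering and regexps are my assumptions, not Picky's internals):

```ruby
# Illustrative cleanup pipeline with separate passes,
# mirroring case handling, stopwords, and character removal.
# (An assumption-laden sketch -- not Picky's actual code.)
STOPWORDS = /\b(and|the|of)\b/  # cf. the stopwords option
REMOVE    = /[^a-z\s]/          # cf. the remove_characters option

def clean(text)
  text
    .downcase             # case folding (cf. case_sensitive: false)
    .gsub(STOPWORDS, '')  # stopword removal
    .gsub(REMOVE, '')     # character removal
    .squeeze(' ').strip   # tidy up leftover whitespace
end

clean('The Art & Craft of Search') # => "art craft search"
```

Because character removal runs as its own pass, a normalizes_words rule like [/e\.g\./, 'example given'] only works if the dots are not stripped away first.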
Alternatives
What if this doesn’t work for you?
No problemo! Picky is all Ruby, so feel free to either monkey-patch or, probably better, preprocess the data to your heart’s content.
Have fun!