Searching with Picky: Character Substitution Tweet
This is a post in the Picky series on its configuration. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)
What is Character Substitution?
Character substitution in a search engine is one of the first steps in the process of sanitizing your users’ input.
Examples:
ä => ae
,
ø => o
,
é => e
This is used to make the search engine indifferent to a user’s origin or way of writing.
For example, my hometown is called Zürich
, with an umlaut character, ü
.
German users will search with an ü. However, most users of the world don’t know this character, and will simply type Zurich
. So what we want is make the search engine ignore the umlaut diacritics, the two dots over the u.
How do we do this?
Usually, what search engines do is perform a sort of character substitution before putting text into the index, so Zürich
will go into the index as zurich
. For that, we character substituted ü => u
. I also lowercased it, since that is what search engines also do, to significantly save index space.
So now we have Zurich
in the index. If a user now searched for Zürich
, the search engine wouldn’t find it.
So what we do is also perform this character substitution in a query, so that if the user enters an ü
, it is replaced by an u
, making Zurich
out of Zürich
.
In a nutshell, the indexing and the querying map both Zürich
and Zurich
to Zurich
and a user will find it, regardless if they searched for my hometown with or without umlaut.
How do we do this in Picky?
Picky offers two class methods in a Picky Application
where you can define how characters are substituted, amongst other things:
default_indexing options = {}
default_querying options = {}
The default_
in the method name comes from the fact that whatever options are given, will be used for all indexing and querying unless overridden. So most of the time you will be configuring it there.
One of the options is substitutes_characters_with
and you give it a character substituter object that has a #substitute(text)
method.
Picky already includes one for west european character sets. You use it as follows:
default_indexing substitutes_characters_with: CharacterSubstituters::WestEuropean.new
I use the Ruby 1.9 hash style, key: value
, for that. The rocket I use for mapping things, map '/some/path' => controller
.
What the west european character substituter does is this:
ä => ae
,
Ä => Ae
,
ë => e
,
Ë => E
,
ï => i
,
Ï => I
,
ö => oe
,
Ö => Oe
,
ü => ue
,
Ü => Ue
,
and 22 others. See the spec if you’d like to know more.
So a query like Hände Nüsse
will be sanitized to haende nuesse
before being further processed. Again also lowercasing it, since this is usually also done.
How do I define my own character substituter?
It is extremely simple. A character substituter just needs to implement a substitute(text)
method that returns the substituted text.
See the source of the west european substituter if you want to see how I did it.
Why is it so illegibly written?
It is heavily optimized. Since this method will be called for all indexed data, and for each query, it should be performant.
The west european spec includes two performance specs for that:
describe "speed" do
it "is fast" do
result = performance_of { @substituter.substitute('ä') }
result.should < 0.00009
end
it "is fast" do
result = performance_of { @substituter.substitute('abcdefghijklmnopqrstuvwxyz1234567890') }
result.should < 0.00015
end
end
The method performance_of
is used in Picky quite often to maintain performance and notify me should anything get slower. It looks like this:
def performance_of &block
GC.disable
result = Benchmark.realtime &block
GC.enable
result
end
Conclusion
So we’ve seen
- that most search engines need a character substituter.
- that character substituter help your international users find things.
- how they are configured in Picky.
- how you can write your own.
Hope you learnt something new :)
Contributing one to Picky
If you write your own, please let me know!
Next Searching with Picky: Partial SearchShare
Previous Speccing methods called in initialize