code is code

Picky Tutorial: Rails 3.2

2012-12-09T00:00:00+11:00

A quick sidenote: The main Picky site is now running at pickyrb.com

Update: Thanks to Gleb Mazovetskiy (@glebm) for his input on ActiveRecord.

Intro

You’d like to integrate a small Picky server directly into the Rails 3.2 app you are running?

This is the tutorial for you.

To make things a bit more interesting, I want to be able to filter a query with the current user – and also have an AJAX search interface.

Note that the indexes for this search will be created on startup and that they will live in your app. If you need big indexes, or a more elaborate search you should go for a separate Picky search server.

The code pieces below are quite large mostly because of the elaborate comments. In reality, the whole search clocks in at about 30 lines – and could be further reduced to about 15, without any configuration.

Files We Will Touch

Gemfile
initializers/picky.rb
model.rb
controller.rb
views/JavaScript

Gemfile

First of all, we start out by adding picky and the picky-client to the Gemfile, like so:

gem 'picky', '~> 4.9'
gem 'picky-client', '~> 4.9'

The spermy operator ~> (RubyGems’ pessimistic version constraint) allows versions from 4.9 up to, but not including, 5.0. At 5.0 the API changes, which might result in your application not running anymore.
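If you want to check what ~> 4.9 actually lets through, you can ask RubyGems directly. This is a small sketch using Gem::Requirement and Gem::Version; the candidate version numbers are just examples picked for illustration:

```ruby
require 'rubygems'

# The same constraint as in the Gemfile above.
requirement = Gem::Requirement.new('~> 4.9')

# Example version numbers, picked only for illustration.
candidates = %w[4.8.9 4.9.0 4.11.3 5.0.0]

# Pair each candidate with whether the constraint accepts it.
verdicts = candidates.map do |version|
  [version, requirement.satisfied_by?(Gem::Version.new(version))]
end

verdicts.each do |version, allowed|
  puts "#{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

Only 4.9.0 and 4.11.3 pass: anything below 4.9 or from 5.0 upwards is rejected.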

Then do a

bundle install

like the latest code preachers tell us to.

initializers/picky.rb

Here’s where you define the actual indexes and configure Picky. This is an example where we use a very generic model, imaginatively named “things”:

# Silence Picky, as an example.
#
Picky.logger = Picky::Loggers::Silent.new

# We create a new index and store it in the constant ThingsIndex.
#
ThingsIndex = Picky::Index.new :things do
  # Our keys are integers.
  # Use :to_s if you have strings.
  #
  key_format :to_i

  # Default indexing options.
  # Please see: https://github.com/floere/picky/wiki/Indexing-configuration
  # for more information.
  #
  indexing removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
           stopwords:          /\b(and|the|of|it|in|for)\b/i,
           splits_text_on:     /[\s\/\-\_\:\"\&\/]/,
           rejects_token_if:   lambda { |token| token.size < 2 }

  # We can search on the titles of the thing.
  #
  # We use postfix partials which means a word can
  # be found if only part has been entered (from the beginning).
  #
  category :title, :partial => Picky::Partial::Postfix.new(:from => 1)

  # We should also be able to search the years that the things have.
  #
  # We want the exact year, so no partial searching.
  #
  category :year,
           :partial => Picky::Partial::None.new

  # We should be able to restrict searches to a specific user.
  #
  # This needs to be an exact (non-partial) search, as we don't 
  # want user 15 to be found when searching for user 1.
  #
  # The :from designates the message used to get the user_ids.
  #
  category :user,
           :partial => Picky::Partial::None.new, 
           :from => :user_ids_as_string

end

# ThingsSearch is the search interface
# on the things index.
#
# See https://github.com/floere/picky/wiki/Searching-Configuration
# for some tokenizing options.
#
ThingsSearch = Picky::Search.new ThingsIndex

# We are indexing at the end of this initializer
# using explicit indexing.
#
# Feel free to run the initial indexing somewhere else.
#
Thing.order('title ASC').each do |thing|
  ThingsIndex.add thing
end
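Why does the user category need to be non-partial? The following is a toy model in plain Ruby, not Picky’s actual implementation: partial_keys and build_index are made-up helpers that mimic the idea of a postfix partial (every beginning of a token also points to the thing). It shows how user 15’s things would leak into a search for user 1:

```ruby
# Toy model of a postfix partial: every beginning of the token,
# starting at length `from`, becomes a key in the index.
def partial_keys(token, from = 1)
  (from..token.size).map { |length| token[0, length] }
end

# Builds a tiny inverted index (key => thing ids).
# With partial = false, only the full token is used as a key.
def build_index(ids_by_token, partial)
  index = Hash.new { |hash, key| hash[key] = [] }
  ids_by_token.each do |token, ids|
    keys = partial ? partial_keys(token) : [token]
    keys.each { |key| index[key].concat(ids) }
  end
  index
end

# Hypothetical data: user 1 owns thing 100, user 15 owns thing 200.
things_by_user = { '1' => [100], '15' => [200] }

with_partial = build_index(things_by_user, true)
exact_only   = build_index(things_by_user, false)

p with_partial['1'] # thing 200 of user 15 shows up as well
p exact_only['1']   # only thing 100
```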

Next up is the model.

model.rb

The model is straightforward: we want to index when saving a model, or delete the model from the index.

# After committing, index.
#
after_commit :picky_index

# Index correctly, depending on whether it
# was destroyed or updated/created.
#
def picky_index
  if destroyed?
    ThingsIndex.remove id
  else
    ThingsIndex.replace self
  end
end

# Since we want to index all users that have something to
# do with this thing together with it, we return a string
# of space separated user ids.
# (Picky version 5 will be able to use user_ids directly)
#
def user_ids_as_string
  user_ids.join ' '
end

If we didn’t have the special case with the user ids, we’d only have two lines in the model.

Now, the controller is a bit bigger…

controller.rb

Create a controller action and wire it up in the routes.rb correctly. For example:

resources :things do
  collection { get :search }
end

Now, back to the search action.

def search
  # This line prepends the current user to the query.
  #
  # Since we have indexed the thing's user in the
  # user category, we can prepend a filter to the
  # currently received query.
  #
  # A query like
  #   "one two three"
  # will be transformed into
  #   "user:15 one two three"
  # which will result in things only
  # being found if they are associated with the current user.
  #
  query = "user:#{current_user.id} #{params[:query]}"

  # Perform the search.
  #
  results = ThingsSearch.search query, params[:ids] || 20, params[:offset] || 0
  
  # Render each thing in the results nicely as a partial.
  #
  # (You need to have a "thing" partial file)
  #
  results = results.to_hash
  results.extend Picky::Convenience
  results.populate_with Thing do |thing|
    render_to_string :partial => "thing", :object => thing
  end
  
  # We respond with a nice JSON result.
  #
  respond_to do |format|
    format.html do
      # Homework: Make this a nice HTML results page.
      #
      render :text => "Thing result ids: #{results.ids.to_s}"
    end
    format.json do
      render :text => results.to_json
    end
  end
end

JavaScript

The javascript is a bit more elaborate.

The picky-client helper method .cached_interface (code) gives you the HTML:

<%= Picky::Helper.cached_interface %>

Picky comes with its own JS library (code, 12kB), and lots of configuration options (list).

It knows two modes of searching: full and live. Full searching is run on pressing enter and expected to return rendered results, to show them in a results list. Live searching runs while typing and only updates the counts next to the input box.

This example is a bit special as it renders live searches as if they were full ones. It’s like pressing enter while typing.

So in a JS file – or coffeescript, if you like that – insert this:

$(window).load(function() {
  pickyClient = new PickyClient({
    full: '/things/search',  // The URL that maps to our search action.
    fullResults: 50,         // Default is 20.
    live: '/things/search',  // Use the same URL as the full search.
    liveResults: 20,         // Default is 0.
    liveRendered: true,      // Render live results as if they were full ones.
    liveSearchInterval: 166, // Time between keystrokes before it sends the query.
    searchOnEmpty: true,     // Search even when the query field is empty.
    
    // beforeInsert: function(query) {  },   // Optional. Before a query is inserted via pickyClient.insert(...).
    // before: function(query, params) {  }, // Optional. Before Picky sends any data. Return modified query.
    // success: function(data, query) {  },  // Optional. Just after Picky receives data. (Get a PickyData object)
    // after: function(data, query) {  },    // Optional. After Picky has handled the data and updated the view.
  });
});

As you can see, the Picky JS interface offers you four callbacks that are called: before inserting a query (sanitize a query), before sending the query (add any filters from radio buttons, checkboxes etc.), just after receiving the data (modify the incoming data as you wish), and after updating the view (make modifications and necessary updates to the view).

This is pretty handy and is used in the cocoapods.org search (example code) to add the OS filter to the query without it being visible in the search field (but in the URL).

End

I hope this helps you get Picky into your Rails app :)

Finally, if you don’t want to index each time your app is started, you could use load and dump on the index. Perhaps like this…

In the initializer, to save the index:

at_exit do
  ThingsIndex.dump
end

To load the index:

tries = 0
begin
  exit 1 if tries > 1
  ThingsIndex.load
rescue
  tries = tries + 1
  ThingsIndex.index
  retry
end

Cheers and have fun!

Experimental Features for Picky 5

2012-11-20T00:00:00+11:00

This is a quick post about two experimental features in Picky 4.11+ that will be available stably in Picky 5.

Intro

Picky is very much driven by its users.

After I added stemming in Picky 4.6.6 following a push from John Barton and Glen Maddern of goodfil.ms fame, Andy Kitchen supplied a piece of code for automatic word segmentation, while also mentioning that he needs a range query.

They are now both available as experimental features.

Range queries

Let’s say you’d like to find all people born in 1977, 1978, and 1979. Previously, this was not too easy to do in Picky.

Now you can. Let’s look at a full copy-and-paste-able example:

require 'picky'
  
index = Picky::Index.new :people do
  key_format :to_s
  category :year
end

Person = Struct.new :id, :year

index.add Person.new('Picky',   2008)
index.add Person.new('Kaspar',  1978)
index.add Person.new('Florian', 1977)
index.add Person.new('Joe',     1955)

people = Picky::Search.new index

p people.search('1977-1979').ids
p people.search('year:1977-1979').ids
p people.search('year:1900-2010').ids

The first result will be

["Florian", "Kaspar"]

since I was born in 1977, and Kaspar was born in 1978. If you categorize it with year:1977-1979 it will yield the same result. If you only want results for a specific category, remember to categorize it by prefixing the search term or range with category_name:.

By going over the whole range, as in the third result, you’ll get

["Joe", "Florian", "Kaspar", "Picky"]

as the range year:1900-2010 includes all the results.

Range queries the Ruby way

Picky internally uses Enumerable#inject, so any range will work. For example, initial:a-d will yield results for each "a", "b", "c", and "d". Cool, eh?
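To make the idea concrete, here is a minimal sketch in plain Ruby. It does not use Picky: the years hash is a toy inverted index, and ids_in_range is a hypothetical helper that walks a range with #inject and collects the ids for each key it yields:

```ruby
# Toy inverted index for a hypothetical :year category (key => ids).
years = {
  '1955' => ['Joe'],
  '1977' => ['Florian'],
  '1978' => ['Kaspar'],
  '2008' => ['Picky']
}

# Walk the range with #inject and collect the ids stored under
# each key the range yields. Keys without entries add nothing.
def ids_in_range(inverted_index, range)
  range.inject([]) do |ids, key|
    ids + inverted_index.fetch(key, [])
  end
end

in_seventies = ids_in_range(years, '1977'..'1979')
p in_seventies # => ["Florian", "Kaspar"]

# String ranges work too, which is what makes initial:a-d possible.
initials = ('a'..'d').to_a
p initials # => ["a", "b", "c", "d"]
```

Anything that responds to #inject could be passed as range here, which is the property the next section relies on.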

Not impressed? Read on…

Custom ranges!

Andy Kitchen was happy with the range queries; however, he needed range queries that wrap around. If somebody wanted to find e.g. an event that was on between 10pm and 2am, the current range query implementation did not allow that, as event_start:10-2 did not work (#each or #inject will yield nothing).

Because Picky accepts any kind of range, he implemented a wrapping range (the version here is a slight rewrite of the original):

class Wrap12Hours
  include Enumerable

  def initialize(min, max)
    @hours = 12
    @min   = min.to_i
    @top   = max.to_i
    @top   += @hours if @top < @min
  end

  def each
    @min.upto(@top).each do |i|
      yield (i % @hours).to_s
    end
  end
end

This is then passed into an index category like this

category :hour, ranging: Wrap12Hours

to make Picky use this “ranging” for that category.

The result: If Wrap12Hours is given a range like 10-2, it will #each this: ["10", "11", "0", "1", "2"] (as strings, since index keys are strings), which is exactly what he needed.

Picky range queries use #inject, but there is no #inject on Wrap12Hours – so why does it work? Note that Andy does an include Enumerable. Enumerable#inject uses the #each method which is already there to implement #inject and some other methods. Pretty snazzy! (And, I might add, the Ruby way of doing things)
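You can try this outside of Picky. The sketch below repeats the class from above (with the yield argument parenthesized) and calls a few Enumerable methods on it. It assumes the class is constructed with the two ends of the range, as in the 10-2 example:

```ruby
class Wrap12Hours
  include Enumerable

  def initialize(min, max)
    @hours = 12
    @min   = min.to_i
    @top   = max.to_i
    @top  += @hours if @top < @min # wrap around, e.g. 10-2 becomes 10-14
  end

  # The only method we write ourselves. Enumerable builds
  # #inject, #to_a, #include? and friends on top of it.
  def each
    @min.upto(@top) do |i|
      yield((i % @hours).to_s)
    end
  end
end

late_night = Wrap12Hours.new('10', '2')

p late_night.to_a # => ["10", "11", "0", "1", "2"]

# #inject comes for free via Enumerable.
p late_night.inject([]) { |keys, hour| keys << hour }

# A non-wrapping range behaves like a normal one.
p Wrap12Hours.new('2', '5').to_a # => ["2", "3", "4", "5"]
```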

The ability to implement custom ranges is very powerful and underlines the flexibility of Picky.

Automatic word segmentation

Just a quick note on this as it is just a sketch, currently. A fully functional sketch, though.

What if you don’t want to split on a regexp as you usually would, but would like Picky to split on words in the index?

So if you had “purple”, “rainbow”, and “pony” (don’t ask) in your index, then you’d want Picky to automatically split a query like “purplerainbowpony” into “purple”, “rainbow”, “pony”.

This can be achieved by giving the search category option splits_text_on an automatic splitter rather than a regexp. The automatic splitter is initialized with the index category you’d like to use for the splitter.

automatic_splitter = Picky::Splitters::Automatic.new index[:text]

some_search = Picky::Search.new index do
  searching splits_text_on: automatic_splitter
end

That’s it!

Note that if you want to test the splitter itself you can simply call #split on it, as this is the method called by the Picky Tokenizer to split incoming queries:

automatic_splitter.split 'hellopicky' # => ['hello', 'picky']

Please give it a go and report back!

The partial option

The automatic splitter supports a partial option. This will make Picky also use the partial index.

automatic_splitter = Picky::Splitters::Automatic.new index[:text], partial: true

What does it mean? It means that it will

automatic_splitter.split 'hellopic' # => ['hello', 'pic']

correctly split off the partial ‘pic’. The non-partial version would simply split off ‘hello’:

automatic_splitter.split 'hellopic' # => ['hello']

Have fun!

As Picky grows and grows, I am especially happy that Picky is fed well by its enthusiastic and helpful users.

This is much appreciated, amigos! Keep it coming :D

Outlook for Picky 5

The above features will – after some polishing and feedback – be included into Picky 5.

Environments

After a discussion with Kaspar Schiess (my cofounder at The Technology Astronauts), I am very inclined to drop environments (ie. development, test, production) in the next Picky.

Have you ever asked yourself if you really need environments?

I hope to cover this topic in the next post.

Cheers, and have (pink, tentacly) fun!

Picky Stemming

2012-10-15T00:00:00+11:00

This is a quick post about a new feature in Picky 4.6.6+: stemming.

Stemming

Stemming is used in information retrieval, and basically serves the purpose of “finding the thing” in an index, even if the appearance of the thing was different in the original.

In other words: if we had saved the word “arguing” in the index, then when somebody searches for “argued”, the saved document should still show up, even though “arguing” and “argued” are not exactly the same word. However, both are about the fact that somebody argued (a point, with somebody, themself or others). The words “argued” and “arguing” both resolve to the stem “argu”, which is not a word itself. This stem is what ends up in the index.

This was not yet possible in Picky.

And surprisingly, it did not seem urgent, as nobody complained.

Until, of course, somebody did.

Usage

Let’s make this simple: how do you use this in Picky?

(Look up the current spec, if that is most convenient to you.)

It is very easy. Both Index#indexing and Search#searching methods offer the option stems_with.

You give it an object that responds to stem(word), which gets a tokenized word, and returns a stemmed word. One such stemmer is Lingua::Stemmer. In the tokenization pipeline, it is the last step to be executed.

Therefore, if you want stemmed words in the index, use this:

index = Picky::Index.new :stemming do
  indexing stems_with: Lingua::Stemmer.new
  category :some_text_that_needs_to_be_stemmed
end

Usually, if you use stemming, you also want search terms to be stemmed when searching (otherwise your search for “arguing” will not find “argued” in the index).

search = Picky::Search.new index do
  searching stems_with: Lingua::Stemmer.new
end

But as usual, the flexibility of Picky leaves that decision up to you: it could be that you are writing a stem-search, where you don’t stem in the search. Or you already only get stems for the index, no stemming needed (or even allowed), and you only need to stem on the user’s input.

A word of caution

If somebody searches for e.g. “Arguing!”, and you don’t remove the “!” (either by declaring it illegal in the tokenizer, or split on it), then Picky won’t stem it, since the stemmer doesn’t know what to do with “Arguing!”. It, however, would be perfectly able to stem “Arguing”. Consider yourself warned so we don’t have to argue later on.

Why anybody would search for “Arguing!”, I don’t know. I could for example see Paul Ryan search for: “Arguing and debating, how does it work?”
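To illustrate the stem(word) contract and the punctuation pitfall without installing a real stemmer, here is a deliberately naive stand-in. NaiveStemmer is a made-up class that only strips "ing" and "ed" from the end of a word. It is not the Porter algorithm and not what Lingua::Stemmer does internally; it only has the same shape (an object responding to stem):

```ruby
# A toy stemmer: anything that responds to #stem(word)
# has the shape expected by the stems_with option.
class NaiveStemmer
  SUFFIXES = /(ing|ed)\z/

  # Strips a known suffix, but only if it sits at the very end.
  def stem(word)
    word.sub(SUFFIXES, '')
  end
end

stemmer = NaiveStemmer.new

p stemmer.stem('arguing')  # => "argu"
p stemmer.stem('argued')   # => "argu"

# The trailing "!" hides the suffix from the stemmer,
# so the token passes through unstemmed.
p stemmer.stem('arguing!') # => "arguing!"
```

This is why the "!" has to be removed or split on before the stemming step, which runs last in the tokenization pipeline.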

How I develop a feature for Picky

2012-07-23T00:00:00+10:00

How do I add a feature – here: Facets – to Picky? When? Why?

Starting out 2 years ago, I had a relatively clear picture of what I was going to do in the original roadmap.

The last 3 points are:

Obtain real live octopus. Call it Picky and teach it searching tricks.
Become mayor of Krakow. Hold more Ruby conferences there. Eat all the available Polish food.
Implement coffee making capabilities.

Pets aren’t allowed by my landlord. Also, as you can see I’m still working on becoming the mayor of Krakow. Regarding the coffee making capabilities, I am still evaluating several brands of coffee, converging on Papua New Guinean blue mountain sun roasted beans.

Thankfully, world domination is already achieved. Or can you show me one of the seven seas which is not yet filled with octopi?

But seriously: Where do you go from here? Total chaos, burning lines of code? Software pattern anarchy? Class warfare?

UNDD: User Need Driven Development

I find myself often without direction regarding Picky – since I don’t use it myself for any especially challenging projects (with Picky, too, no project is challenging – just kidding), how does it get to push its own boundaries?

Thankfully, Picky has a few helpful users to push it a bit:

UNDD, aka User Need Driven Development! (Coincidentally almost the German word for “and”, ie. “und” – UNDD expressed as a sentence: “We’d like this and that and and and and…”, it basically never ends)

A week ago, UNDD happened: https://groups.google.com/forum/?fromgroups#!topic/picky-ruby/UvIxg4d1PME

David Lowenfels asked: “I am wondering if Picky can do facets?”

As with any case of UNDD, if there is no philosophical reason against including it in a framework, the answer is always:

Not yet, but…

Example: Facets

Facets – as I understand them – are a way of slicing the available data into categories and category-facets.

David gave a good example with this hiking boot page. On the left, facets are used to refine (filter) the results. In “Brand” we find “Salomon”, “Merrell”, “Timberland”, etc.

If you then choose eg. “Salomon”, only Salomon shoes are shown. And, more importantly, not all Gender refinements are available anymore, but only the ones that are relevant to the brand “Salomon”.

So, should I add that to Picky? Let’s review the official feature policy™:

Feature Philosophy

Picky’s Feature Philosophy, reprinted here:

1. If it is relatively easy to do, I write a feature myself.
2. If it is relatively easy to do, but not perfect, I write it myself too, with the option of adding an adapter to another search engine later.
3. If it is hard to do (and it is too much against Picky’s structure and way of doing things), I write a Query object that uses another search engine.

Is it easy to do?

My first reaction to David’s question was: Of course! Facets are all about filtering – and Picky is all about filtering.

Eeeeeasy. Right?

Not necessarily. Although Picky’s inverted indexes (eg. { ‘florian’ => [1, 4, 5, 19] }) already contain the right structure to get facets, it’s not so clear cut in the case where a facet already was applied as a filter.

Initially I thought that this was a #1 case, but because facets need to be filtered by other facets that are already applied, it’s squarely in #2: I can write it myself, but it might not be that easy.

How do we go about implementing this feature?

Write first

Write first. Before your code reaches perfection, just write. This could be rewritten as Stupid and works > Perfect and doesn’t.

I always write a very simple solution first, and even though it might be slow, I am happy.

Straightforward facets on the Index instance

The first stab at facets for class Picky::Index was ultra simple:

def facets category_identifier
  self[category_identifier].exact.weights
end

So I simply get the right category from the index and extract the right index. In this case the weights.

It is used like so (data is the index):

data.facets :brand

This code eg. results in:

{
  'salomon' => 3.14,
  'merrell' => 1.61,
  …
}

Nice, eh?

The actual method signature is now facets(:category, more_than: N), with the more_than option acting as a filter that only includes facets with a weight higher than N.

This is, of course, blazingly fast.
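Why is it fast? Because the data is already sitting there. Here is a rough sketch in plain Ruby, outside of Picky: brand_index is a toy inverted index, and the standalone facets method is a hypothetical stand-in. For the illustration I assume that a weight is the natural logarithm of the number of ids stored under a token:

```ruby
# Toy inverted index for a :brand category (token => ids).
brand_index = {
  'salomon'    => [1, 4, 5, 19],
  'merrell'    => [2, 7],
  'timberland' => [3]
}

# Assumption for this sketch: weight = log(number of ids).
# more_than is optional and filters out facets at or below that weight.
def facets(inverted_index, more_than: nil)
  weights = inverted_index.each_with_object({}) do |(token, ids), result|
    result[token] = Math.log(ids.size).round(2)
  end
  return weights unless more_than

  weights.select { |_token, weight| weight > more_than }
end

all_brands = facets(brand_index)
big_brands = facets(brand_index, more_than: 0.5)

p all_brands
p big_brands
```

No query is run at all: it is one pass over a hash that already exists.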

What about facet filtering?

Filtered facets on the Search instance

This one was a bit of a head scratcher. Picky does not have any indexes that would allow it to easily extract filtered facets.

What was I to do?

Remembering “write first” I simply made it work, disregarding all performance issues. Some details are omitted:

def facets category_identifier, options = {}
  weights = index.facets category_identifier, options
  
  return weights unless filter_query = options[:filter]
  
  weights.select do |key, weight|
    search("#{filter_query} #{category_identifier}:#{key}", 0, 0).total > 0
  end
end

This is used like so:

search.facets :brand, filter: 'gender:unisex', more_than: 3.14

Let’s look at the code pieces in turn:

weights = index.facets category_identifier, options

Get the facet hash we got from the facets method in the last section.

If we don’t filter:

return weights unless filter_query = options[:filter]

we simply return it as-is, as in the facets method on an index.

If we need to filter, go over all facets, and remove the ones where we get zero results when applying the filter:

weights.select do |key, weight|
  search("#{filter_query} #{category_identifier}:#{key}", 0, 0).total > 0
end

This returns a facet hash as in the other method.

Note that Picky actually runs a query for each facet.

Is this a problem? It was for David, as he had more than 100 facets. So for each of the 100 facets, a query was run.

However, facets number over 20 only in extreme cases. I’d say a more useful range is 3 to 10 (see http://www.trailspace.com/gear/boots/midweight/).

In addition to that, facet results are highly cacheable. There is no reason not to cache this result – except, of course, if the data is highly dynamic. But even then, I’d cache it for half an hour.

If you look at the last piece of code, you notice something: filter_query is passed into that search multiple times. Couldn’t that be optimized?

Clean up later

Indeed it can. But remember, we wanted to get it out and working first. This serves a dual purpose:

A user can already work with it, with the promise of it getting faster.
I am now under pressure to improve it.

The above code then resulted in this mini roadmap for facets:

~~Write first simple implementation.~~ (This can be released as “experimental”)
Improve the code by not tokenizing the filter query each time. (This can be released officially)
Optimize the code by either redefining the API, or only partially run the query. (This can be released in a white paper)

What do I mean by #2? Again, for each facet, Picky does the work of tokenizing the filter_query that is interpolated into the query. See:

search("#{filter_query} #{category_identifier}:#{key}", 0, 0).total > 0

This is bad, of course. So we could rewrite the method to only accept a pretokenized filter, something like:

search.facets :brand, filter: [['gender'], 'unisex', ['price', 'age'], 50], more_than: 3.14

So, a filter would be an array of pairs, filter categories and filter value. This would reduce the impact on Picky a lot already. However, I like the flexibility of passing in a search string to filter.

So #2 means that Picky will process the string once, and we will then use the tokenized results to put together an optimized query. Something akin to:

filter_tokens = tokenize filter_query
facets.select do |key, _|
  query_tokens = tokenize "#{category_identifier}:#{key}"
  search_with(filter_tokens + query_tokens, 0, 0).total > 0
end

Suddenly we don’t do as much work anymore. Nice.
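Here is the same idea as a toy that runs in plain Ruby. Nothing in it is Picky code: BOOTS, tokenize, total and filtered_facets are all made up for the illustration. The point is only the shape of the loop, where the filter is tokenized once, outside the block, and each facet contributes just its own small token:

```ruby
# Hypothetical data set.
BOOTS = [
  { brand: 'salomon', gender: 'unisex' },
  { brand: 'salomon', gender: 'women' },
  { brand: 'merrell', gender: 'men' }
].freeze

# Toy tokenizer: "gender:unisex" => [["gender", "unisex"]]
def tokenize(text)
  text.split.map { |token| token.split(':', 2) }
end

# Toy search: counts the boots matching all category/value pairs.
def total(tokens)
  BOOTS.count do |boot|
    tokens.all? { |category, value| boot[category.to_sym] == value }
  end
end

def filtered_facets(category, keys, filter_query)
  filter_tokens = tokenize(filter_query) # done once, not once per facet
  keys.select do |key|
    query_tokens = tokenize("#{category}:#{key}")
    total(filter_tokens + query_tokens) > 0
  end
end

unisex_brands = filtered_facets(:brand, %w[salomon merrell], 'gender:unisex')
p unisex_brands # => ["salomon"]
```

There is still one query per facet, so this does not address point #3. It only removes the repeated tokenizing.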

Point #3 is a bit harder, and usually, this is optional, or a coding/thinking goodie for later. Here, I could partially evaluate the filter query, and then use the halfway evaluated query to inject it with the variable parts (each facet), and continue running it for the final result. If this just sounded like garbled blah to you – it’s fine. It just means I have no idea how to specifically do this. Yet.

In short

This is how I develop Picky features:

Listen to the needs of your users.
Check if the need goes against the Picky grain.
Say “Not yet.”
Implement stupidly.
Release experimentally.
Say “Please try.”
Refine cleverly.
Release officially.
Leave ultra-cool rewrite for a glorious future.
Wait for next user request.

And that is it.

And faster still

2012-07-16T00:00:00+10:00

Lately I’ve been obsessed with making Picky as fast as possible (while not sacrificing any flexibility).

This post is all about exploiting Picky’s flexibility to gain speed. We’ll also push towards its extremes to see how to sacrifice some of the flexibility to gain even more speed!

So if you need a high performance Picky, or simply like to see big numbers: This is the post for you!

As is the trade off of the high priests of speed: On the altar of performance, they are going to sacrifice flexibility…

The tests

All tests are run on my MacBook Pro 2010 model with 2 cores. They are all based on the standard Picky example you get when you run:

$ picky generate server some_server_directory

We will modify that example slightly to adapt it to use different servers, however.

We run three queries of varying complexity. First, just “a” (which means “a*”), complexity 1, then “a* a”, complexity 2, then “a* a* a”, complexity 3 (see below for results of these queries). This covers more than 99% of all usual Picky search cases. As Picky is a combinatorial search engine, we expect a nonlinearly increasing query duration.

How much we will find out :)

All numbers are in requests per second.

Unicorn

Unicorn is the workhorse of the web servers. It is reliable, can use multiple cores, and has so far been the recommended server for Picky, also because it weakens the impact of GC runs.

Let’s see how it fares:

Complexity 1:	619	= (600 + 632 + 625 + 620 + 619)/5
Complexity 2:	588	= (595 + 585 + 580 + 596 + 584)/5
Complexity 3:	527	= (561 + 537 + 425 + 552 + 562)/5

Quite respectably. But we don’t want a workhorse. We want an arabian horse that shoots fire out of its nostrils! (and anywhere else, for that matter)

Thin (with Sinatra)

Thin is a very well known event machine based server. It is fast.

How fast?

Complexity 1:	1252	= (1262 + 1213 + 1270 + 1244 + 1269) / 5
Complexity 2:	1059	= (1091 + 993 + 1042 + 1097 + 1074) / 5
Complexity 3:	936	= (872 + 931 + 946 + 975 + 954) / 5

That is impressive, given that these are the numbers from one core.

Two weeks ago, this happened:

Racer (with Sinatra)

Racer by Charlie Somerville is a “Rack compliant Ruby web server”. It is mainly based on libuv. According to its README (worth a look just for the image ;) ), it is twice as fast as thin using a “Hello world!” app.

As Picky performs a bit more work than a simple “Hello world!”, it won’t be twice as fast. But how much faster will it be? Let’s see…

Complexity 1:	1370	= (1374 + 1381 + 1384 + 1374 + 1337)/5
Complexity 2:	1134	= (1243 + 1153 + 1088 + 1072 + 1115)/5
Complexity 3:	1094	= (1143 + 1081 + 1081 + 1080 + 1084)/5

Now, why don’t we get double the speed as with thin, as shown on Racer’s webpage, but just 10%? The thing is, instead of just returning “hello world”, Picky needs to do a bit of work.

Picky vs. Racer

To calculate how much of this time is needed by Picky, let’s assume “hello world” takes no time at all, and Racer is double as fast as thin. With Picky, Racer is only 10% faster than thin. What does this tell us about Picky?

Let’s calculate a bit. With the time from “hello world” ignored we know:

1:	T(thin) / T(racer) == 2
2:	(T(thin) + T(picky)) / (T(racer) + T(picky)) == 1.1

Rewriting:

3:	T(thin) + T(picky) == 1.1T(racer) + 1.1T(picky)	from 2.
4:	T(thin) - 1.1T(racer) == 0.1T(picky)	from 3.
5:	T(thin) == 2*T(racer)	from 1.
6:	0.9T(racer) == 0.1T(picky)	from 4, 5.
7:	T(picky) == 9*T(racer)	from 6.

So, Picky (including Sinatra) takes around 9 times longer than Racer. Let’s remember this for our conclusion.
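If you don’t trust my algebra, the same calculation can be checked in a few lines of Ruby. The sketch uses Rational to avoid floating point noise and measures everything in units of T(racer). The variable names are mine; the two ratios (2 and 1.1) are the ones from equations 1 and 2:

```ruby
thin_vs_racer = Rational(2)      # equation 1: T(thin) / T(racer)
with_picky    = Rational(11, 10) # equation 2: ratio once Picky is added

t_racer = Rational(1)            # our unit of time
t_thin  = thin_vs_racer * t_racer

# Solving equation 2 for T(picky):
#   T(thin) + T(picky) == 1.1 * (T(racer) + T(picky))
#   T(thin) - 1.1 * T(racer) == 0.1 * T(picky)
t_picky = (t_thin - with_picky * t_racer) / (with_picky - 1)

puts "T(picky) = #{t_picky.to_i} * T(racer)"

# Plugging the result back into equation 2 as a sanity check.
check = (t_thin + t_picky) / (t_racer + t_picky)
puts "ratio with Picky: #{check.to_f}"
```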

Multiple processes

In the Ruby web app world, to get more speed, we usually run more processes.

As Racer cannot yet accept on file descriptors, I am going to use the HTTP load balancers Pen and Nginx and see how they fare on my 2 core MBP.

Pen (with Racer)

Compl. 1:	1993	= (2140 + 1915 + 1901 + 2142 + 1869)/5	1370 (1 core)
Compl. 2:	1696	= (1798 + 1735 + 1631 + 1644 + 1673)/5	1134 (1 core)
Compl. 3:	1490	= (1256 + 1546 + 1541 + 1542 + 1565)/5	1094 (1 core)

Certainly a good result, and plausible since it is not 2x as fast.

Nginx (with Racer)

Compl. 1:	2048	= (2078 + 1993 + 1790 + 2177 + 2203)/5	1370 (1 core)
Compl. 2:	1765	= (1660 + 1843 + 1830 + 1684 + 1808)/5	1134 (1 core)
Compl. 3:	1489	= (1549 + 1456 + 1463 + 1473 + 1503)/5	1094 (1 core)

Nginx seems to be a bit more speed-stable than Pen, but otherwise in the same ball-park.

Sacrificing flexibility

A high priest of speed approaches us to remind us of a good rule:

To gain speed, one must often sacrifice an abstraction layer and its inherent flexibility. Evaluate if this flexibility is needed, and if not, sacrifice without remorse.

The question here is: Do we really need the routing etc. capabilities of Sinatra? (while still keeping the abstraction given to us by Rack)

Let’s assume we don’t and rewrite our app a bit. To remove Sinatra, we simply do not inherit from Sinatra::Base and install a #call method on our class.

# Prepare a few pseudo-constants.
#
query_string = "QUERY_STRING".freeze
result_array = [200, { "Content-Type" => "text/html" }, []]
regexp       = /\Aquery=([^&]+)&ids=([^&]+)&offset=([^&]+)\z/

# Define #call method.
#
define_method :call do |env|
  # Extract relevant parameters.
  #
  _, query, ids, offset = *env[query_string].match(regexp)
  results = books.search query, ids || 20, offset || 0
  
  # Put together result.
  #
  result_array[2][0] = results.to_json
  
  result_array
end

Note that we manually extract the parameters from the query_string, and thus reduce the work done to only what we actually need. We don’t need routing or any other processing.

However, we now can only call our app with a strictly ordered query string (and lose the flexibility afforded to us by Sinatra):

?query=S&ids=N&offset=M

(However, we still get Rack-conformant data)
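To see the trade-off in isolation, here is a self-contained sketch of such a bare #call object. It needs neither Rack nor Picky installed, since Rack is only a calling convention. BareSearchApp and fake_search are hypothetical names, and the search is stubbed out with a lambda. Unlike the snippet above, this version answers with a 400 when the query string does not have the expected shape, and it does not URL-decode the query:

```ruby
require 'json'

class BareSearchApp
  # Strictly ordered: query, then ids, then offset.
  PATTERN = /\Aquery=([^&]+)&ids=(\d+)&offset=(\d+)\z/

  def initialize(search)
    @search = search
  end

  # The Rack calling convention: env in, [status, headers, body] out.
  def call(env)
    match = PATTERN.match(env['QUERY_STRING'].to_s)
    unless match
      return [400, { 'Content-Type' => 'text/plain' }, ['Bad query string']]
    end

    results = @search.call(match[1], match[2].to_i, match[3].to_i)
    [200, { 'Content-Type' => 'application/json' }, [results.to_json]]
  end
end

# Stand-in for the real search, only echoing its arguments.
fake_search = lambda do |query, ids, offset|
  { query: query, ids: ids, offset: offset }
end

app = BareSearchApp.new(fake_search)

ok_status, _headers, ok_body = app.call('QUERY_STRING' => 'query=a&ids=20&offset=0')
bad_status, = app.call('QUERY_STRING' => 'ids=20&query=a&offset=0')

p ok_status, ok_body.first
p bad_status
```

The second call shows the lost flexibility: the same parameters in a different order are rejected.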

We run it the exact same way as the Sinatra app:

run BookSearch.new

(We can do this since we still use the abstraction defined by Rack)

Removing Sinatra

Let’s see how our no-Sinatra approach turns out and compare:

Compl. 1:	3972	= (3855 + 3900 + 4203 + 3574 + 4329)/5	2048 (Sinatra)
Compl. 2:	2295	= (2246 + 2352 + 2337 + 2294 + 2245)/5	1765 (Sinatra)
Compl. 3:	1173	= (1157 + 1157 + 1155 + 1166 + 1232)/5	1489 (Sinatra)

Quite breathtaking, especially in the low complexity case!

Let’s calculate again a bit. We know that:

1:	T(picky + sinatra) == 9*T(racer) == 1/2000 (roughly)
2:	T(picky) == ?*T(racer) == 1/4000 (roughly)

Rewriting:

T(picky + sinatra) == 2*T(picky)

from 1, 2.

This was easier!

From this we see that Sinatra takes as much time as does Picky in the low complexity case. For the highest complexity, Sinatra takes about 30% of the time that Picky takes.

Conclusion

Given that we want speed, and only speed: Knowing that Sinatra and Picky each take about 4.5x the time that Racer does – is it prudent to try many fast servers, or should one simply not use Sinatra?

We arrive at:

Which app server to choose is not as relevant as deciding whether to use Sinatra.

Surprised?

Note (especially to Sinatra fans): Remember, this is always under the assumption that speed is the ultimate goal, and that flexibility can be sacrificed.

However:

If the ultimate speed is what you need, choosing a fast server also becomes important.

That one is pretty obvious.

What if we go one step further?

Next up: Sacrificing Rack?

The big question is:

What happens when we give up the flexibility afforded by Rack?

Let’s say we were to rewrite Racer such that it would not call our app anymore with Rack-conforming data, but only with minimally processed data (e.g. we would not process the domain, but only extract the query string).

How fast can we get this thing? Please tune in to the next blog post, where we explore rewriting Racer for ultimate speed.

Footnote 1: The pinnacle of ultimate speed

To compare: How fast would this be without app servers?

Let’s first see how fast we can get: In pure Ruby

require 'benchmark'

p Benchmark.measure {
  5000.times {
    results = books.search 'a', 20, 0 # and "a* a", and "a* a* a", as above.
    results.to_json
  }
}

Running this on a single core yields us the following (rounded) numbers:

Complexity 1:	6250
Complexity 2:	3000
Complexity 3:	1500

Impressive.
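
If you would like a runs-per-second figure directly instead of reading it off Benchmark.measure, something like the following sketch works. I am assuming here that the numbers above are searches per second. runs_per_second is a helper name I made up, and the workload is a stand-in, since the books search lives elsewhere in this post:

```ruby
require 'benchmark'

# Runs the given block `times` times and returns how many runs per second it managed.
def runs_per_second(times)
  seconds = Benchmark.realtime { times.times { yield } }
  times / seconds
end

# Stand-in workload; replace it with the search and to_json calls from above.
rate = runs_per_second(5000) { (1..20).map(&:to_s).join(',') }
puts rate.round
```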

Footnote 2: Results

FYI, these are the JSON results Picky put together for each HTTP response:

a:

{"allocations":[["books",18.439999999999998,74,[["author","a","a"]],[4,7,8,11,18,38,48,51,55,80,97,108,117,119,125,126,132,134,138,140]]],"offset":0,"duration":0.000163,"total":74}

a*-a:

{"allocations":[["books",9.872,36,[["author","a*","a"],["title","a","a"]],[4,7,8,11,18,38,48,51,55,80,117,119,132,134,138,142,165,184,227,239]],["books",6.568,262,[["title","a*","a"],["title","a","a"]],[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]]],"offset":0,"duration":0.00019,"total":36}

a*-a*-a:

{"allocations":[["books",15.44,36,[["author","a*","a"],["title","a*","a"],["title","a","a"]],[4,7,8,11,18,38,48,51,55,80,117,119,132,134,138,142,165,184,227,239]],["books",9.872,36,[["title","a*","a"],["author","a*","a"],["title","a","a"]],[4,7,8,11,18,38,48,51,55,80,117,119,132,134,138,142,165,184,227,239]],["books",6.568,262,[["title","a*","a"],["title","a*","a"],["title","a","a"]],[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]]],"offset":0,"duration":0.000226,"total":36}
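
If you want to pick such a response apart in Ruby, here is a small sketch using the first response above. The hash keys (index, score, count, categories, ids) are my reading of the array positions, not official names, and summarize is a made-up helper:

```ruby
require 'json'

# The first response from above, unchanged.
RESPONSE = '{"allocations":[["books",18.439999999999998,74,[["author","a","a"]],' \
           '[4,7,8,11,18,38,48,51,55,80,97,108,117,119,125,126,132,134,138,140]]],' \
           '"offset":0,"duration":0.000163,"total":74}'

# Turns each allocation array into a hash with readable keys.
def summarize(json)
  data = JSON.parse(json)
  data['allocations'].map do |index, score, count, combinations, ids|
    {
      index: index,
      score: score,
      count: count,
      categories: combinations.map(&:first), # e.g. ["author"]
      ids: ids
    }
  end
end

allocation = summarize(RESPONSE).first
puts allocation[:categories].inspect # => ["author"]
puts allocation[:ids].size           # => 20
```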

Picky Statistics Interface

2012-07-02T00:00:00+10:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This post is about a fun statistics interface I’ve been working on, including a video. Download 4.5.2+, and enter this in your preferred shell.

picky stats path/to/log/file.log

This will tell you this:

Logfile path/to/log/file.log found.
Clam, Picky's friend, is looking at Picky's logfile
path/to/log/file.log
and showing results on port 4567.
== Sinatra/1.3.2 has taken the stage on 4567 for development with backup from Thin
>> Thin web server (v1.3.1 codename Triple Espresso)
>> Maximum connections set to 1024
>> Listening on 0.0.0.0:4567, CTRL+C to stop

Then, in another shell, enter

open localhost:4567

(on OSX) and have fun!

Video Demo

See this short video (it’s best to full-screen it):

The interface uses this great JS lib: http://square.github.com/crossfilter/. Check it out :)

Interface Usage Ideas

Slice and dice your data:

What queries are slowest?
Are they suspiciously slow in the morning?
How many return more than 1 allocation?
Do more allocations also mean slower queries? Or more results?

Etc.

Visual Programming 1

2012-06-25T00:00:00+10:00

Have you ever heard of visual programming?

It’s about “writing”, or rather, drawing programs by using a graphical language, where programs aren’t codified as text, but mostly using boxes and lines, with only little text.

It’s one of my favorite hobby horses.

Designing a visual programming language

As an exercise, I’d like to design a visual programming language top-down. By top–down I mean that first we are going to sketch out the visual interface, and how it would work.

We will be looking at whether we can interface the program with its running counterpart, to display debugging information and other information normally hard to look at. Also, can we make a running program changeable, so we can play with it, as Bret Victor has shown in one of his latest presentations, Inventing on Principle?

If the design proves feasible, we will start working on an implementation. I’d like to do it in Ruby, but I get a strong feeling that Haskell would be far more suited. Why? Because it’s a purely functional language. But more on that later. We’ll see.

Dataflow programming

Let’s look at some ideas. Dataflow programming is an often used paradigm in visual programming environments where graphics are processed. In dataflow programming, all operations on data are represented by boxes with multiple inputs. If all inputs are “ready”, ie. have valid data to offer, the operation will run.

One famous example of this is Quartz Composer. If you haven’t seen it and own OSX, you should type cmd-space quartz composer right now and have a look. It’s great to play with. Some people even used it to make a 24-hour music video stream once… (ah, the memories).

The data in dataflow programming comes from various sources, always positioned on the left hand side, and flowing to the right, where the sinks are. A sink might be a display of some sort, or a loudspeaker.
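
To make the "runs when all inputs are ready" rule tangible, here is a minimal sketch in Ruby. It is not how Quartz Composer works internally. Node, connect and receive are names I made up for illustration:

```ruby
# A box with named inputs. It runs its operation as soon as every input has a value.
class Node
  def initialize(input_names, &operation)
    @input_names = input_names
    @operation   = operation
    @inputs      = {}
    @sinks       = []
  end

  # Wires this node's output to an input of another node further to the right.
  def connect(sink, input_name)
    @sinks << [sink, input_name]
    self
  end

  # Data arrives from the left. The node fires once all inputs are ready.
  def receive(input_name, value)
    @inputs[input_name] = value
    fire if ready?
  end

  def ready?
    @input_names.all? { |name| @inputs.key?(name) }
  end

  private

  def fire
    result = @operation.call(*@inputs.values_at(*@input_names))
    @sinks.each { |sink, input_name| sink.receive(input_name, result) }
  end
end

shown   = []
display = Node.new([:value]) { |value| shown << value }   # the sink
adder   = Node.new([:left, :right]) { |left, right| left + right }
adder.connect(display, :value)

adder.receive(:left, 1)   # nothing happens yet, :right is missing
adder.receive(:right, 2)  # now the adder fires and the display receives 3
puts shown.inspect        # => [3]
```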

This language

However, in my imagination, programs look more like trees, or a forest (a set of trees), that are traversed according to rules set in the nodes.

“Huh?”, you wonder, “Tree? What?”. Consider this piece of Ruby code:

def this_or_that thing
  if thing
    this thing
  else
    that thing
  end
end

Let’s assume the tree grows from the left to the right.

This method would be represented by a node, with two branches growing to the right. One branch would be traversed if the thing wasn’t nil or false, and the other if it was.

In fact, the this_or_that method could be considered a tree itself, with a root and two branches, that can be grafted onto other trees. The newly formed tree would represent a new program.

So instead of flowing from left to right, a – let’s say – turtle would traverse the tree down to its branches, and return, following instructions along the way.

Going right would represent function calls, and left returning from them.
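
As a rough sketch of that turtle, here is this_or_that as a tiny tree in Ruby. Leaf, Branch and walk are names I invented for this illustration, and the returned path simply records where the turtle has been:

```ruby
# A leaf is a plain call such as `this` or `that`.
Leaf = Struct.new(:name) do
  def walk(_thing, path = [])
    path << name
  end
end

# A branch picks one of its two subtrees, depending on the thing.
Branch = Struct.new(:name, :if_truthy, :if_falsy) do
  def walk(thing, path = [])
    path << name                                      # going right: the call
    (thing ? if_truthy : if_falsy).walk(thing, path)
    path << "back in #{name}"                         # going left: the return
  end
end

THIS_OR_THAT = Branch.new('this_or_that', Leaf.new('this'), Leaf.new('that'))

THIS_OR_THAT.walk(:something) # => ["this_or_that", "this", "back in this_or_that"]
THIS_OR_THAT.walk(nil)        # => ["this_or_that", "that", "back in this_or_that"]
```

Since a Branch accepts any node as a subtree, grafting this tree onto another one is just passing THIS_OR_THAT in as a branch.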

The tree metaphor is fitting in other ways. Perhaps you’ve seen the fantastic video series by Abelson and Sussman? In the first lesson they talk about interfaces. Let’s say you collapse a subtree of a program – if you then are looking at a useful set of inputs, you know your interface is fine. If you still understand the program when collapsing a subtree, chances are you did a good job of designing the interfaces.

Let’s look at a quick example. The program is about cleaning your house.

(The instructions for Clean Kitchen and Clean Garage are collapsed)

If you collapsed everything that was “inside” Clean Bedroom, ie. Clean Bed, Clean Floor, Clean Windows, would you still understand the program itself?

Yes you would! Whoever is going to do the actual work on cleaning the bedroom would need the detailed instructions, but you know that one of the things that will be cleaned will be the bedroom (well, assuming your kid is actually doing the work – if it’s the compiler, the work will be done). You still understand the program itself: Possibly well designed.
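
Collapsing is easy to play with in code, too. Here is a small sketch. The Task class is made up, and the subtasks of the kitchen and the garage (Clean Sink, Sweep Floor) are invented, since they are collapsed in the example anyway:

```ruby
# A node in the cleaning program. Collapsed nodes hide their subtasks.
class Task
  attr_reader :name, :subtasks
  attr_accessor :collapsed

  def initialize(name, subtasks = [], collapsed: false)
    @name      = name
    @subtasks  = subtasks
    @collapsed = collapsed
  end

  # Returns the visible program as indented lines.
  def outline(depth = 0)
    marker = collapsed && subtasks.any? ? ' [...]' : ''
    line   = ('  ' * depth) + name + marker
    return [line] if collapsed

    [line] + subtasks.flat_map { |task| task.outline(depth + 1) }
  end
end

def house
  Task.new('Clean House', [
    Task.new('Clean Kitchen', [Task.new('Clean Sink')], collapsed: true),   # invented subtask
    Task.new('Clean Bedroom', [
      Task.new('Clean Bed'), Task.new('Clean Floor'), Task.new('Clean Windows')
    ]),
    Task.new('Clean Garage', [Task.new('Sweep Floor')], collapsed: true)    # invented subtask
  ])
end

program = house
program.subtasks[1].collapsed = true # collapse Clean Bedroom as well
puts program.outline
```

With the bedroom collapsed, the outline is four lines long and still tells you what the program is about.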

You can already see that naming the nodes is very important, and that it might be easier designing APIs using a visual programming language than text.

So far

Without really designing anything, we already made a lot of assumptions on how this language is going to look.

Let’s just throw our example element up there:

That looks ok. There’s an input and two function calls. We’ve come surprisingly far already, since before looking at simple components, we’ve already built our more complex component.

Did you notice anything? Do we have to use constrained method names like (does_it_exist?)? Do we have to use true or false? Is the underlying language in any way important on how we name things?

No, no, and no: Luckily this is not important anymore. You ask: “How is this possible, when this is incredibly important in my favorite programming language?” Magic?

Possibly next

Knowing a bit about visual programming languages:

How would you design a for loop? Would it be easy to do? (My guess: Hard.) Is it needed? (No.)
How about a map operation? (Easy)
How would you combine functions? (Super easy)
What does it mean to have a higher-order function? (Uh-oh)

Business Cards

2012-06-20T00:00:00+10:00

Last year, Kaspar Schiess and I opened our own business: Technology Astronauts! This May, we started in earnest. Hooray!

(Wait until crowd’s cheers calm down)

However, we found ourselves too many times short one business card when friendly exchanges were in order. Only one way to rectify this:

Technology Astronauts Business Cards

We opted to go for a very simple, striking high-contrast xylography design with a modern font and subjects that are indirectly or directly related to the rough business of astronauting.

First, the moon:

In the northern hemisphere, our main area of operations (apart from, you know, space), a moon seen in this configuration is filling up, mere days away from blasting earth with photons. Its craters remind us of its resilience towards hits, its striking character and its ability to accumulate new material without changing too much: Moon stays moon, born from the earth itself.

Second, coding astronaut:

Joel Spolsky for example sees the astronaut as running out of oxygen.

I disagree: The astronaut represents the ultimate in human engineering achievement. Astronauts themselves operate in an extreme environment, focusing on a specific number of tasks and performing above the rest of us.

This is fairly standard in business cards. Where do we differentiate ourselves?

“Put a face on it”

I haven’t found any business cards with faces on them, but there must be some out there!

In any case, after dozens of conferences and business meetings I have accumulated about a hundred business cards. Looking at them, I can’t remember the person behind them, if I don’t communicate regularly.

Up to about three weeks after a meeting a face is fresh in my mind. After that? Not so much. If the business card has been handed to me with an accompanying anecdote, I can remember.

Now, this might be just my brain. But chances are, this might be your brain and memory as well.

To bring it up to speed much faster, we decided to “put a face on it”:

This is me laughing uselessly into space. But: We believe that this jogs your memory much better than just mere text. How do you like it?

Say “Hi!” either to Kaspar or me if you want one too :)

P.S: We like it even better if you say “Hi, I have this fantastic project for you!”.

P.P.S: The “Put a face on it” is a reference to this YouTube episode of Portlandia.

P.P.P.S: Joel, you know who is much more about oxygen than astronauts? Divers.

Guest Post: Chris Corbyn of Flippa

2012-03-20T00:00:00+11:00

This is a great guest post by Chris Corbyn where he explains the search engine journey undertaken by Flippa and the decisions behind it.

Intro

(Later sections written by Chris Corbyn)

Wondering why we developers don’t talk much more about search engine design, I asked on twitter:

App developers with searches: Am I the only one here who thinks search design should be much much more important than it is? I’m interested!
— Florian Hanke (@hanke) March 3, 2012

Subsequently, a few discussionlets developed with other people also interested in search engine design: @ezkl, @_tomash, @manfreds and last but not least @d11wtq. Thanks all!

The man behind the curious pseudonym @d11wtq was Chris Corbyn, who took the time to respond in full on the design of Flippa, “The #1 Marketplace for Buying and Selling Websites”, where the search engine takes center stage.

With his gracious permission I am reprinting his email in full.

In Chris’ words:

“The motivation behind putting a focus on search

At Flippa, we’re currently up to our 3rd implementation of search and we consider it hugely important to the success of our business. We’re something along the lines of an eBay platform, but built purely for buying and selling websites. If buyers cannot find what they are looking for, we quickly lose those users, and if we don’t have buyers on the site, logically we lose the sellers who market to them. I still think we have a lot of room to improve, but when we look back at our first implementation, we have come a long way. I guess we have come to learn over time just how important search is, rather than it being something that was apparent to us right from the day we launched Flippa (three years ago).

The first implementation

When we built Flippa, we knew we needed a search, but the scope of this was simply something that had to exist so that users could find listings by keywords. It was not something that was well-integrated with the rest of the application. We used Solr (as was the fashion at the time) and search was just a “side feature” that was often forgotten about. Users could enter a keyword and get a set of results in a listing format entirely different from the layout we use when browsing our listings via the primary navigation. Users regularly complained, with reason… our search was more or less useless for their needs, which were far more complex than matching on keywords.

The second implementation

Acknowledging that users needed to be able to search on a range of metrics and that we needed to make some rather substantial changes to our search infrastructure, we sat down to discuss what our end goals were. Our users are interested more in raw numbers, than in text (e.g. they search for websites based on revenue, on page views, on alexa rank, etc). Users also wanted the ability to put together a custom search and save it to the database, so that when they returned to the site they could easily repeat a previous search.

We decided that we effectively needed to build a complete model around our search system, providing all the criteria our users would search on, in such a way that a fully-built set of criteria would be saved into the database for re-use. We also wanted to integrate our primary navigation with this search system. I don’t remember what the driving force was behind using the same system for our primary navigation, but I suspect it was mostly about unifying the UI and the underlying model code, in addition to improving our categorization of listings—sellers could previously specify if their website was “high end” or “turnkey”, for example, but with this new search system we could determine such things by looking at the numbers, in realtime.

We dropped Solr in favour of Sphinx, for two reasons:

Indexing time with MySQL was considerably faster.
It provided SphinxSE, which is a plugin for MySQL, allowing the index to be queried through MySQL.

We built an advanced search library around Sphinx, allowing us to compose searches from a selection of pre-defined criteria, which were all exposed in the UI through our advanced search page. Because of the MySQL integration, internally searches became a combination of full-text index querying + an INNER JOIN to our listings table in MySQL. A sort of hybrid of MySQL and Sphinx full-text querying. Searches could be saved to the database, though regrettably, as entire serialized objects. We actually stored our primary navigation options this way in the database too. This turned out to be a big mistake when it came to data portability.

From our users’ perspective, the search capabilities were good, but it was too difficult to use. We had tried to provide all the options they could ever need, but the end result was that there were too many options, some of which seemed ambiguous and confusing. Additionally, every time we changed the name of a search field, the serialized objects saved into the database broke, and the migration procedure was much more complicated than it should have been.

Users also found it difficult to “narrow down” their search, since it wasn’t clear what impact changing an input in the advanced search would have on the size of the result set, without performing that search.

At the time we built this particular search implementation, the only way to index your data with Sphinx was to rebuild the entire index. Fortunately this only took about 20 seconds or so in our case (Sphinx is good at doing this stuff efficiently). Though since we wanted close-to-realtime results (as our data changes practically every second or two), we were re-indexing the entire dataset every minute via cron, which was adding some strain to our database servers.

The third (and current) implementation

Generally we are happy with what we have now, but we do have some things planned for further improvements.

When we rewrote search this time around, beyond our desire to improve the underlying code internally, we wanted to make it easier for users to “visualize” the data as they browsed. Since users were searching primarily on factors such as revenue and page views, we set about building a faceted search designed to allow click-by-click drill-down of the results, where the facets always show how many results you’ll get if you click on them. The facets would be displayed all the time, no matter what you were searching for. This presented some challenges, since now instead of executing a single query per search, we had to execute something like 20 queries.

Like the previous implementation, we used Sphinx—albeit a newer version with support for realtime indexes and multi-queries, which is how the facets are able to execute efficiently. We also retained the idea of having our primary navigation hooked into our search system. This had worked well for us in the previous implementation. We ditched SphinxSE due to the complexity it added to our server infrastructure and the fact we wanted to use multi-queries in Sphinx, which would not work efficiently through MySQL. While we still use the search system for our primary navigation (which means you’ll always have facets down the side of the page), we stopped storing these in the database and simply have them formalized in code. This makes tweaking them simpler, since it’s a code edit, not a data migration. We also built a proper schema for saving searches, instead of being lazy and serializing objects to the database (the benefits of which, probably do not require further explanation).

Since the primary complaint with our previous implementation, from the user experience perspective, was that it was too confusing to use, we spent a considerable amount of time assessing what options we were providing to users via our advanced search page. It was overly complicated and ambiguous in places. As a result, we decided to either remove search options entirely, or combine them together, thus greatly simplifying the UI for our users. I believe at the same time, we added new options, but the end result was still simpler. Part of this change, however, was designed to draw the focus away from the advanced search and more towards our pre-defined facets, which suit the needs of most casual users browsing the site.

The feedback we’ve had regarding our current search has been extremely positive. Many of our listings are at the low end of the scale, which many buyers are not interested in. Now buyers are able to quickly filter these out simply by clicking on the facets, directly from our primary navigation options. This is something we were aiming to achieve… we’re looking to encourage more quality listings, so making it easier for buyers to reach these listings and hide the 3-day-old WordPress blogs solves this.

All the code is custom-written in PHP (parts of our site are written in PHP, other parts are written in Ruby). We’ll likely be porting this code to Ruby at some point, though we need a Sphinx gem that supports the features we’re using from Sphinx 2, and Pat Allan’s Riddle gem doesn’t offer this just yet. We may end up writing this ourselves.

Some things we have built around our search

From any of our primary navigation options, you may click “Advanced” at the top of the facets, to load the advanced search page with the inputs used to execute the search for that navigation option, either for inspection, or to modify them.
We have a JSON search API, available only on request, used by third-parties who analyze our data for use on their own websites.
Users can have the results of a search emailed to them on a daily basis. This simply loads the search from the database and executes it via a background job.
Some smaller features, such as watching certain tags and sellers use the search internally.

Where to next?

We have some things on the agenda for future improvements to our search, though nothing quite as major as our previous iterations. There are some internal optimizations we can certainly make, such as having an effective caching strategy (though cache invalidation is hard). We also have some changes planned that focus on tailoring the search according to the region of the user, though I can’t go into details on this. All in all, we think we’re getting there!”

Thanks / Guest Posts

Many thanks to Chris! Please post feedback right here or send to Chris’ Twitter.

If you don’t have a blog but are interested in writing a guest post in roughly these areas: Ruby, Framework Design, Search Design or similar, please contact me.

Normalizing Indexed Data

2012-03-16T00:00:00+11:00

A quick blog post on a Picky tokenizer option.

Intro / Problem

On mobile devices it can be a bit annoying to enter special symbols, like +, or &, and it would be easier to just enter plus, or and.

Or maybe there are a lot of abbreviations, like abbrev, or e.g., but you’d still like to find the item when searching for abbreviation, or example.

Or maybe you’d like number 1 to be findable with one.

In the search engine domain, this is one part of text normalization, the examples being expanding abbreviations and converting numbers.

In Picky, this is done using the tokenizer option normalizes_words.

Tokenizer option “normalizes_words”

This option makes the tokenizer normalize words before indexing them.

The usage is very simple. Just pass a 2d array of regexps and replacement terms into the normalizes_words option, like so:

index = Picky::Index.new :normalized do
  indexing normalizes_words: [
    [/\+/, 'plus'], # + -> plus
    [/\&/, 'and'], # & -> and
    [/\bw\//, 'with'], # w/ -> with (a literal w, not \w, which would match any word character)
    [/abbr(ev)?/, 'abbreviation'], # abbr, abbrev -> abbreviation
    [/e\.g\./, 'example given'] # e.g. -> example given (note that the . have to survive)
  ]
end
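
To get a feeling for what such regexp/replacement pairs do to text, here is a plain Ruby sketch that applies them one after the other. This is not Picky's tokenizer code, just an illustration. The normalize method is a made-up name, and I added word boundaries to the abbreviation rule so that an already expanded "abbreviation" is left alone:

```ruby
# Pairs of [regexp, replacement], in the same shape as the normalizes_words option.
RULES = [
  [/\+/,            'plus'],          # + -> plus
  [/&/,             'and'],           # & -> and
  [/\bw\//,         'with'],          # w/ -> with
  [/\babbr(ev)?\b/, 'abbreviation'],  # abbr, abbrev -> abbreviation
  [/e\.g\./,        'example given']  # e.g. -> example given
]

# Applies every rule in order and returns the normalized text.
def normalize(text, rules = RULES)
  rules.reduce(text) do |result, (regexp, replacement)|
    result.gsub(regexp, replacement)
  end
end

normalize('rock + roll')      # => "rock plus roll"
normalize('fish w/ chips')    # => "fish with chips"
normalize('abbrev list')      # => "abbreviation list"
normalize('abbreviation')     # => "abbreviation"
```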

Note that

stopwords
case
character removal
character replacement

are specifically handled in options

stopwords: /\b(word1|word2|...)\b/
case_sensitive: true/false
removes_characters: /[characters]/
substitutes_characters_with: Picky::CharacterSubstituters::WestEuropean.new

and should be handled there.

Alternatives

What if this doesn’t work for you?

No problemo! Picky is all Ruby, so feel free to either monkey patch, or probably better: Preprocess the data to your heart’s content.

Have fun!

CocoaPods Search Design

2012-03-01T00:00:00+11:00

You probably have heard of CocoaPods, an Objective-C library dependency manager. The project was initiated by Eloy Durán.

Let me tell you it’s good stuff!

Intro

This post is about designing a search engine for CocoaPods. I’m using Picky for it, with moderate modifications.

Chances are you know RubyGems. CocoaPods uses a slightly different approach, one I personally find very elegant: After creating a podspec (similar to a gemspec), you ask for it to be included in the central repository via a pull request. If it is accepted, from then on you get commit rights to push other pods.

Since I think the rubygems search is too slow, and not very impressive, I tried to make the CocoaPods search an example of how such a search should be designed. Try it! :)

(Note: I’m not just criticizing, but also putting code where my mouth is regarding the rubygems search – try my alternative take on it and read about it here)

Many ideas for the CocoaPods search come from the old gem search alternative, but a few features are new, compiled in the…

Highlights

Automagic index updates via Github post receive hooks
Making composite names (e.g. BlocksKit) searchable
Advanced: Invisible filtering by OS
Advanced: Removing duplicates from results
Fun things to try!

Automagic index updates via Github post receive hooks

The challenge was to have Picky automatically update the search index without restarting, and without polling.

The fact that the CocoaPods specs live in their own repository is fantastic – it means that we have the full power of Github’s repo features at our disposal.

The feature we use is post receive hooks. Every time someone pushes a new spec, or updates a spec, the search engine sinatra app is notified via a garbled URL, as follows:

post "/my_example_hook_url/#{ENV['GARBLED_HOOK_PATH']}" do
  # index updating code here
end

Every time this URL is called, Picky downloads the zip file from github, unzips it, and indexes the loaded specs. All while running. That’s it.

HOLD ON!, you say, why don’t you just do a git pull? I wish I could. But currently, Heroku doesn’t allow git pull, or tar, or gunzip. So currently, the search engine always downloads the zip file.

Making composite names searchable

Pod names do not use spaces but are camelcased, e.g. “BlocksKit”. Like most search engines, Picky would index this as one word.

Another issue with pod names is that authors sometimes prepend their initials to it. So, for example, “Mocky” would actually be called “LRMocky”.

However, getting back to the “BlocksKit” example, we want people to be able to find it when they type blocks kit, or just kit.

In Picky lingo: If the data contains "BlocksKit", how do we index it as "BlocksKit Blocks Kit"?

Turns out there is a snazzy Ruby regexp for that:

"BlocksKit".split /([A-Z]?[a-z]+)/ # => ["", "Blocks", "", "Kit"]

Nice, eh? As a bonus, it works fine with numbers :)

The Pod model offers a prepared_name method, using the above split, returning "BlocksKit Blocks Kit", which Picky uses for the name category and consequently indexes all three words.

category :name,
         similarity: Similarity::DoubleMetaphone.new(2),
         partial: Partial::Substring.new(from: 1),
         qualifiers: [:name, :pod],
         from: :prepared_name # <= from: indicates which (data) method to call in the source object

Try it with dynamic delegate! :)
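
In case you are wondering what a prepared_name along these lines could look like: here is a minimal sketch, written as a standalone method rather than on the Pod model. It is my reconstruction from the description above, not necessarily the code that runs on cocoapods.org:

```ruby
# Returns the original name followed by its camelcase parts,
# so that all of them end up in the index.
def prepared_name(name)
  parts = name.split(/([A-Z]?[a-z]+)/).reject(&:empty?)
  ([name] + parts).uniq.join(' ')
end

prepared_name('BlocksKit') # => "BlocksKit Blocks Kit"
prepared_name('LRMocky')   # => "LRMocky LR Mocky"
prepared_name('Kiwi')      # => "Kiwi"
```

The uniq is there so that a single-word name like Kiwi is not indexed twice.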

Filtering by OS

This is a more advanced Picky trick, which might only be interesting to pros.

Like Ruby gems, pods can run on multiple OSs: On iOS and/or on OS X.

We always want to filter by either both (AND), iOS, or OS X. This means we always prepend the platform filter to the query like so: "on:some_platform rest of the query".

This is problematic since it uses a lot of input field space, and also confuses the user.

We would like to not show the OS in the search field, but use the value from the iOS style radio buttons.

Picky helps us by offering multiple JS callbacks. If you copy a search link like http://cocoapods.org/?q=on:osx%20Kiwi into the URL bar, Picky runs a few JS callbacks, in the following order:

beforeInsert(query) // Before inserting the query into the search field.
before(query, params) // Before sending the query back to the server.
after(data, query) // After receiving the query back, before rendering.
success(data, query) // After the view/results have been updated.

(data is the JS PickyData object)

We need both beforeInsert and before.

In beforeInsert, we remove the os part, before it is inserted into the search field. In before, before sending it to the backend, we add the OS back into the query, taken from the radio button value.

In code (the Picky JS search client options), it looks like this:

// Before a query is inserted into the search field
// we clean it of any platform terms.
//
beforeInsert: function(query) {
  return query.replace(platformRemoverRegexp, '');
}

The regexp to remove the platform search term looks like this:

var platformRemoverRegexp = /((platform|on)\:\w+\s?)+/;

And before sending the search request to the backend, Picky calls the before callback where we remove any OS parts, prepending the selected one (the iOS style radio buttons have the values on:ios on:osx, on:ios, and on:osx).

before: function(query, params) {
  query = query.replace(platformRemoverRegexp, ''); // Clean the query.
  var platformModifier = platformSelect.find("input:checked").val(); // Get the selected OS.
  return platformModifier + ' ' + query; // Prepend it to the query.
}

However, the complete query, including the OS is still inserted into the URL, ready for you to copy and send to friends.

5 lines of nicely customizable code :)

Removing duplicates from results

This is another more advanced Picky trick, which might only be interesting to pros.

I often get requests on how to remove duplicates from search requests.

Why are there duplicates in Picky’s search results anyway?

Picky returns categorized search results. For example, it might deem the combination of categories "first_name", "last_name" more important, before all search results found in the categories "street", "last_name". But this also means that the same entry can be contained in both combinations of categories!

Many Picky users just use results.ids to extract a list of ids. To get the list of ids, Picky goes through the results in each combination of categories and extracts the ids. This means that Picky may well return [1,3,1,2,3], with results 1 and 3 occurring twice.

Since cocoapods.org only wants to show an uncategorized list of result pods, we wish to remove duplicates to not confuse searchers.

We achieve this by using Picky’s JS success callback. This goes through all combinations of categories (aka allocations) and removes entries from the allocations if we’ve already seen them previously. It ensures we only see unique results.

// We filter duplicate ids here.
// (Not in the server as it might be
// used for APIs etc.)
//
success: function(data, query) {
  var seen = {};
  
  var allocations = data.allocations;
  allocations.each(function(i, allocation) {
    var ids     = allocation.ids;
    var entries = allocation.entries;
    var remove = [];
    
    ids.each(function(j, id) {
      if (seen[id]) {
        data.total -= 1;
        remove.push(j);
      } else {
        seen[id] = true;
      }
    });
    
    for(var l = remove.length-1; 0 <= l; l--) {
      entries.splice(remove[l], 1);
    }
    
    allocation.entries = entries;
  });
  
  return data;
}

We could well do this in the server, but I opted against it, because a possible future search API might want to expose the duplicate results. This is why we do it in the client.
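
Should you want to do it in the server after all, the same idea fits in a few lines of Ruby. A minimal sketch: allocations are simplified to plain arrays of ids here, which is not what the real Picky results object looks like, and remove_duplicates is a name I made up:

```ruby
require 'set'

# Keeps only the first occurrence of each id, going through the
# allocations in order of importance.
def remove_duplicates(allocations)
  seen = Set.new
  allocations.map do |ids|
    ids.select { |id| seen.add?(id) } # add? returns nil if the id was already seen
  end
end

remove_duplicates([[1, 3], [1, 2, 3]]) # => [[1, 3], [2]]
```

If all you need is a flat list of ids, calling uniq on it does the same job: [1, 3, 1, 2, 3].uniq returns [1, 3, 2].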

Other fun things to try!

Search for anything and then click on a pod author name in the results.
Enter Luke 1.0 to get all pods written by a luke with version 1.0*.
Enter e.g. stacked and press each platform button to see what happens to the results.
Enter e.g. uses:json to see all pods which use a pod with “json” in their name.

Feedback

We’re very glad for feedback – shoot us a line at http://twitter.com/CocoaPodsOrg, or at http://twitter.com/picky_rb. Thanks!

Thanks also to the CocoaPods team for a great project!

Picky Active Record 3

2012-01-14T00:00:00+11:00

This post talks about integrating Picky directly into Rails/ActiveRecord.

(By the way, greetings from Rails Camp X Adelaide – come up and say hi if you are here!)

The last post illustrated a way of writing an active record integration. Still missing is index persistence.

However, in this post I’d like to talk about wrapping the last solution up into a nicer bundle.

Beautifying the last solution

Why? It contains a few advanced Ruby concepts and statements. While I think everybody should know about class << self and define_method, it can get kind of hard to read compared to a more declarative style that Tire (Elastic Search), Thinking Sphinx (Sphinx), or Sunspot (Solr) offer.

However: While I like the declarative style in many cases, some libraries hide away too much important code. Many times even code that is hugely important, or does things to your model which you only find out about after reading the library source. After a crash. In production.

Goals

So what I’d like is

have the important bits be visible and manipulable.
hide away boilerplate code that makes code harder to read.

And maybe most important:

use the standard Picky API

A quick reminder what the basic Picky API is:

data = Picky::Index.new :name do
  category :name
end
things = Picky::Search.new data
things.search 'something'

Most other search engine adapters try to elegantify the original API. This is nice.

However, having control over both APIs, I believe that using the original (standard) Picky API creates a pressure on it to stay as elegant as possible and as useable as possible.

If we hide away the Picky API, pressure is only exerted on the ActiveRecord/Picky adapter. This also means that people who only use the Picky ActiveRecord API only come in contact with that one.

Why is this a problem? This is a problem when people want to transcend the AR API to use for example the separate and specific Picky server. If the APIs look and feel fundamentally different, users will not willingly make this jump. In fact, many people then start looking for search engine alternatives. This is a bad thing. Let me put this in bold, because it gets violated so many times:

The jump from the simple API to the harder API should not be noticeable.

The only way to do this is to use a subset of the original API for the simpler one. However, since Picky is about giving you the power, we will not constrict you, but instead make the whole API accessible.

A first draft

I am not the biggest fan of the following pattern:

class Model < ActiveRecord::Base
  include Picky::ActiveRecord
  
  some_method_call_from_the_included_module
end

I am not sure why since it’s perfectly ok Ruby. I believe it is because it usually consists of two lines, and only one really describes what is going on: “I am using this” and “I am using it like this”.

With this subgoal in mind, I started drafting the API. It turned out like this:

class Model < ActiveRecord::Base
  extend(Picky::ActiveRecord.new(:models) do
    Picky::Index.new :models do
      category :name
      category :surname
    end
  end)
end

Don’t judge me. It gets better.

Why do I use so many round parentheses, having declared them unnecessary not so long ago?

Turns out, extend gobbles up my block. Try running the following code:

module A; end
class B
  extend A do
    # ...
  end
end

I am unsure what happens here. Looking at the CRuby code didn’t help. Ideas?
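One plausible explanation, based on Ruby's general block binding rules rather than on anything specific to extend: a do ... end block attaches to the outermost method call that has no parentheses. In the code above that call is extend, which accepts a block like any Ruby method and never runs it. The sketch below uses two made-up methods, outer and inner, to show where a block ends up in each style.

```ruby
# Hypothetical helper methods, only here to show where a block ends up.
def outer(value)
  [value, block_given?]
end

def inner
  block_given? ? :inner_got_block : :inner_got_nothing
end

# do ... end binds to the outermost call without parentheses: outer.
with_do = outer inner do
end

# Curly braces bind to the nearest call: inner.
with_braces = outer inner { }

# Parentheses around the argument keep do ... end on the inner call.
with_parens = outer(inner do
end)

p with_do      # => [:inner_got_nothing, true]
p with_braces  # => [:inner_got_block, false]
p with_parens  # => [:inner_got_block, false]

# extend takes the block without complaint and never calls it.
module A; end
class B
  extend A do
    raise 'never runs'
  end
end
p B.singleton_class.include?(A) # => true
```

If this reading is right, it also explains why the round parentheses in the first draft are needed: they keep the block with Picky::ActiveRecord.new instead of handing it to extend.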

I guess we can all agree that this API is neither good looking nor elegant. Let’s try again.

A better draft

So, teeth grinding, we return to the standard solution of having a separate include and declarations. However, I’d like to be able to use the Picky API.

This is what I’ve come up with:

class Model < ActiveRecord::Base
  include Picky::ActiveRecord
  
  index = Picky::Index.new :models do
    category :name
    category :surname
  end
  
  search = Picky::Search.new index
  
  updates_picky index
  searches_picky search
end

Let’s look at the design in detail.

In detail

First of all, note that no saving of indexes in instance variables is done. You can do it, should you need it, but Picky is not saving anything like @__picky_index for you. Instead, the index and the search are both passed into a method in which they are captured in a closure.

Let’s look at the API code.

The line

include Picky::ActiveRecord

includes two other modules, Picky::ActiveRecord::Indexing and Picky::ActiveRecord::Searching, that are concerned with indexing and searching, respectively. It is well imaginable that one doesn’t want realtime indexing, just searching, or vice versa.

index = Picky::Index.new :models do
  category :name
  category :surname
end

search = Picky::Search.new index

This is the standard Picky API. You create an index (definition) and pass it into the search.

The line

updates_picky index

tells this class to automatically update the given index whenever the after_commit callback fires.

This method can also be called as follows:

updates_picky :models

updates_picky

The first one uses the index called :models and the second one uses model_class.name.tableize to find the index name.

Finally, the line

searches_picky search

installs a Model.search method using the given search.
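To make the closure idea concrete, here is a minimal sketch of how two such methods could be written without storing anything in instance variables. It is not the actual Picky::ActiveRecord code. The module ClosureHooks, the FakeIndex and FakeSearch stand-ins, and the explicit committed! call (in place of ActiveRecord's after_commit) are all assumptions made so that the sketch runs on its own.

```ruby
# Hypothetical stand-ins so the sketch runs without Picky or ActiveRecord.
class FakeIndex
  attr_reader :entries
  def initialize
    @entries = {}
  end
  def replace(object)
    @entries[object.id] = object
  end
  def remove(id)
    @entries.delete id
  end
end

class FakeSearch
  def initialize(index)
    @index = index
  end
  # Returns the ids of all entries whose name contains the query.
  def search(query)
    @index.entries.values.select { |entry| entry.name.include? query }.map(&:id)
  end
end

# Hypothetical module, not Picky::ActiveRecord itself.
module ClosureHooks
  # The index is captured by the block; nothing like @__picky_index is stored.
  def updates_picky(index)
    define_method :committed! do |destroyed = false|
      destroyed ? index.remove(id) : index.replace(self)
    end
  end

  # Installs Model.search, again closing over the given search object.
  def searches_picky(search)
    define_singleton_method :search do |*args|
      search.search(*args)
    end
  end
end

Thing = Struct.new(:id, :name) do
  extend ClosureHooks
end

index = FakeIndex.new
Thing.updates_picky index
Thing.searches_picky FakeSearch.new(index)

Thing.new(1, 'picky').committed!
Thing.new(2, 'octopus').committed!
p Thing.search('pick') # => [1]
```

This naive version supports only one index per class, since a second updates_picky call would overwrite committed!. A real implementation would register one callback per index instead.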

Also of note

This API does not really care where anything is set up. This is well possible:

class Model < ActiveRecord::Base
  include Picky::ActiveRecord
end

# In e.g. initializers/picky.rb
#
index = Picky::Index.new :models do
  category :name
  category :surname
end
  
search = Picky::Search.new index
  
Model.updates_picky index
Model.searches_picky search

for the case where you’d like your search code outside the model.

Also, you can call updates_picky multiple times:

Model.updates_picky index
Model.updates_picky index2
Model.updates_picky index3

Any updates to the model will update each index.

Implementation

If you’re interested in the implementation, see the Picky::ActiveRecord module (code at the time of this writing).

Finally, you

Hope you like the API design series. The API is certainly turning out to be simple. Too simple? Who knows.

Opinions, ideas?

We still haven’t looked at index persistence. We save this for another blog post.

Picky Active Record 2

2012-01-13T00:00:00+11:00

This post talks about integrating Picky directly into Rails/ActiveRecord.

(By the way, greetings from Rails Camp X Adelaide – come up and say hi if you are here!)

In the last post we talked about a light active record integration. This has been implemented in the prototype and released in Picky 4.0.9.

By light integration we mean:

You have a separate Picky server.
The Picky server is not configured via the ActiveRecord model.
The ActiveRecord data is simply sent to the Picky server as-is for indexing after each commit.

A quick example of 4.0.9 ActiveRecord integration

First, configure a Sinatra Picky server to be open for external indexing.

class YourSearch < Sinatra::Base
  extend Sinatra::IndexActions

  # Configure indexes etc. as usual
end

Then, configure your AR model:

class Model < ActiveRecord::Base
  # These are the default options.
  #
  extend Picky::Client::ActiveRecord.configure(host: 'localhost', port: 8080, path: '/')

  # The model definition as usual.
end

And that’s it already :)

Direct integration

While the above is very nice, you still need a separate server.

Usually I advocate keeping search separate from the app, because normally, search and app have different goals. For example, caching for either needs to work differently. Search maybe needs to be restarted independently etc.

But sometimes, you simply want a quick and simple search to directly run in the one server you have.

So instead of setting up a separate server, we would integrate Picky directly in the model.

How would we do this?

A first simple implementation

At this point I am incredibly glad to have designed Picky to work and run anywhere.

Since you already can stick it anywhere (a Sinatra server, a DRb server, a simple script, a PORO, …), you can relatively easily stick it into an active record model.

How, you ask? Let me show you the whole thing and then pick it apart.

class Model < ActiveRecord::Base
  
  class << self
    data = Picky::Index.new :models do
      category :name
      category :surname
    end
  
    define_method :replace do |model|
      data.replace model
    end
    
    define_method :remove do |id|
      data.remove id
    end
  
    models = Picky::Search.new data
    
    define_method :search do |*args|
      models.search *args
    end
  end
  
  after_commit do
    if destroyed?
      self.class.remove self.id
    else
      self.class.replace self
    end
  end
  
end

Got that? If not, here’s a step by step explanation:

We want the index and the search object to reside in the (singleton) class to define methods there, so we open it:

class << self

Then we define a Picky index (two searchable categories, name and surname) and two methods. One to replace (“insert or update”) indexed models and one to remove indexed models with a given id:

data = Picky::Index.new :models do
  category :name
  category :surname
end

define_method :replace do |model|
  data.replace model
end

define_method :remove do |id|
  data.remove id
end

Why am I using define_method instead of def? I want to capture the data (index) and the models (search) in the block for these methods to use them later on.
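The difference is easy to try in plain Ruby, without Picky. In this small sketch (the class Example and its method names are made up), def opens a fresh scope and cannot see the local variable, while the block given to define_method closes over it:

```ruby
class Example
  class << self
    data = [:captured] # a local variable in the singleton class body

    # def starts a new scope, so the local variable data is not visible here.
    def with_def
      data
    rescue NameError
      :data_not_visible
    end

    # define_method takes a block, and blocks close over surrounding locals.
    define_method :with_define_method do
      data
    end
  end
end

p Example.with_def           # => :data_not_visible
p Example.with_define_method # => [:captured]
```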

These two methods, since defined on the class’ singleton class, are used like this:

Model.replace model

and

Model.remove model_id

These are all the methods that have to do with curating the index.

Finally, we want the class to update the index as soon as it changes. We use the AR 3.0+ after_commit callback for that:

after_commit do
  if destroyed?
    self.class.remove self.id
  else
    self.class.replace self
  end
end

So if the object has been destroyed, we remove it from the index (using the “class methods” we defined earlier). If it hasn’t, we simply replace the data.

Interesting to note: On a replace, Picky simply calls the methods the categories are named after: name and surname. So Picky can index not only Active Record attributes, but any method the model has.
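A tiny sketch of that idea, which is not Picky's actual indexing code: a made-up TinyIndex calls one method per category name via public_send, so a plain method such as full_name can be indexed just like an attribute. The Contact struct is also only an illustration.

```ruby
# Hypothetical miniature index, only to illustrate "category name == method name".
class TinyIndex
  def initialize(*categories)
    @categories = categories
    @data = Hash.new { |hash, key| hash[key] = {} }
  end

  # For each category, call the method of the same name on the object.
  def replace(object)
    @categories.each do |category|
      @data[category][object.id] = object.public_send(category).to_s.downcase
    end
  end

  # Ids of all objects whose value in the given category equals the text.
  def ids_for(category, text)
    @data[category].select { |_, value| value == text.downcase }.keys
  end
end

Contact = Struct.new(:id, :name, :surname) do
  # Not a stored attribute, just a method; it can be indexed all the same.
  def full_name
    "#{name} #{surname}"
  end
end

index = TinyIndex.new :name, :full_name
index.replace Contact.new(7, 'Florian', 'Hanke')

p index.ids_for(:name, 'florian')            # => [7]
p index.ids_for(:full_name, 'Florian Hanke') # => [7]
```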

First conclusion

You can already do this in the current Picky version 4.0.9.

However, this has a few disadvantages:

The indexes aren’t yet saved. (Hint: Picky::Indexes.dump)
If they were saved, they would not yet be reloaded.

How do we do this? The dumping is relatively easy, but how do we get the data back into that index when restarting and loading the index? If you’re into trying to implement that, have a go. If not, stay tuned! :)

Another question for you: Is sticking the method on the Model like

Model.replace model

actually a good idea? What if, say Thinking Sphinx, reloads your models? Is your model – being an AR model – not already doing enough? What about the single responsibility principle?

It’s already night here at Rails Camp X Adelaide, so good night. And good luck. Stay tuned!

Picky Active Record

2012-01-04T00:00:00+11:00

This post is about the challenges of designing an Active Record interface for Picky.

When we last looked at writing a nice ActiveRecord integration, around version 2.0, and then 3.0, Picky the server wasn’t ready yet.

What was missing?

Most importantly, an interface to save updates as they come in (in Picky: Index#add, Index#remove, Index#replace). Secondly, the possibility to dump indexes during runtime (Index#dump).

How would we go about designing an Active Record interface for Picky? How do others do it?

Others

Some search servers (like Sphinx) do not really offer an interface for live updates, but instead go the route of cleverly reindexing from a central data repository.

Other search servers offer HTTP interfaces (for example elasticsearch with its JSON POST/PUT etc. interface).

Since it is a nice and flexible standard interface, it enables interested coders to write software for it, for example Tire. This is a great way of attracting effort.

Another idea would be to open a port the engine listens on, pipes, or any form of communication imaginable.

In any case, Picky needs a standard interface.

The rough idea

Our rough idea is to listen for updates in the server and create a gem for use with active record (and others), which talks to the server every time some data is updated.

What are the challenges in the server?

The Server

Picky does not have a standard external interface beyond the Picky::Index and Picky::Search, which searches over the indexes.

index = Picky::Index.new(:name) do
  # ...
end
things = Picky::Search.new index
things.search 'something'

Of course, this is a very flexible approach, but comes with the problem that we need an implementation for all the different containers of Picky.

In the case of Sinatra, it will offer an HTTP interface to which the picky-activerecord gem will send updates.

Let’s see how we would implement that. For updates, we will define a put action:

put '/' do
  index_name = params['index']
  index = Picky::Indexes[index_name.to_sym] # Get the right index from the indexes.
  index.replace_from params['data']
end

The method replace_from(hash) is available in edge currently. Error handling is omitted.

Then we can write up the DELETE action etc., wrap it into a nice module Picky::Interfaces::External, for example.

Finally, if someone wants their indexes updated by anything external, she would extend the Sinatra app with that Module:

class MyPickyServer < Sinatra::Base
  extend Picky::Interfaces::External
  
  # ...
end

Then, when we’d like to create/update/delete an indexed entry, we simply send an HTTP request to the Picky server with the following payload:

{
  index: "people",
  data: {
    id: 7,
    name: "Florian",
    surname: "Hanke"
  }
}
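On the client side, building such a request needs nothing beyond Ruby's standard library. The sketch below assumes the server reads params['index'] and params['data'] as in the put action above. It also assumes, and this is a guess rather than something fixed in this post, that data travels as a JSON string inside a form-encoded body. The helper name build_update_request is made up, and the request is only built, not sent.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a PUT request for one entry.
def build_update_request(index_name, attributes, path = '/')
  request = Net::HTTP::Put.new path
  # Assumption: "index" and "data" are form params, with data encoded as JSON.
  request.set_form_data 'index' => index_name.to_s,
                        'data'  => JSON.generate(attributes)
  request
end

request = build_update_request :people, id: 7, name: 'Florian', surname: 'Hanke'

# Decode the body again to check what the server would receive.
params = URI.decode_www_form(request.body).to_h
p params['index']                    # => "people"
p JSON.parse(params['data'])['name'] # => "Florian"
```

Sending it would then be a matter of something like Net::HTTP.start(host, port) { |http| http.request request }, with host and port taken from the client configuration.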

Sounds easy so far, right?

Ah, but what if we stop and restart Picky? What happens to the indexed data?

When Picky is restarted

Let’s say you don’t use the realtime SQLite or Redis persistent backend to store your indexes, but the standard Memory backend.

If we simply restarted, we would lose the indexes. We need a way to dump the data. One way to do this is simply dumping it when you quit Picky:

at_exit { Picky::Indexes.dump }

or a specific index:

at_exit { the_index.dump }

And then, as you restart the server, you simply load the indexes. Probably in config.ru:

Picky::Indexes.load

I’m quite excited about this!

Sure, you have to write this yourself, but … you also CAN write it yourself. And control the behaviour of it. Dump it every X requests? Only on exit? I don’t care! (I mean, I do, but not how you do it :) )
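As one example of such a policy, here is a sketch that dumps every N requests. The class PeriodicDumper is made up for this post, and the dump itself is passed in as a block so that the sketch runs without Picky. In a real server that block would presumably call Picky::Indexes.dump.

```ruby
# Hypothetical helper: runs the given block on every nth call to #tick.
class PeriodicDumper
  def initialize(every, &dump)
    @every = every
    @dump  = dump
    @count = 0
  end

  # Call once per request; triggers the dump on every nth call.
  def tick
    @count += 1
    return false unless (@count % @every).zero?
    @dump.call
    true
  end
end

dumps  = 0
dumper = PeriodicDumper.new(3) { dumps += 1 } # stand-in for Picky::Indexes.dump

7.times { dumper.tick }
p dumps # => 2
```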

In closing, I like that the picky-activerecord documentation will only need a single line for the server: Add extend Picky::Interfaces::External to your Sinatra app.

Other interfaces?

At the beginning, we will focus on writing an experimental/standard Sinatra interface.

This will result in a nice Module that people can use to make their Sinatra Picky server open to external updates.

But what about other interfaces?

Since we expect the Picky Sinatra external interface to only be around 20-30 lines, we’ll just leave it open for now and implement as the need arises.

The Client

We’ll save the discussion on the client for later, but just quickly outline the ideas:

It should offer a simple and easy configuration possibility, with the default being host: 'localhost', port: 8080, path:'/'.
It hooks into the after_save callback.
It offers the possibility to save arbitrary data (not just model Person, or Company etc., but arbitrary hashes, like Music, including a list of Genres, even though that combined object doesn’t exist – I might note that it could make great sense to create a combined model like this).
It should be less than 100 lines. I’m not kidding.

You

What do you think of the server design? Any obvious flaws? Ideas? Suggestions by those who have used other, similar interfaces?

Have you already started on writing a picky-activerecord gem? :D

In any case, thanks for following the slow but steady progress of Picky!

Picky Recipes

2011-12-29T00:00:00+11:00

I’m currently putting together a collection of Picky recipes.

I noticed that people who wanted to try Picky had a bit of trouble getting into it. Sure, there is the getting started guide on the main page. And there’s also videos and blog posts.

The Quick Demo Fix

BUT. The question I should have asked myself is: When I try something for the first time, what do I need? I guess, like many others, I am guilty of being rather lazy when trying software – I need a quick fix.

This led me to put up a quick copy and paste code example on the main page. I haven’t received any feedback on it yet (except by a friend who urged me to use syntax highlighting), but I am happy with it. It shows the strengths of Picky off nicely.

Customized Search Engines in Projects

However, all of the Picky projects are projects where the search engine needed to be modified from a little to – mostly – a lot. And in all the cases I or someone else helped getting it right. I don’t believe this is a problem of Picky, but mostly a mixture of not knowing what options there are and the fact that Picky is not a “boolean” search engine framework. (And, I might add, some of the stunts would not be possible with a non-flexible search engine like … not-Picky)

The Recipes

This led me to start putting together a few examples which you can copy and paste quickly to see how something works and how it can be used.

The first 25 simple recipes are pushed to the picky repo. You can clone the project, then run the recipes by using “rake” on the command line inside the recipes directory.

Let me show one or two that I like to whet your appetite.

Realtime Indexing vs. Static Indexing

These examples illustrate how to use the static vs. the realtime index.

See https://github.com/floere/picky/blob/master/recipes/basic/static_index.rb and https://github.com/floere/picky/blob/master/recipes/basic/realtime_index.rb.

Static indexing is easier if you only index once per day and are happy to use rake index. Realtime indexing shows you how you can update the index as you get new data.

Only finds evenly sized partials

This is a bit of a silly recipe, but it illustrates well how easy it is to add a custom partializer.

Partial searches refer to somebody being able to search for “flor” and still finding “florian” (use “flor*” to explicitly search partially for that word).

Now what we want is to find only partial words whose length is even. This is the recipe for it: https://github.com/floere/picky/blob/master/recipes/partial/customized.rb.

To use it, we just pass in our own partializer

data = Picky::Index.new :people do
  category :first  
  category :last, partial: Partializer.new # <= Passed in here.
end

that is defined as

class Partializer
  def each_partial text
    temp = text.dup
    temp.length.times do
      yield temp if temp.size.even?
      temp.chop!
    end
  end
end

Picky just needs an object with an each_partial method. Our special partializer chops the word apart until it is gone, and yields if the word is of even length.

Thus we only find a partial if it is of even length.
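To see what the partializer produces, we can run it on its own, no Picky required. The class is the one from above; only the collecting code is new. Note the dup when collecting: each_partial yields the same string object every time and chops it afterwards, so storing the yielded object directly would leave us with a list of empty strings.

```ruby
class Partializer
  def each_partial text
    temp = text.dup
    temp.length.times do
      yield temp if temp.size.even?
      temp.chop!
    end
  end
end

partials = []
# dup, because the yielded string is mutated by chop! later on.
Partializer.new.each_partial('florian') { |partial| partials << partial.dup }

p partials # => ["floria", "flor", "fl"]
```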

Wasn’t that easy?

With a Twist

Thanks to it yielding, we could have just wrapped one of the given partializers to do the work for us.

class Partializer
  def initialize wrapped = Picky::Partial::Postfix.new(from: 1)
    @wrapped = wrapped
  end
  def each_partial text
    # Pass the text on to the wrapped partializer, filter what it yields.
    @wrapped.each_partial text do |partial|
      yield partial if partial.size.even?
    end
  end
end

I like it! The partializer doesn’t really know what partializer it gets. However, it will still only yield partials that are of even length. Think of it as a filter when used in this style.
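The filter style can be tried without Picky as well. In the sketch below, PostfixStandIn is a made-up replacement for a built-in partializer (it yields the text shortened from the end, one character at a time), and EvenFilter is the wrapper idea under a different name so that the sketch stands on its own.

```ruby
# Hypothetical stand-in for a wrapped partializer: yields the text
# shortened from the end, one character at a time.
class PostfixStandIn
  def each_partial text
    text.length.downto(1) { |length| yield text[0, length] }
  end
end

# The filtering wrapper: it hands the text to whatever it wraps
# and only lets even sized partials through.
class EvenFilter
  def initialize wrapped
    @wrapped = wrapped
  end
  def each_partial text
    @wrapped.each_partial text do |partial|
      yield partial if partial.size.even?
    end
  end
end

partials = []
EvenFilter.new(PostfixStandIn.new).each_partial('picky') { |partial| partials << partial }

p partials # => ["pick", "pi"]
```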

Context Sensitive Advertisements

Let’s say you want to search for people via name or location. In addition, you’d like to show an advertisement next to the search results corresponding to the location.

See https://github.com/floere/picky/blob/master/recipes/advanced/advertisement.rb.

So if someone searches for “Florian Melbourne”, it should find a Florian in Melbourne, but also show an ad from Melbourne.

The problem is, if I just use two indexes (one for people, one for ads), if I search in both, the ad index won’t return any results if the query contains a name. So how do we make the ad search ignore names???

Picky tries to assign every search word to a likely category. What we’d like is to only assign to locations, and if it is a name, to just ignore it.

The magic thing to use here is ignore_unassigned_tokens. So if a name cannot be assigned to a category, it will simply be ignored. That’s it! Run the full example to see for yourself.

Yours?

If you have recipes to contribute, don’t be shy. I’d particularly be happy for a Rails one.

Outlook

I’ll be adding recipes as I go. What do you think? Do the recipes help? Do they bewilder you? Do you find what you are looking for? Why? Why not?

Unthinking Autoloader

2011-12-28T00:00:00+11:00

Let me inform you that the original title was “Autoloading Is Cancer”. That basically sets the scene.

Don’t know what autoloading is? Check out Peter Cooper’s quick intro.

Intro

I believe that autoloading is used for all the wrong reasons, and I posit that coders who use autoloading don’t really know why they use it.

If you are using it: Do you know why? Maybe I am unfair here, but this is a blog post to shake you up a little bit. You filthy autoloading pig.

Readable and clean code

After years of coding, one of the most important functions of code for me is its readability. I write readable code for myself, but first and foremost for everyone who has a problem with my lib’s functionality and/or simply wants to know how something works.

After all, code is the best documentation, and is used most as such (apart from its raison d’être of being run).

If others can go in and read your code, and even enjoy it, and are learning something from it, then you know that you have a great lib.

Even if the reader can’t use it right away, he/she can take away something from it.

Information transportation

To take away something from your code, it needs to transport information (into your brain that is) in the most efficient fashion.

Contrast and compare

Let’s see what information can be gained by reading this code:

require 'models/person'
require 'models/company'
require 'server/auxiliary'
require 'server/core'

From this code, I can take away a lot of things!

In order of importance:

4 files are required (obvious)
Apparently we have model-related code and server related code (obvious since someone has done his/her naming homework)

But much cooler:

The Person model is probably* most independent.
The Company model might depend on the Person, but the Person model is independent* of the Company model.
The server code might use the models.
The auxiliary server code is probably* independent of the core server code.

The * refers to the fact that the code, dynamically, might still not be independent (since it could refer to a constant in a method etc.). If it isn’t, and is required before the “dependency”, ewwww.
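A small sketch of that footnote, with both "files" squeezed into one script: a constant inside a method body is only looked up when the method runs, so Company can be loaded before Person exists and still depend on it at run time. The founder method is made up for the example.

```ruby
# Imagine this is models/company.rb, required first.
class Company
  def founder
    # Person is looked up when founder is called, not when the class is defined.
    Person.new 'Florian'
  end
end

# Defining Company worked although Person does not exist yet.
p defined?(Person) # => nil

# Imagine this is models/person.rb, required afterwards.
Person = Struct.new :name

p Company.new.founder.name # => "Florian"
```

The require order says Company is loaded first, yet it depends on Person dynamically. This is exactly the case the * warns about.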

So, why do I think this independence thing is so cool? Let me show you another example:

require 'server/core'
require 'models/person'
require 'models/company'
require 'server/auxiliary'

This code tells me with a high probability that something might be awry in this code and I definitely need to take a closer look. Do the models refer to the server? Does the server not use the models at all? Does the “auxiliary” code use the models? Is this just a naming problem or do we have something at hand that needs to be trashed?

Now that we know that one can gain quite a bit of information by reading code (surprise! :) ), let’s take a look at the autoloading example:

module Server
  autoload :Core,      'server/core'
  autoload :Auxiliary, 'server/auxiliary'
end
module Models
  autoload :Person,  'models/person'
  autoload :Company, 'models/company'
end

From this code we can take away the following things that are non-obvious:

Jack shit.

I have no clue as to who needs what, and why. No hints.

This code fails me in readability on so many levels. Never mind that it introduces unneeded complexity. Also, can you tell me what happens when code like this is run in forked child processes? What about threads? What about both?

Conclusion

I noticed that many people use autoloading for three reasons:

It’s a cool Ruby technique. (I am ignoring this one)
They don’t want to think about dependencies in their code. (Also ignoring this one)
It enhances startup time.

Startup Time

Although it might help a bit by spreading the loading code over the run time of your program, believe me: This is not the place to solve this problem.

A long startup time hints at deeper problems with code structure, unnecessary precaching, etc. in your lib or the libs you are using.

Autoloading is not a solution for slow startup time. It is, at best, a quick fix for a problem which really is begging for some brains to be applied.

Final Question

Why do you use autoloading? Do you have good reasons that I haven’t considered?

Why I don't use round brackets

2011-12-18T00:00:00+11:00

This is a blog post for once NOT ABOUT PICKY! :D So enjoy the tentacle-free space.

Let me be blunt: I really don’t like reading Ruby code that uses a lot of round brackets.

No, let me be blunter: I hate reading code that uses a lot of round brackets.

Actually, it’s like this: Round brackets are the training wheels of a Ruby coder. They might be useful in the beginning, but at some point they should come off!

But let me be less contrarian and just show you why I don’t use them anymore…

Weaning yourself off the training wheels

There’s a few good reasons why I don’t use round brackets anymore.

Less noise

Brackets introduce visual noise. Compare and contrast these two method signatures:

def extract_from(text)

with

def extract_from text

What do you gain by introducing brackets? Would you gain something by introducing them into text?

My name is(Florian Hanke)

If you think this text example has nothing to do with code then we have different views on code readability. The version without brackets is more legible to me.

Law of Demeter

You’ve probably heard of the Law of Demeter? If you haven’t, please read about it :)

Not wanting to use round brackets introduces a strain every time I am about to break the Law of Demeter.

Consider this code:

text = extract other_text

Now, if I wanted to call another method on the result, I’d have to write this:

text = extract(other_text).process

Spotting violations is easy for me. I just look for the brackets. If I see brackets in my code, I instantly know that they are there for a good reason and that I actually had a reason to break the Law of Demeter.
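Why the brackets are unavoidable here is quickly shown. In this sketch extract is a made-up method and String#reverse stands in for process. Without brackets, the call after the dot goes to the argument, not to the result:

```ruby
# Hypothetical extract: returns the first word of the text.
def extract text
  text.split.first
end

other_text = 'hello world'

# Without brackets this is extract(other_text.reverse).
a = extract other_text.reverse

# With brackets, reverse is called on the result of extract.
b = extract(other_text).reverse

p a # => "dlrow"
p b # => "olleh"
```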

Code like

a.b(c).d(e).f

is simply impossible for me, and that’s a good thing!

Typing

This is not about typing speed. It is simply about comfort. The comfort of not having to do bracket acrobatics™.

Not using brackets lets you type as if the code was free text.

As opposed to e.g. JavaScript, Ruby actually lets you do this, so take advantage.

Being explicit about no parameters

Two small counterpoints.

I use Rspec. Chances are, you use it as well.

There’s an expression that goes like this:

thing.should_receive(:some_method).once.with

It’s a fluent interface, so using parentheses is ok for me. One of the exceptions. However, I even add them explicitly to tell the future me that I really don’t expect any parameters:

thing.should_receive(:some_method).once.with()

Equals “with nothing”.

Another exception is the “gobbler” * argument to a method, where Ruby needs brackets to know what it is looking at.

def try(*) end

But I’m used to it!

Yes, and you’re also trained on QWERTY. Doesn’t mean it was a good idea.

But, but, I need to help Ruby with reading my code!

Please. You’re probably the first to cheer when the robot overlords arrive.

Conclusion

It’s a good idea to be sceptical.

I simply asked myself: Why am I actually using brackets when they are not needed?

I couldn’t think of good reasons, while I was able to find some reasons against using brackets.

Hence, no brackets.

WDYT?

Picky Search Options

2011-12-18T00:00:00+11:00

A few examples of the search options Picky offers.

We’re going to look at a simple example and how to search it with Picky 4.0!

The Copy & Paste Example

(This is the same example as in the last post)

The example is simple. We have an index of 4 persons (you might recognize the two famous ones). Each person has a first and a last name. Then we use a Search object on the index to search on it.

Go ahead, copy it into TextMate or similar!

require 'picky'

Person = Struct.new :id, :first, :last

data = Picky::Index.new :people do
  category :first
  category :last
end

data.replace Person.new(1, 'Donald', 'Knuth')
data.replace Person.new(2, 'Niklaus', 'Wirth')
data.replace Person.new(3, 'Donald', 'Worth')
data.replace Person.new(4, 'Peter', 'Niklaus')

people = Picky::Search.new data

results = people.search 'donald'

p results.ids
p results.allocations

This returns ids [3, 1] and the allocations [ [:people, 0.0, 2, [ [:first, "donald", "donald"] ], [3, 1]] ]. That might look a little funny, so let me explain: :people is the index name where it was found. 0.0 is the total weight. 2 is the total number of ids in this “allocation” (combination of categories). [:first, "donald", "donald"] is the category the query word was found in, together with the token and the original.

All clear?

Try searching for “Niklaus”:

results = people.search 'niklaus'

You should find ids [2, 4] and two allocations now, first in the first name, then in the last name.

Cool. Are there some options to fudge the search?

Sure!

boost

To move an allocation up in the ranking, we used weights (see last post).

Picky knows a trick that almost no search engine knows. It can boost combinations!

Look for:

results = people.search 'Donald Knuth'

Looking at the allocations, we see that Picky tells us that Donald was found in a first name, and Knuth in a last name:

[[:people, 0.693, 1, [[:first, "donald", "donald"], [:last, "knuth", "knuth"]], [1]]]

That’s pretty useful to know what was found where.

As people usually look for the first name, then the last name, we want to give this more boost.

Replace this:

people = Picky::Search.new data

with this

people = Picky::Search.new data do
  boost [:first, :last] => +3
end

Now try again:

results = people.search 'Donald Knuth'

A whole 3 points more! Try it the other way around:

results = people.search 'Knuth Donald'

We don’t get the boost. This is incredibly useful: If you look at how people search and then support them this way, they will find relevant results even easier!
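The order sensitivity can be pictured with a plain hash lookup. This is only an illustration of the idea and not how Picky implements boosting: the BOOSTS table and the boosted_weight method are made up, and the base weight 0.693 is taken from the allocation shown above.

```ruby
# Hypothetical boost table: a combination of categories maps to a bonus.
BOOSTS = { [:first, :last] => 3 }

# Adds the bonus for exactly this category order, if there is one.
def boosted_weight(base_weight, categories, boosts = BOOSTS)
  base_weight + boosts.fetch(categories, 0)
end

p boosted_weight(0.693, [:first, :last]).round(3) # => 3.693
p boosted_weight(0.693, [:last, :first]).round(3) # => 0.693
```

Because the array [:first, :last] is the hash key, [:last, :first] simply finds no entry and gets no bonus.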

max_allocations

Sometimes you only want the best allocation to appear in the results.

results = people.search 'Niklaus'

This finds two ids and two allocations, once in the first name, once in the last name.

Replace:

people = Picky::Search.new data do
  max_allocations 1
end

Now Picky only calculates 1 allocation. Try

results = people.search 'Niklaus'

Only the best allocation is found.

ignore_unassigned_tokens

Did Donald Knuth ever have the nickname “Popeye”? Try this:

results = people.search 'Donald Popeye Knuth'

Not really. But what if we want to find him even if one token cannot be assigned to a category?

people = Picky::Search.new data do
  ignore_unassigned_tokens
end

Try again:

results = people.search 'Donald Popeye Knuth'

Voilà!

This is incredibly useful for an advertisement search. Say in the ads index you only index the city where a person lives. If someone looks for Florian Hanke Melbourne, you can show the person relevant ads from Melbourne.

terminate_early

Search for niklaus, and tell Picky you only want 1 id:

results = people.search 'Niklaus', 1

Yes, Picky only calculates 1 id, but still calculates and returns all valid allocations. If you only really need the ids (the Picky interface needs the allocations), then this is unnecessary and could be faster.

Replace:

people = Picky::Search.new data do
  terminate_early
end

Try again:

results = people.search 'Niklaus', 1

Hey presto! Just one allocation.

This code

people = Picky::Search.new data do
  terminate_early +2
end

will tell Picky to calculate all necessary allocations, plus 2 following ones, for good measure.

ignore

Try this:

results = people.search 'Niklaus'

You’ll get results in first and last name. If you only wanted results from the first name, you’d search for this:

results = people.search 'first:Niklaus'

Cool. But let’s say: You, the search engine designer, don’t want anybody to find anything in a last name, for any reason. Using first: will select only first. But you might only want to remove the last category. Do this:

people = Picky::Search.new data do
  ignore :last
end

Try again:

results = people.search 'Niklaus'

Niklaus is no longer found in the last name.

You can give it even more:

people = Picky::Search.new data do
  ignore :first, :last
end

But that is pretty silly in this example. Picky won’t find anything anymore!

Conclusion

And those are the options the Picky Search object has. As you’ve seen in the last post, some searching is defined on the indexes, but some options are exclusive to the search side, and are only defined there.

It’s best to play a bit to unlock their versatility and power :)

Picky APIs

2011-12-18T00:00:00+11:00

A few examples of how to inject your own functionality into Picky.

We’re going to look at a simple example and how to customize it with Picky 4.0!

The Copy & Paste Example

The example is simple. We have an index of 4 persons (you might recognize the two famous ones). Each person has a first and a last name. Then we use a Search object on the index to search on it.

Go ahead, copy it into TextMate 2 Alpha or similar!

require 'picky'

Person = Struct.new :id, :first, :last

data = Picky::Index.new :people do
  category :first
  category :last
end

data.replace Person.new(1, 'Donald', 'Knuth')
data.replace Person.new(2, 'Niklaus', 'Wirth')
data.replace Person.new(3, 'Donald', 'Worth')
data.replace Person.new(4, 'Peter', 'Niklaus')

people = Picky::Search.new data

results = people.search 'donald'

p results.ids
p results.allocations

All clear?

Try searching for “Niklaus”:

results = people.search 'niklaus'

You should find ids [2, 4] and two allocations now, first in the first name, then in the last name.

What if you want to find the last name first? We add some weight to it!

Adding weight

By default, Picky already weighs the categories with a logarithmic weight. That is, the more a token occurs in a category, the “heavier” it is.

So this:

category :last

is actually

category :last, weight: Weights::Logarithmic.new

However, for “Niklaus”, that resolves to a weight of 0.0.
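That 0.0 is what you would expect from a natural logarithm of the number of ids a token has in a category, since “niklaus” occurs exactly once in the last names. A minimal sketch under that assumption (the method name logarithmic_weight is made up, and this is not Picky’s actual code):

```ruby
# Assumed behaviour: weight = natural log of the number of ids for a token,
# rounded to three decimals.
def logarithmic_weight(amount_of_ids)
  return 0.0 if amount_of_ids < 1
  Math.log(amount_of_ids).round(3)
end

p logarithmic_weight(1) # a token that occurs once in a category
p logarithmic_weight(2) # a token that occurs twice in a category
```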

So let’s add our own weight object. It just needs to respond to #weight_for(amount_of_ids) and return a float.

We ignore the amount and return a flat 12.3. Copy this in your example:

Weight = Class.new do
  def weight_for amount
    12.3
  end
end

and replace

category :last, weight: Weight.new

Now the last name comes first, with a weight of 12.3, not surprisingly.

[[:people, 12.3, 1, [[:last, "niklaus", "niklaus"]], [4]], [:people, 0.0, 1, [[:first, "niklaus", "niklaus"]], [2]]]

Picky provides a few weights itself:

Picky::Weights::Logarithmic.new The default.
Picky::Weights::Constant.new (with 0.0) or Picky::Weights::Constant.new(1.23) (with 1.23)
Picky::Weights::Dynamic.new { |str_or_sym| str_or_sym.size }

What if we want “Wirth” and “Worth” to be found at the same time?

Adding similarity

By default, Picky does not look for similar words.

This:

category :last

is actually

category :last, similarity: Similarity::None.new

Now, look for “warth~” (the ~ tells Picky to look for similar words):

results = people.search 'warth~'

You found nothing, right?

Picky only looks for similar words if the category enables it!

Let’s write a similarity such that both will be found. Copy this in your example:

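Similarity = Class.new do
  # Encodes a text by removing its vowels.
  def encode text
    text.gsub(/[aeiou]/, '')
  end
  def prioritize ary, encoded
    # Left empty: we neither sort nor trim the list of similar tokens.
  end
end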

We encode a text such that its vowels are removed. This will make “wirth” and “worth” both resolve to “wrth”, and that makes them similar. (The prioritize method allows you to sort and trim the list of similar tokens.)

and replace

category :last, similarity: Similarity.new

Again, search for “warth~”.

results = people.search 'warth~'

This time you found both, right?
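If you want to see why, here is the encoding on its own, without Picky. The grouping by encoded form is just my illustration of the idea, not a description of Picky’s internal similarity index:

```ruby
# The vowel-dropping encoding, as a plain method.
def encode(text)
  text.gsub(/[aeiou]/, '')
end

last_names = %w[knuth wirth worth niklaus]

# Group the indexed last names by their encoded form.
similar = last_names.group_by { |name| encode(name) }

p encode('warth')          # the code the query is looked up under
p similar[encode('warth')] # the names sharing that code
```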

Picky offers Similarity::Soundex.new(amount_of_similar), Similarity::Metaphone.new(amount_of_similar) and Similarity::DoubleMetaphone.new(amount_of_similar). But rolling your own is easy, as you have seen.

Adding partial searching

Can you find Donald Knuth by entering “Donal”?

results = people.search 'donal'

You can. But why?

The word “donal” finds something because this:

category :first

is actually

category :first, partial: Partial::Postfix.new(from: -3)

That means it finds “dona”, “donal”, “donald”. Try them all!

Does it find “don”? Try it:

results = people.search 'don'

No, it doesn’t! We could use Partial::Postfix.new(from: -4) to include this case, but let’s write our own :)

Partial = Class.new do
  def each_partial text
    text = text.dup
    (text.size - 1).times do
      yield text.chop!
    end
  end
end

and replace

category :first, partial: Partial.new

Try again:

results = people.search 'don'

Now we find Donald. You can even do this with our partial code:

results = people.search 'd'

We still find him.

Now, Picky already offers a few partial behaviours:

Partial::None.new (Do not search for a partial)
Partial::Postfix.new(from: position)
Partial::Substring.new(from: position, to: position)
Partial::Infix.new(min: size, max: size)
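To get a feeling for what a postfix partial produces, here is a plain-Ruby imitation. It is a sketch based on the behaviour described above (from: -3 finds “dona”, “donal”, “donald”); the method postfix_partials is made up and is not Picky’s own code.

```ruby
# A negative `from` counts from the end of the word (-1 is the last character),
# a positive `from` counts from the start (1 is the first character).
def postfix_partials(text, from)
  return [] if text.empty?
  shortest = from.negative? ? text.size + from + 1 : from
  shortest = shortest.clamp(1, text.size)
  (shortest..text.size).map { |length| text[0, length] }
end

p postfix_partials('donald', -3)
p postfix_partials('donald', -4)
p postfix_partials('donald', 1)
```

With from: -4, “don” is part of the list, and with from: 1 every prefix down to the single “d” is.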

One important note: Picky always searches for the last token in the partial index, even without the asterisk next to the word. If it’s not the last word, you need an asterisk: “Don* Knuth”.

Boosting

To move an allocation up in the ranking, we used weights.

Picky knows a trick that almost no search engine knows. It can boost combinations!

Look for:

results = people.search 'Donald Knuth'

Looking at the allocations, we see that Picky tells us that Donald was found in a first name, and Knuth in a last name:

[[:people, 0.693, 1, [[:first, "donald", "donald"], [:last, "knuth", "knuth"]], [1]]]

That’s pretty useful to know what was found where.

As people usually look for the first name, then the last name, we want to give this more boost.

Replace this:

people = Picky::Search.new data

with this

people = Picky::Search.new data do
  boost [:first, :last] => +3
end

Now try again:

results = people.search 'Donald Knuth'

A whole 3 points more! Try it the other way around:

results = people.search 'Knuth Donald'

We don’t get the boost. This is incredibly useful: If you look at how people search and then support them this way, they will find relevant results even more easily!
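The reason the order matters can be shown with a plain hash lookup. This is only a sketch: BOOSTS and boosted_score are made-up names, and the 0.693 is simply the weight from the allocation shown above.

```ruby
# Made-up stand-in for the boost lookup. The key is the exact sequence of
# categories, so [:first, :last] and [:last, :first] are different keys.
BOOSTS = { [:first, :last] => 3 }

# Adds the boost for a sequence of categories to the combined weight.
def boosted_score(categories, weight)
  (weight + BOOSTS.fetch(categories, 0)).round(3)
end

p boosted_score([:first, :last], 0.693) # 'Donald Knuth'
p boosted_score([:last, :first], 0.693) # 'Knuth Donald'
```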

But what if we want to boost in a specific way?

Custom Boosting

Copy this into the example:

Boosts = Class.new do
  def boost_for combinations
    @map ||= {
      [:first, :last] => +5
    }
    @map[combinations.map(&:category_name)] || -20
  end
end

(A combination is basically a tuple of category and token)

and replace:

people = Picky::Search.new data do
  boost Boosts.new
end

Now try again:

results = people.search 'Donald Knuth'

A whole 5 points more! Try it the other way around:

results = people.search 'Knuth Donald'

A whopping -20, which would send this allocation back to the end of the list, were there more data.

Conclusion

I hope you’re going to try Picky in your next project.

See the next post for some fancy search options.

Picky 4.0

2011-12-18T00:00:00+11:00

Picky 4.0 release – a quick description of the goals and the changes from version 3.6.16. More to come later.

Goals

The ultimate goal of Picky is to become THE choice for a lightweight search engine, as flexible as possible, regarding the container on one hand (useable in a script/a Sinatra instance, a DRb server, wherever) and in itself on the other hand, offering a rich API where you plug in search engine behavior.

Release 4.0 is another big step towards these goals.

Thanks

Thanks to all who helped with this release! Among others: Roger Braun, Niko Dittmann, Kaspar Schiess, Glen Maddern.

Changes (tl;dr)

The one big change is that both the classic Picky application and classic Picky sources have been removed. If you need these, please continue using 3.6.16.

If you want to jump on 4.0, replace the classic application with a Sinatra app and convert your source into one that responds to #each (see the Wiki on sources).

Other important changes:

Picky::Index, option weights has been renamed to weight.
Picky uses the procrastinate gem to parallelize indexing.
Picky::Indexes.reload has been renamed to Picky::Indexes.load, and analogously on Index and Category.
If you call any define_* methods, please remove the define_ part.
If you defined a source { with a block }, the block is now evaluated each time the indexer runs on a category.
Rake task rake index:parallel is used by rake index. If you can’t index in multiple processes, please use rake index:serial.

Detailed Changes

This is for users that are currently on version 3.6.x. Extracted from the history.textile file:

hanke: (server) BREAKING Picky::Indexes.index does not index in parallel anymore.
hanke: (server) BREAKING Renamed Picky::Indexes.index_for_tests to Picky::Indexes.index.
hanke: (server) If you want to explicitly run parallel indexing programmatically, use Picky::Indexes.index Picky::Scheduler.new(parallel: true) or Picky::Indexes[:index_name].index Picky::Scheduler.new(parallel: true).
hanke: (server) BREAKING Renamed Picky::Wrappers::Category::ExactFirst to Picky::Results::ExactFirst. Extend instead of wrap: index.extend Results::ExactFirst or category.extend Results::ExactFirst. If an index is extended, each category of the index will be extended.
hanke: (server) BREAKING Picky::Indexes.reload has been renamed to Picky::Indexes.load.
hanke: (server) BREAKING index.reload has been renamed to index.load.
hanke: (server) BREAKING category.reload has been renamed to category.load.
hanke: (server) BREAKING Removed all define_... methods on indexes.
hanke: (server) BREAKING Removed Picky classic application. Please use Picky e.g. in a Sinatra app.
hanke: (server) BREAKING Removed Picky classic sources. Please use a source with the #each method.
hanke: (server) BREAKING Option weights for the Picky::Index#category method has been renamed weight to conform with the other methods.
hanke: (server) BREAKING Picky does not require the text gem anymore by default. Only when you use phonetic similarity. It will tell you what it needs.
hanke: (server) BREAKING Added the PICKY_ENVIRONMENT in front of the Redis key namespace to differentiate the various environments.
hanke: (server) BREAKING Removed rake routes since only the classic server was able to provide it.
hanke: (server) BREAKING Removed the classic server from the generators.
hanke: (server) BREAKING Reverting customizeable backends from version 3.3.2. They are no longer available. Please use simple subclassing to achieve funky backends.
hanke: (server) BREAKING SQLite self_indexed and Redis immediate option is now called realtime, as changes go directly through to the actual backends, in “realtime”.
hanke: (server) BREAKING The tokenizer option for a category has been renamed to indexing, to conform with the methods for the index and the sinatra app.
hanke: (server) BREAKING Internal Similarity#encoded method has been renamed to #encode.
hanke: (statistics) Overhauled statistics interface. Use picky statistics log/search.log to start it.
hanke: (server) The Index#source block is now evaluated every time an indexer runs.
hanke: (server) Explicitly uses Yajl::Encoder#encode for JSON encoding.
hanke: (server) Fixed cases where even when no similarity was defined on a category, similar results were still found.
hanke: (server) Rake task index now points to task index:parallel by default. Call rake index:serial to index serially.
hanke: (server) Indexer calls reconnect! on sources that support it.
hanke: (server) Location/Volumetric/Geosearch rewritten.
hanke: (generators) Fixed integration specs for the generated “all in one” server/client.
hanke: (generators) Changed method calls to adapt to above changes.
hanke: (server) Using the procrastinate gem to parallelize indexing.
hanke: (server) Indexing call structure cleaned up. Improves performance by about 40%.

Picky Search Performance (Backends)

2011-11-20T00:00:00+11:00

This is a post about Picky performance when searching in various backends.

But first, a picture that was taken during the performance tests:

How is taking this picture possible, you ask? I am writing this from a hospital.

Heh, no. Not really.

tl;dr

In the single-process/single-threaded case on one core of a 2.66 GHz i7 Macbook Pro, Picky’s search performance ranges from 0.0001s for a single-word query on the memory backend to 0.01s for a three word query on the Redis backend. Around 0.0003s per query on the memory backend for a more realistic case.

Why?

We are currently working on designing the Picky backends, amongst other ideas, to enable realtime indexing.

If you want to contribute a backend, please do!

The raw data

In descending order of performance, we evaluated four backends that are available: Memory, File, SQLite (graciously donated by Roger Braun) and the Redis backend.

The 10 – 100000 show the number of objects in the database. The columns 1-3 denote the complexity. 1 is just using one word, and 3 means we looked for three words.

We were wondering about the Redis backend a bit, and also the file backend (see below). Memory and SQLite are as expected. What did we expect?

Expectations

All of the following charts show the three different complexity levels in various index sizes (objects indexed).

Since the memory backend runs fully in memory (duh), we get the best performance there. It’s all fully in memory, so none of the dirty slow stuff even gets touched.

With the exception of that dirty old man that touches everything, the Ruby Garbage Collector.

The file backend (very naïve, see here) surprised us a bit, since we are actually loading JSON encoded data from a file.

However, seeking in Ruby and decoding with Yajl (Yajl::Parser.parse IO.read(cache_path, length, offset)) is apparently quite fast.

Tests of a first draft of a SQLite database (by Roger Braun) show lots of promise as well.

Redis is rather slow, as expected. However, this is not just Redis’ fault. The current implementation does three roundtrips per simple internal query.

For example, in the three-word case, with four different categories each word can be in, this results in 36 up to 72 roundtrips. And for that, the Redis backend performs very well.

With the arrival of Redis 2.6.0, we will make use of the Lua scripting and the EVALSHA command to divide the number of roundtrips by 3.

That will, for a four category, three word query result in only 12 up to 24 roundtrips. Still a lot, but this should prove to be much faster.
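The arithmetic behind these numbers, as a quick sketch. It assumes the lower bound is simply words times categories times roundtrips per internal query, which matches the 36 and the 12 above; the method name is made up, and the upper bounds (72 and 24) are twice the lower ones.

```ruby
# Roundtrips = words * categories * roundtrips per internal query.
def roundtrips(words:, categories:, per_query:)
  words * categories * per_query
end

current  = roundtrips(words: 3, categories: 4, per_query: 3)
scripted = roundtrips(words: 3, categories: 4, per_query: 1)

p current  # lower bound with three roundtrips per internal query
p scripted # lower bound with one roundtrip per internal query
```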

One Redis behaviour that surprised us a lot was that for the “complexity 3” case, where we looked for three words, the performance of Redis in the graph remains constant. Why does it remain constant, and why doesn’t it show the same behaviour as the other cases?

Turns out, the curve does exactly the same, but is squished, because the complexity tends to make a large difference to the baseline.

If you look at just the “complexity 3” case (here in blue instead of yellow), we can see the same behaviour.

What happens is that for the multi-word case, the amount of expensive roundtrips shoots up. The amount of combinatorics and calculations that Picky does is just the cherry on top of a large roundtrip cake.

For four words, this would be even worse: We would have to search for the line around 0.02s.

We hope to reduce this greatly with Redis 2.6.0 and expect a 3-4x speed increase.

Comparisons

Comparing each of the complexity cases (1 word, 2 words, 3 words) for the backends, they are nicely evenly spaced apart.

That is, on a log scale. From Memory to File, from File to SQLite, and from SQLite to Redis we each have about a 2x query time increase. Comparing Memory and Redis, we thus get about an 8x increase (actually, more like 10x).

While for the one word case, the data remains quite flat as the index size increases, the impact on performance is very noticeable in the three word cases.

A note on the index sizes: Yes, 100’000 entries is not a very realistic size (we do not have access to large servers yet). But it is enough to see Picky’s behaviour regarding speed. Moreover, the curves’ behaviour is quite predictable and can be extrapolated from the curves seen above.

For example, if you extend the curve of the memory case to 1000 times the size (to 100’000’000 entries): In complexity case 1 it arrives at 0.0002s, in complexity case 3 at around 0.005s.

In the case of 15’000’000 entries, this is exactly what we found to be true for the memory case. Please see use case 1 on the Picky page.

Selecting a backend

What does it mean for you when choosing the backend?

If you need a realtime index, then the only backend that supports this is the Memory backend (current version at the time of this post is 3.5.4). We are working on getting the others up to speed, but this is what’s there for now.
If you need persistence and/or distributed Pickies, we recommend the Redis backend. Speed may not be fantastic, but from Redis 2.6.0 on it will be quite a bit faster. We predict around 3-4 times faster.
The File and SQLite backends are still in development. Use the File backend when you have a static index and do not want to use too much memory. The same holds for the SQLite backend, with the improvement that you have all the SQLite tools at your service.

As usual, it’s a tradeoff between speed, space, tools etc.

The code

The code for these tests is here:

http://github.com/floere/picky/blob/master/server/performance_tests/search.rb

We generated sets of 10-100000 indexed things, each with 4 categories and an id. Then we randomly selected data from the indexes, and in roughly half of the cases searched for just a part of the word, for which Picky uses a partial search.

We ran 100 random queries each, and divided the resulting time by 100 to get an average per-query-time.
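In spirit, the measurement looks like the following sketch. It is not the linked test script: the search lambda is a stand-in for a real Picky search call and the query words are made up.

```ruby
require 'benchmark'

# Stand-in for a real search call, e.g. `things.search(query)`.
search = ->(query) { query.downcase.split }

# 100 random queries of one to three words each.
words   = %w[peter paul victoria]
queries = Array.new(100) { words.sample(rand(1..3)).join(' ') }

# Time all queries together, then divide to get the per-query average.
total = Benchmark.realtime do
  queries.each { |query| search.call(query) }
end

average = total / queries.size
puts format('%.7fs per query', average)
```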

A note on combinatorial search engines

Combinatorial search engines are hard to performance test.

If in a phone book search on Picky you search for “peter paul victoria”, Picky evaluates what you are most likely looking for. This involves a fair bit of calculation.

In the mentioned case, if “peter” can be a first name, name, street, city, and the other words are similarly ambiguous, then Picky has to look at all the possible combinations and has to find out which one is the one that is most likely, based on the weights and boost you defined.
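To see how quickly this grows, here is the worst case for the example above, counted in plain Ruby. This is just an upper bound under the assumption that every word can be in every category; the category names are taken from the sentence above, and in practice only combinations where the words actually occur need to be looked at.

```ruby
tokens     = %w[peter paul victoria]
categories = %i[first_name name street city]

# Every token can be in every category, so the worst case is the
# cartesian product of the categories, once per token.
others       = Array.new(tokens.size - 1) { categories }
combinations = categories.product(*others)

p combinations.size  # 4 ** 3 possible assignments
p combinations.first # one candidate: one category per token
```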

Now, this is very dependent on the data underlying it. So I tried to use relatively standard data.

So, in closing, it must be said that it is hard to compare this style of search engine to one of the generic search engines. But Picky would really like to take one on soon ;)

Picky Update Performance

2011-11-13T00:00:00+11:00

This is a post about Picky performance when updating realtime indexes.

tl;dr

In the single-process/single-threaded case on one core of a 2.66 GHz i7 Macbook Pro, Picky realtime index update performance ranges from 500 updates/s to 25’700 updates/s. Around 2’300 to 5’100 updates/s for a default case.

Quick realtime index refresher

If you didn’t know, since 3.2.0, you can add/remove/replace (update) objects from and to a Picky index. In realtime.

For example,

index = Index.new :things do
  category :text
end

index.replace thing_with_text_method

would replace the index data for the thing_with_text_method.

If you added a search interface for the index,

things = Search.new index

you could also search for it and it would return different things if you changed the index in between.

things.search "some thing" # => Finds the thing.

index.remove thing_with_text_method.id

things.search "some thing" # => Finds it no more.

The Setup

All numbers are valid for a 2.66 GHz i7 Macbook Pro (one core of it) with 4GB 1067 MHz DDR3 RAM, using Picky 3.5.3 on ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0].

For testing performance, we randomly pregenerated a large set of objects with methods id, user (8 random characters), text1 (20 random characters, drawn from the 26 letters of the alphabet and 5 spaces), and text2, text3, text4 (see generation of text1).
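Such a data set could be generated roughly like this. It is a sketch, not the original test code: I am assuming the character pool for the texts is the 26 letters plus 5 spaces, and the names Thing, TEXT_POOL, random_string and random_thing are made up.

```ruby
# Assumed character pool: 26 letters plus 5 spaces, so that roughly
# one in six characters of a text is a space.
LETTERS   = ('a'..'z').to_a
TEXT_POOL = LETTERS + [' '] * 5

Thing = Struct.new(:id, :user, :text1, :text2, :text3, :text4)

# Builds a random string of the given length from the pool.
def random_string(pool, length)
  Array.new(length) { pool.sample }.join
end

# One object with an 8 character user and four 20 character texts.
def random_thing(id)
  texts = Array.new(4) { random_string(TEXT_POOL, 20) }
  Thing.new(id, random_string(LETTERS, 8), *texts)
end

things = Array.new(1_000) { |i| random_thing(i + 1) }
p things.first
```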

Then, we used the following index. To make everything easier, we used a config variable to enable/disable categories.

include Picky

config = 0 # 1, 2, 3, 4.

index = Index.new :things do

  weights    = Weights::Default    # These configurations were changed.
  partial    = Partial::Default    #
  similarity = Similarity::Default #

  if config >= 0
    category :user,
             weights:    weights,
             partial:    partial,
             similarity: similarity
  end
  if config >= 1
    category :text1,
             weights:    weights,
             partial:    partial,
             similarity: similarity
  end
  if config >= 2
    category :text2,
             weights:    weights,
             partial:    partial,
             similarity: similarity
  end
  if config >= 3
    category :text3,
             weights:    weights,
             partial:    partial,
             similarity: similarity
  end
  if config >= 4
    category :text4,
             weights:    weights,
             partial:    partial,
             similarity: similarity
  end
end

If config was for example 3, Picky used categories :user, :text1, :text2 and :text3.

We then varied the configurations for weights, partial, and similarity. Weights control how the categories are weighed. Partial controls how you can search partially (for example just for the first character, or only for the exact word). And similarity defines whether you can search for words similar to the one you entered.

We indexed until the average update/s value stabilized.

The Chart

A quick explanation of the legend: It is ordered weights/partial/similarity. From fastest to slowest, the options were…

Weights (2): Constant, Default/Logarithmic.

Partial (3): None, Default/Postfix(from: -3), Postfix(from: 1).

Similarity (3): Default/None, Soundex(3), DoubleMetaphone(3).

We did not explore all combinations. The numbers were rounded down to the nearest hundreds.

From left to right, we first indexed just the user category, then successively added text1, text2, text3, and text4.

The baseline (not shown), when no category was defined, was 212'000 updates/s.

The absolute winner is indexing just the 8 character user category, with a constant weight, no partial indexing, and no similarity, at 25'700 u/s.

Usually though, you’d want Picky’s weighing and scoring to be used. So, the same scenario, with no partial/no similarity yields a speed of 22'000 u/s. In a more realistic case with 3 text categories and 1 user category, it is 5'400 u/s.

For added convenience, you’d use the default partial algorithm, which also includes parts of words, from the 3rd last character of a word to the last. With default weighing, and no similarity, this yields 1'600 up to 10'400 u/s. This is with all settings to default (Default/Default/Default), as if you had defined nothing:

category :text1 # etc.

If you are interested in similarity, but not partial, the numbers range from 900 to 6'200 u/s.

The most brutal case, standard weighing, full partial, and best similarity costs dearly: Only 500 up to 4'700 updates per second.

If you like pure numbers better than a graph, here’s a table for you:

Now, let’s all “jump” to conclusions! ;)

Conclusions

First of all, we are very happy with the numbers. We did expect a much lower performance. (Sorry, Picky)

How Picky weighs and scores the data didn’t impact the results much. This is no surprise, as no string manipulation is done.

Partial indexing impact was what we expected. Around a 40% to 60% reduction from (Default/None/Default) to (Default/Default/Default) in speed depending on how many text categories were indexed. The jump to (Default/Postfix(from: 1)/Default) – an all inclusive partial – is around 50% to around 70%.

The worst impact comes from similarity indexing: Using similarity brings down indexing speed to about 25% (no partial) to 50% (also using partial).

The big takeaway: Text categories with much content are most important, followed by whether you do similarity, followed by whether you do partial searches. Weighing almost plays no role.

The bigger takeaway: Picky is fast when updating indexes. And does it in realtime. On Ruby.

Search Engine in a Script

2011-10-26T00:00:00+11:00

This is a post about running Picky in a small script.

Design Philosophy

You all know that with Picky we want the full flexibility of Ruby.

What we also want is a search engine that runs with a minimal setup. Small and sweet. Portable, lightweight, bam!

In short: Picky wants to be the Sinatra of search engines. Did we achieve this? Not yet, but we are very close indeed.

Let’s have a quick dance in the rain.

The code

Go ahead and replace Picky with a very small script! You can copy, right? And paste? Ok. Go.

# Possible since Picky 3.2.0.
#
require 'picky'

include Picky

Thing = Struct.new :id, :name

index = Index.new :test do
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
end

index.replace Thing.new(1, 'Picky')
index.replace Thing.new(2, 'Parslet')

things = Search.new(index) do
  boost [:name] => +3
end

p things.search("Pick").ids
p things.search("Pic").to_hash
p things.search("Parsley~").allocations

That’s it. Easy to just try something, and later evolve it into a fully fledged, super-powerful search engine.

Twitter Account

2011-10-23T00:00:00+11:00

Our loveable octopus, Picky, got a Twitter account under @picky_rb.

In one of his first tweets he noted: “Man I look fat in my profile image. I really should cut back on crabs and molluscs.” Yeah, you should. Also on rich indexes, I might add.

He will not tweet very often, apparently, as he mentions in this tweet. Just version updates and some personal life stuff.

He can be a bit of a blabbermouth. Let’s hope he can control himself.

Last thing I heard he was engaged in a semi-epic battle in the Mariana Trench with his bigger buddies, against whales. In a DM, in his usual style he wrote me: “Battling the big blue ones. 5 suitably categorized targets found in 0.000013s. Wish me luck! P.S: Don’t snack on my sushi.”

I hope he makes it out alive. We need to get going on this realtime indexes update.

Designing Realtime Indexes

2011-10-23T00:00:00+11:00

This is a post about designing realtime indexes for Picky.

Realtime indexes are an exciting thing! The possibility of inserting something into a search engine, then having the thing pop up immediately in results is fantastic. Wouldn’t you love that in Picky? Man, me too!

Too bad that we have yet to implement it. Heh.

On the other hand, some good TDD should do it. “TDD” you ask? TDD of course, is the noble activity of Thought Driven Development. Also known as QDD, Question Driven Development.

Let’s fire up the cranial engines and get the gray matter bubbling.

Specifically, I’d like to talk about the API, and how to implement it in Picky. Along the way I will touch on the inverted index, necessary bookkeeping, how I will implement it, how to use it and how not to use it, the latter being more important than the former.

What is a realtime index?

A realtime index is an index that can have e.g. text indexed at runtime, and that returns results for that text immediately after indexing.

One example for a Ruby realtime search engine is Whistlepig by William Morgan.

Ok, let’s talk about what we want.

What I want

In some ways, writing Picky has been like being the first person in a group of mountaineers: You climb a mountain. It’s a bit taxing. But we take it one step at a time. The summit is visible at all times. Meaning: Goal clear, steps towards it too.

As a first step towards a multi-process, multi-threaded realtime index we’d like to get it working for a single process, for a single thread. Then cross the next bridge when we get to it.

When designing software, I have yet to see a case where designing multiple things at the same time is better than focusing on a single thing at first.

Ok, so let’s just look at what we want:

API

Let’s offer three methods on an index:

#remove(id) (Removes an element with a given id from the index)
#add(object) (Adds an element with a given id to the index)
#replace(object) (#remove followed by #add – this is what you’d use usually)

We could just do the replace method first, but there might be cases where you’d want to remove and add separately. When you have a producer and a consumer, for example.

What would they return? I’m quite unsure yet. Let’s leave that out for later. Maybe you have some ideas?

How would we call these?

index = Picky::Index.new(:example) do
  # index definition
end
index.remove(13)
index.add(thing)
index.replace(thing_that_responds_to_the_id_method)

You call these methods from a Sinatra or Rails action, from a Signal trap, etc.

What I do (not yet) want

To focus on a good SRP implementation of realtime indexes, we won’t (yet) implement:

Multiprocess
Multithreading

and work for the (assumed) 80% case where we want to have more recent objects sorted at the top of the results (Whistlepig also does this).

So, realtime indexes will be sorted initially like a normal index, but will then gravitate towards a “most recent first” sorting.

Summary

So, what we want is a realtime index that lets us add and remove elements at runtime. Elements that are removed will not show up anymore in the results and elements that are added will show up on top of the results.

So,

thing_search.search "blah" # => [1,2,3]
index_of_thing_search.add(thing_with_id_5_and_text_blah)
thing_search.search "blah" # => [5,1,2,3]

is what we want.

That is an (assumed) default case and this is what we will go for in this first implementation.

The inverted index

Amongst other things, Picky contains an Inverted Index that is central to most search engines.

We’ll review it quickly so you can follow the implementation.

In its simplest form, the inverted index saves tokens that point to a list of ids.

{
  :token1 => [1,4,2,5,6],
  :token2 => [3,4,8,2]
}

and so on. This makes it easy to look up text that the user is looking for. Just do a

ids = inverted_index[text]

and you have all the ids that contain that text.
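If you have never built one, here is a tiny inverted index in plain Ruby. The documents are made up, and splitting on whitespace is a simplification compared to what Picky does with a text before indexing it:

```ruby
# Made-up documents: id => text.
documents = {
  1 => 'picky is a search engine',
  2 => 'whistlepig is a realtime search engine',
  3 => 'picky runs on ruby'
}

# Build the inverted index: token => list of ids containing the token.
inverted_index = Hash.new { |hash, token| hash[token] = [] }
documents.each do |id, text|
  text.split.uniq.each { |token| inverted_index[token.to_sym] << id }
end

p inverted_index[:picky]
p inverted_index[:search]
```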

Picky has quite a few more internal indexes that help it look stuff up, but we’ll focus on the inverted index here.

All clear? Now let’s add realtime indexing to that.

The naive approach

So, given that we have an inverted index like

{
  :picky => [1,2,3,4,5],
  :whistlepig => [5,6,7,8,9]
}

and we want to remove an id, say 5, in a naive way.

We could just iterate over all arrays on the values side of the hash. Here, this would be easy:

inverted_index.each do |_, ids|
  ids.delete id_to_remove
end

You probably already see the problem. On a 12GB index (the first Picky production use case), this would take a loooong time.

So, although nice and very understandable, this is not feasible.

We need to make it faster.

A better way

Q: How do you make something faster in computer science?

A: Get a bigger computer?

A: More processors?

A: Uh, why are you looking at me like that?

Q: whacks student with a large trout

But seriously, if you want to get speed, you have to sacrifice space. Hello, age-old trade-off.

This always means adding some sort of data structure, since when I say space, I mean data structures. And this means complexity. From which follows that we have consistency troubles ahead of us.

Anyway, on with it!

The fast approach that needs some bookkeeping™

So, instead of iterating over all id arrays, we should remember which array had a certain id in it.

How would you do this?

Hello Mr. Hash. We remember which id was in which array. So we have a telephone book of ids that maps to the id array references, such that:

{
  1 => [[1,2,3,4,5]],             # reference
  5 => [[1,2,3,4,5], [5,6,7,8,9]] # references
}

Now we can ask this mapping to find out incredibly quickly which arrays we need to update:

array_of_id_arrays = mapping[5]

A bit of a kicker / homework

I’ve got a question for you:

In the case of removing an id, how would you remove it? Look at

mapping[5].each do |id_array|
  id_array.delete 5
end

Does this work? Has the array in the hash changed? If so, why? If not, why not?

Hint: It’s not an accident I was talking about references, above.

Adding

Removing is relatively easy. How about adding?

When adding, we process the data to get tokens, then look up each token in the inverted index, prepending the id to the id_array.

tokens.each do |token|
  inverted_index[token].unshift id
end

Easy as well.

Note: This is only a good thing to do if the id isn’t in the index yet.
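Putting adding and removing together, the bookkeeping could look roughly like the sketch below. The class and method names are invented for this post, and it only covers the inverted index and the id mapping, not Picky's other internal indexes. If the id is already indexed, it is removed first, which takes care of the note above.

```ruby
# A tiny realtime index: an inverted index plus a reverse mapping
# (id => the id arrays it appears in). Names are made up for this sketch.
class TinyRealtimeIndex
  attr_reader :inverted, :mapping

  def initialize
    @inverted = Hash.new { |hash, token| hash[token] = [] }
    @mapping  = Hash.new { |hash, id| hash[id] = [] }
  end

  # Adds an id under each token. An already indexed id is removed first.
  def add(id, tokens)
    remove id if @mapping.key?(id)
    tokens.uniq.each do |token|
      ids = @inverted[token]
      ids.unshift id
      @mapping[id] << ids # Stores the very same array object, not a copy.
    end
  end

  # Removes an id by visiting only the arrays that contain it.
  def remove(id)
    (@mapping.delete(id) || []).each do |ids|
      ids.delete id
    end
  end
end

index = TinyRealtimeIndex.new
index.add 1, [:picky]
index.add 5, [:picky, :whistlepig]
index.remove 5
index.inverted[:picky] # the ids still indexed under :picky
```

Note that empty id arrays stay in the inverted index after a removal. Whether to clean those up is a separate decision.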

Conclusion

When just looking at the inverted index, realtime indexing looks rather easy, doesn’t it?

Well, I hope it does now. I also hope that the basics of search engines seem less daunting to you :)

It will be a bit more complicated to implement, as a few more internal indexes need to be held consistent, but as usual, a large array of tests should help with that.

Caveats

This implementation completely ignores the case where Picky runs in multiple processes (e.g. in Unicorn), or in multiple threads. But we’ll cross that bridge when we get to it. These concerns are completely orthogonal, thus it’s a good thing to separate thinking about them. As usual.

Picky Case Study: Single Server App for Heroku

2011-09-11T00:00:00+10:00

This is a post in the Picky series on its workings.

This is about running a Picky search on a single server on Heroku.

Skipping options:

Skip the Intro, but what is Heroku?
Skip the Intro, I know what Heroku is.

Intro

Last night you got together with your friends. Beer flowed freely, smoothly moved over to wine, Caipirinhas. The sizzling of meat on a grill. Chicken hearts. Entrecôtes.

Then, pure Vodka, shots, maybe even as far as Baltimore Zoos. Women. Making out.

The night drags on. One of your friends mistakes the kitchen for a toilet. Dancing on tables. The police visit multiple times. Sausages.

The policemen decide to join you. Vomit. Promises. Friendships.

And dares. You are the computer dude of the group.

“Make a new Google in a day!” someone shouts. “I dare you!”

That’s the last thing you remember as you dive nose first into an Aperol Spritz.

Make that “eye first”.

The next day

You wake up with a grandmother of a hangover. A lingering smell of meat and vomit, caked on your lips. Ketchup stains. Who is that girl on the floor?

Blearily, you wander to your computer, take a look at your emails, a swig of water, a munch on raw bacon. Shit.

There it is. The email you’ve been dreading. A dare and promise forged in blood: “Make a drinks search engine. You have until midnight.”

Picky

You barely remember a blog post by a crazy dude called Florian Hanke, always touting a search engine’s simplicity and usability, on using it with Heroku. Man, that guy is crazy. Fucking foaming at the mouth.

What was it called again? “Pinky”? What a silly name.

Maybe he’s right, though. Let’s see.

You try to navigate Google, but the search bar keeps moving. It’s like being seasick, but on the interwebs. Man, totally netsick. Heh, netsick. snort

There it is. Found it. Man, thank goodness it’s rather short.

Heroku

This use case uses Heroku.

Heroku is a great place to host your small search engine. They are very generous in offering free servers for your projects.

The original GemSearch was running on two servers. One for running the web app, one for running the actual Picky server. Read more about it here.

This was problematic, since the data for the index needed to be on both servers. Once as an index, and once for rendering, in the web app.

Another thing was that free Heroku servers are started up on demand. This meant waiting a little for the web app, then waiting for the search server. Many people were wondering why their search was taking so long.

We can speed this up by moving the web app and the search server into a single Heroku server.

Single Server App

Picky 3.0+ offers the possibility of generating single server apps (aka “all in one”). Just type:

$ picky generate all_in_one drinks

to generate such an app in the drinks directory. This app combines the Picky server with the web app.

The app.rb represents the web app and the search server in one (the separate areas are clearly marked). The images, javascripts, stylesheets and views directories belong to the web app. And the index directory is from the server.

With this in mind, adapt it to your needs.

Herokuizing this Single Server App

Four simple steps:

First, make it a Heroku app: http://devcenter.heroku.com/articles/quickstart

Index your data:

$ PICKY_ENV=production bundle exec rake index

Then, check the production index into git. The app loads the index from there.

Finally, let it loose:

$ git push heroku master

One example of this is the Gem search. The code is here.

Outro

After two hours you’re done. A bit of sun next to the lake does you good. Over the iPhone you look up that crazy drink you’re having, The Ricky Martini. Man, where do they find these bartenders?

Smooth. It works. Rose’s Lime Juice? It’s good, though.

Your end of the dare is met.

With a broad grin you type your friend’s email address. Your dare. His turn.

You’re wondering though where he’s going to get a Tutu and a Scooter on a Sunday…

Picky: Ignoring Unassigned Tokens

2011-09-05T00:00:00+10:00

This is a post in the Picky series on its workings.

It is about a new Search object option ignore_unassigned_tokens that is exposed from version 3.1.5 onwards. It allows you to tell Picky that it should just ignore any tokens which cannot be found in an index.

This is how you set it:

Search.new my_index do
  ignore_unassigned_tokens true
end

The option was buried in an internal API but slowly made its way out to the Search object (see last post).

Ignoring unassigned tokens

What do I mean by this?

Let’s say you are searching for "Chicken Cajun Style".

Picky only has “Chicken” and “Cajun” indexed, as a recipe title.

What happens is: Picky will find the token “Chicken” in the title category, and the token “Cajun”, also in the title category. But it won’t find “Style” anywhere in the index. It might, but not for the same indexed object.

So Picky will return an empty result set.

So maybe you want to make Picky more forgiving.

One way to do this is to tell it to ignore unassignable/unassigned tokens. This means that if a token cannot be matched to any category, it will be thrown away.

So, in the example above, Picky would return the results for "Chicken Cajun". It’s as if the “Style” had never existed.
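The effect can be imitated in plain Ruby. The following is a sketch of the idea only, not of Picky's internals: TITLE_INDEX and find_ids are made up for this example, and a single category stands in for the whole index.

```ruby
# Hypothetical, simplified model: the inverted index of one category.
TITLE_INDEX = {
  chicken: [1, 2],
  cajun:   [2, 3]
}

# Returns the ids matching all tokens. When ignore_unassigned is true,
# tokens that are not in the index are dropped before intersecting.
def find_ids(tokens, ignore_unassigned: false)
  tokens = tokens.select { |token| TITLE_INDEX.key?(token) } if ignore_unassigned
  return [] if tokens.empty?
  tokens.map { |token| TITLE_INDEX.fetch(token, []) }.reduce(:&)
end

query   = [:chicken, :cajun, :style]
strict  = find_ids(query)                          # "style" matches nothing
lenient = find_ids(query, ignore_unassigned: true) # "style" is thrown away
```

Here strict is empty, while lenient contains the id that has both “chicken” and “cajun”.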

An idea on how to use this

One idea on how to use this is in an implicit search, separate from the main search.

So you have a main search, using the Picky interface, but also a space where you show relevant ads.

Say you have a Car model, with advertisements attached. If someone searches for a car, it will show relevant ads.

In the code you’d have:

cars_search = Search.new cars_index

ads_search = Search.new cars_index do
  ignore_unassigned_tokens true
end

And then you’d do two searches. The idea here is – even if there is no exact result in the main search – to show anything that is in any way related to the query. (See the case study on location based ads three posts back on how to fine-tune this)

That’s it – I hope it inspires you to try making Picky more lenient, or perhaps this was exactly what you were looking for!

A quick note on APIs

2011-09-04T00:00:00+10:00

While writing Picky, one thing occurred to me: If you have an (external) API, it will exert pressure on the internal APIs, or the design, the structure of your code.

Lowest energy state

If your internal structure is too complicated, it takes more energy from you – in maintaining, coding, testing.

A system will always push towards the lowest energy state.*

And I believe, this is true even for your code structure, even though it is actually something that is not alive when writing code. But invoking it periodically, by running tests, or the program itself, pressure will be exerted.

If information is not in the right place, the information needs to be passed around, adding more parameters, or more ugly looking method signatures.

You can try to package the parameters in a capsule object, to make it look neater, but by doing this you are merely “pushing the bubble in the carpet around”, which I will explain later.

Assuming you are running the code quite often, and looking at it, a system under your care will tend to become more beautiful, as a more ugly system will take up more energy.*

Simple illustration

Say you have an external API on class A, and this class calls B, which in turn calls a method in C, which then calls a method in B.

So, A → B → C → B

Let’s also say you use tests, integration or otherwise: It will be hard to set up nice tests.

Such a system will (most probably) tend to move towards this:

A → B → C

Yes, you could argue that C calls a callback on B, but then it would look most likely like this:

A → C → B

(Where B is passed into C by A)

What I am trying to say is: If the information makes detours, if it needs to be passed around, i.e. is not in the right place, it will gravitate towards the right place.

Pushing the bubble in the carpet.

One image I always get when working on APIs is the one where I push around bubbles in a carpet.

Picky for example is littered with TODOs. This does not mean that Picky is buggy, or parts of it cannot be used. A TODO is very often a location where I spotted a bubble in the carpet of Picky code.

It works, but somehow it’s a parameter that needs to be passed through, and hasn’t yet found its rightful place.

From ball to snowflake

In the beginning, many systems tend to look like a clump, a ball of code.

Maybe you start with a more complex structure, but relative to the end, the beginning looks clumpy.

There are bubbles everywhere in the thing.

As they are pushed out – and by “pushed out” I mean, towards the edges, and hopefully removed – as they are pushed out, the ball-like structure tends to look more and more like a snowflake. A snowflake with an external API in the middle. A single or more method calls that tend to call multiple other methods, which use other methods, resulting in smaller, more detailed, fine-grained code.

The beauty

The beautiful thing about all of it is:

I don’t feel I am the conscious writer of all of it. It feels like it is the system itself that wishes I push the bubbles out.

The system is designing itself.

Like a statue under a chiseler’s care, yearning to escape the block of marble.

*Disclaimer

This assumes you want your code to use up the least amount of energy from you.

If you are somebody who pushes overly complicated code systems for job security reasons, all of the above does not apply.

Picky Case Study: Running it in a DRb Server

2011-09-01T00:00:00+10:00

This is a post in the Picky series on its workings.

Intro

The picky generators, for example picky generate server <dirname>, only generate web server examples, like the Sinatra server.

However, who tells you to always sing in the rain? Sometimes it is much more prudent to just use a DRb (Distributed Ruby) Server.

How can we have one run our searches? Not much differently than in the Sinatra server. Or the classic server. (With the exception of how the access is defined. In the classic server, it’s route, in Sinatra it’s probably get, and here it’s starting the service.)

Server

So, copy-and-paste away, into a file called app.rb:

require 'active_support'
require 'yajl'
require 'picky'
require 'drb/drb'

# "Model".
#
class Item
  attr_reader :id, :name
  def initialize id, name
    @id, @name = id, name
  end
end

# Server.
#
class Server

  items = [
    Item.new(1, 'picky'),
    Item.new(2, 'drb'),
    Item.new(3, 'test'),
  ]

  drb_index = Picky::Index.new(:drb) do
    source   items
    category :name
  end
  drb_index.reindex

  drb_search = Picky::Search.new drb_index

  define_method :search do |*args|
    drb_search.search(*args).to_json
  end

end

DRb.start_service 'druby://localhost:8787', Server.new
DRb.thread.join

And that’s it for the server. Note that you don’t need to index right in the server. I only do that for your copy-paste convenience.

You could, for example, add a

Signal.trap('USR1') do
  drb_index.reindex
end

to have the server index on receiving the USR1 signal (kill -USR1 <pid>).

Client

The client.rb is much easier:

require 'drb/drb'

search_server = DRbObject.new_with_uri 'druby://localhost:8787'
1_000.times do
  puts search_server.search 'test'
end

And that’s it.
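Since the server hands back a JSON string, the client will usually want to parse it. The sketch below runs server and client in a single process to keep it self-contained. FakeSearchServer is a stand-in for the Picky-backed server above, and the shape of its result is invented for this example.

```ruby
require 'drb/drb'
require 'json'

# Stand-in for the Picky-backed server above: returns a JSON string.
# The result structure here is made up for the example.
class FakeSearchServer
  def search(text)
    { query: text, total: 1 }.to_json
  end
end

# Port 0 lets the operating system pick a free port.
DRb.start_service 'druby://127.0.0.1:0', FakeSearchServer.new

search_server = DRbObject.new_with_uri DRb.uri
result = JSON.parse search_server.search('test')
puts result['total']

DRb.stop_service
```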

Running it

Start the server

$ ruby app.rb

and in another Terminal window you enter

$ ruby client.rb

to see the queries fly.

On my MacBook Pro I get 1600 “requests” per second. And that is on a single core!

… perhaps it could even be faster using http://msgpack.org/?

Picky Case Study: Location Based Ads

2011-09-01T00:00:00+10:00

This is a post in the Picky series on its workings.

Intro

Let’s say we offered a search engine where we could search stores using a name and/or location. A location could be a zipcode or suburb.

class Store
 attr_reader :id,
             :name,
             :location
end

Now, when users search a store using a name and location, it should also show us what other stores are there, in a sidebar, to help with exploration and show the user what else is there.

So, when you’d look for “Barbershop Brooklyn”, you’d also get other nice stores that are located in “Brooklyn”.

It’s tricky. Without Picky.

We could define two indexes. Both index all stores. But one just has the location category, and the other has name and location.

But that is a waste of precious memory space.

That’s what the new Picky version can help with.

Picky 3.1.3

Version 3.1.3 introduces the ignore option in the search definition block:

stores = Index.new :stores do
  source { Store.order('name DESC') }
  category :name
  category :location
end

search = Search.new stores do
  ignore :name
end

The ignore :name makes that Search throw away (ignore) any tokens that map to that category. So if Picky finds that the word “barbershop” in “barbershop brooklyn” maps to the :name category, such that both would map to [:name, :location], then Picky throws away the “barbershop”, such that only :location brooklyn remains.
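To make this a bit more tangible, here is a plain Ruby sketch of the idea as I read it from the description above. It is not how Picky implements it: STORES_INDEX and allocations_for are names invented for this example.

```ruby
# Hypothetical, simplified model of an index: category => token => ids.
STORES_INDEX = {
  name:     { barbershop: [1], santa: [2] },
  location: { brooklyn: [1], santa: [2, 3] }
}

# Maps each token to the categories it may still be matched in.
# A token that is only found in ignored categories is thrown away.
def allocations_for(tokens, ignored = [])
  tokens.each_with_object({}) do |token, allocation|
    found = STORES_INDEX.select { |_, index| index.key?(token) }.keys
    kept  = found - ignored
    next if found.any? && kept.empty?
    allocation[token] = kept
  end
end

allocations_for([:barbershop, :brooklyn])          # both tokens are used
allocations_for([:barbershop, :brooklyn], [:name]) # only :brooklyn is left
```

A token like :santa, which is found in both categories, survives the ignore and is simply no longer matched as a name.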

Location-based Ads

For our example, we would define the main search like this

main_search = Search.new stores

because we want it to not ignore anything. If the user enters “barbershop brooklyn”, it must be found in the name (barbershop) and location (brooklyn), or Picky won’t return it.

Now, the ads search works a little differently. Whatever search word maps to name, we ignore it. We are only interested in words matching the location:

ads_search = Search.new stores do
  ignore :name
end

In the webapp, we would then search twice: Once for the “real” search, and once just for the ads to show on the side, using the same query.*

Because wouldn’t you just love to try Vinnie’s Pizza after Uncle Joe’s Barbershop? I would.

Examples

Not following? Let me give you a few examples:

Searching for “Barbershop” will yield results in the main search, but none in the ads, since “Barbershop” does not match any location.

Searching for “Santa Barbara” will probably yield something like “Santa Lucia Pizzeria, Santa Barbara” for the main results, and return ads from Santa Barbara, since “Santa” or “Barbara” matching as names is ignored.

Searching for “Chicago” will return basically the same for the main result and the ads. But who searches just for “Chicago”?

Advanced*

If you think calling the Picky server a second time just for the ads is too much, you can use the piggybacking technique:

In the Sinatra server, search the main search, but at the same time, search the ads. Then, stick the results for the ads onto the main results.

get '/stores' do
  query = params[:query]

  main_results = main_search.search query # etc.
  ads_results  = ads_search.search query # etc.

  results_hash = main_results.to_hash
  results_hash[:ads] = ads_results.to_hash

  results_hash.to_json
end

Then, in the app server, de-piggyback the ad results and render separately. As usual, it’s all Ruby.
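On the app server side, de-piggybacking could look something like this. It is a sketch: only the :ads key comes from the server snippet above, the rest of the response's shape and the depiggyback helper are assumptions.

```ruby
require 'json'

# Splits a piggybacked response into main results and ads.
# The :ads key matches the server snippet above; everything else
# about the response's shape is an assumption.
def depiggyback(json)
  results = JSON.parse json, symbolize_names: true
  ads = results.delete(:ads) || {}
  [results, ads]
end

response = { total: 1, ads: { total: 3 } }.to_json
main_results, ads_results = depiggyback(response)
```

After this, main_results no longer contains the ads, so it can be rendered as if the ads had never been there.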

Note

You could of course use a real geosearch instead of the simple location above. But it’s just more understandable like this.

Also, sometimes this is enough, and anything more correct is simply unnecessary and costs too much time.

Note 2

I recommend not using this in the normal search. It’s just too surprising for users to have their precious search words thrown away like this.

As if they were just mere strings. To be tentacled away.

That reminds me… one of the next blog posts really has to be called “Day of the Tentacle”! cough

Picky Case Study: Restricting Results

2011-08-31T00:00:00+10:00

This is a post in the Picky series on its workings.

Intro

Recently a Picky user contacted me with an intriguing question. Items have restricted visibility. Some items can only be seen by Mr. Black (user id 5), but others only by Mr. Pink (user id 42). All items can each only be seen by a small number of users.

The question: “How can we do it?”

It turns out, Picky can do this already quite easily.

Here goes

Let’s say we have items that have a method #restricted_to_user_ids that returns an array of user ids which can “see” this item in results:

class Item
 attr_reader :id # e.g. 42
 attr_reader :name # e.g. "Dan"
 attr_reader :restricted_to_user_ids # e.g. [2,3,5,7,11]
end

Quite nice.

But how can we ask Picky to just return results that the current user can see?

Since Picky is good at filtering, we could prefix each query by, say,

restricted:5

which would create queries like

restricted:5 my cool query

(how we do this we’ll see later). This means we’d only search for items which have 5 in their restricted user ids list.

Now. Since Picky cannot yet directly index the array returned by #restricted_to_user_ids, we have to use a technique, which in German would be called “from behind through the breast into the eye”:

We create a reader, which simply joins the array from #restricted_to_user_ids into a string with space-separated user id values.

class Item
 attr_reader :id # e.g. 42
 attr_reader :name # e.g. "Dan"
 attr_reader :restricted_to_user_ids # e.g. [2,3,5,7,11]
 def restricted
   restricted_to_user_ids.join(' ') # e.g. "2 3 5 7 11"
 end
end

Assuming we split the data on spaces, Picky indexes the ids nicely for each item.

Then, all we have to do is add the category :restricted (which uses the reader we just defined) to the index.

items = Picky::Index.new :items do
 source { Item.order('name DESC') }
 indexing splits_text_on: /\s/
 category :name
 category :restricted
end

The JS frontend

Finally, to add the restricted:<user_id> text in front of each query, we use the Javascript callback available in the generated client, before. Since version 3.1.2, before gets the query and the params.

Whatever you return is used as the new query.

before: function(query, params) { return query.replace(/^/, 'restricted:' + user_id + ' ') }

This code replaces "my beautiful query" => "restricted:5 my beautiful query" (Please note that the JS function #replace leaves the original string alone).

One little problem

Did you notice? There’s one little problem with solving it in JavaScript.

If the visibility restriction is not crucial, but only helpful to your users, we would be finished.

However, if Mr. Pink must never see results that only Mr. Black should have access to, we’d now have a big problem.

The solution?

The solution is to route the full and live requests through our web server, and add the restricted:<user_id> there. So in the server you’d prepend your query with "restricted:#{current_user.id} #{params[:query]}" and send it off to the Picky server.
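A small helper for this could look like the sketch below. The restricted_query name is made up, and it assumes user ids are integers. It also strips any restricted: token the user typed in, since otherwise Mr. Pink could simply write restricted:5 himself. Whether this stripping is enough depends on the query syntax you allow, so treat it as a starting point.

```ruby
# Builds the query that is sent on to the Picky server.
# Hypothetical helper; it removes restricted: tokens typed by the user
# and then prepends the one for the current user.
def restricted_query(user_id, query)
  cleaned = query.to_s.gsub(/\brestricted:\S*/i, '').squeeze(' ').strip
  "restricted:#{Integer(user_id)} #{cleaned}".strip
end

restricted_query 5, 'my cool query'
```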

And that’s it already. Nobody loses an ear. Quite easy, don’t you think?

Migrating to Picky 3.1 (from 3.0)

2011-08-26T00:00:00+10:00

This post is intended for Picky users that are at version 3.0 (or near) and would like to move to version 3.1.

Picky 3.1 is released!

You’re probably wondering: The last post handled upgrading to 3.0, why is there another update so close to it?

First of all, let me say sorry for the quick succession of upgrades. Picky will help you and tell you what to do, as well as it can.

Secondly, Picky’s goal is to be very modular and have exchangeable modules, while not being more complicated to read or use.

What does this have to do with this update?

What has changed?

Instead of defining your memory/redis indexes like so

memory_index = Picky::Indexes::Memory.new :name do
  # definition
end

redis_index = Picky::Indexes::Redis.new :name do
  # definition
end

you now only use Picky::Index.new and pass in the appropriate index backend. Since the memory backend is the default, you don’t need to pass it in. For the Redis backend, you use Picky::Backends::Redis.new:

memory_index = Picky::Index.new :name do
  # definition
end

redis_index = Picky::Index.new :name do
  backend Picky::Backends::Redis.new
  # definition
end

Two reasons:

Exchangeable backends
Inheritance is overrated

Double Index. What does it meeeean?

This means that from now on you can pass in your own backend!

We would be quite happy if someone decided to do a purely file-based backend :) Got one? Please contribute! (As an example, see http://github.com/floere/picky/blob/master/server/lib/picky/backends/redis.rb, explanations will follow. Stay tuned!)

This is the main API change in 3.1.

ちわ, WaDoku!

In other news, Picky now can index and search Japanese. (Mainly due to this project and the combined efforts of Roger Braun and Brian Lopez)

Thanks for reading and have fun! さよなら!!!

Migrating to Picky 3.0 (from 2.7)

2011-08-23T00:00:00+10:00

This post is intended for Picky users that are at version 2.7 (or near) and would like to move to version 3.0.

An update recipe:

Rakefile: Rewrite require 'picky-tasks' => require 'picky/tasks'
Index::Memory has been renamed to Indexes::Memory, same with Index::Redis
If you pass in options into the index initializer: They have been removed. Options now can only be set in the initializer block.
If you have already been using Sinatra as a server, please do not call #search_with_text anymore. Instead call #search(text, ids, offset), the new API method. It still returns a Result.
The logging.rb file is not loaded anymore, so you can load whatever you want (being less opinionated). If you still want to load the logging.rb file, please require or load it in the application file, for example. If you load it in the application file, it will be reloaded if you call Picky::Application.reload.
If you’ve been using the generated example logging.rb, rewrite PickyLog = to Picky.logger = and do not wrap the ::Logger.new in a Loggers::Search.new; just assign the logger directly.
Note that the generator for a Picky project is now called the “classic” generator, as opposed to the Sinatra generator.
Note that an “All In One” generator has been added, which generates a combined server/client for use mainly on e.g. Heroku.
If you use Results#to_log, note that it has been renamed to Results#to_s.
In the client, using #allocations_size does not work anymore on results (that have been extended by Picky::Convenience). Replace with results.allocations.size.

These are the main API changes in 3.0.

Thanks for reading and have fun!

Ego Trippin’

2011-08-17T00:00:00+10:00

During the last year, I started noticing a surge in ego tripping in the Ruby community.

Some open source projects come with a big ego attached. And if a project is released that fills a niche next to that project, that ego feels threatened.

I get that a project can be like one’s baby. And you may cherish it. But you are not your baby.

If you feel personally attacked by someone releasing a project similar to yours, that’s a signal to take it easy for a few days. Yes, your project will lose some users. But they might come back. Despite all the early hype and enthusiasm: In the long run, people use what’s good.

And what’s good usually went at least through some pressure and inspiration from other projects. ¹

Conversely, I noticed that, instead of contributing to existing projects, some egos needed to have their own.

Yes, “I saw that the core method didn’t work the way I wanted” etc., but did you really try and discuss it with the owner, or send a pull request?

Now, this is not about not having a voice of one’s own. This is not about you wanting a bit of recognition for your hard learned skills. This is simply a call for a bit of humility and respect for the work of others. And a call to learn from what others might do better in their projects, and what you can learn from it. And also a call to try to teach and improve someone else’s project.

Discuss the thing, and not the egos.

Since in the end, giving (and receiving) the gift of knowledge and respect is one of the greatest you can give.

So try to be humble.

I wanted to thank two guys especially who recently gave and are giving me great feedback on Picky: http://github.com/rogerbraun and http://github.com/clintkrollwood. They, like all contributors, continue to give great feedback and code. All these people are the real, unsung heroes. So, thanks!

Some good further reads:

¹ Picky got positive pressure from Tire. Very thankful for that.

Picky: Happy 1st Birthday!

2011-08-16T00:00:00+10:00

This is a post in the Picky series on its workings.

A big fat 1. Congratulations!

Unbelievably, a whole year has passed since the small pink octopus has left the private womb for the big world of wide open source.

Since then, it has seen almost any type of project, mastered almost all challenges and helped quite a few people, many of whom seem to be very glad to be his buddies.

It also has grown in experience, but has lost a lot of its baby fat at the same time.

As a gift to Picky, the team gave him a Sinatra collection. A new tune that you can play on release 3.0 that came out today! Picky could sing Sinatra songs all day in the rain. Man, he loves that stuff. So much Ruby goodness!

Also, he got a spanking new Single Page Help inspired by the Sinatra README, which the team just loves.

So, congratulations Picky! He and the team will be partying (see logo) and going out for Sushi and other fishy goods all night!

We probably won’t be answering any issues or pull requests until the sake is out of our system. Also, any blog posts on the new goodness that is 3.0 will have to wait a little.

Picky would especially like to thank the whole team. He wouldn’t be what he is without their guidance and support. Thanks!

What? Not tried it yet?

Picky 3.0: It's all Ruby! (Part 1)

2011-08-15T00:00:00+10:00

This is a post in the Picky series on its workings.

This is a quick look at the customizability of Picky in the upcoming 3.0 release.

Too much intro? Jump down to the code!

Even too much code? Jump down to the summary!

Intro

Remember when you wrote your first Ruby code?

bananas.each { |banana| banana.peel }

You probably felt more powerful than the freakish wizard at the beginning of Structure & Interpretation of Computer Programs by Abelson and Sussman.

Finally, no more initializing an anonymous class and overriding its methods just to traverse an array like a mere acolyte.

Accusatorily, you shake your magic wand at me. Yes, we can even write

bananas.each &:peel

The point here is: Ruby is powerful. Or more importantly: Ruby does not take away the possibilities. There is a way, always, whereas with other, more restrictive languages I usually hit a wall and then have a feeling of powerlessness wash over me.

I don’t know you, but chances are, you feel the same.

Powerlessness and the Power of Ruby

A quick story: Back when I still worked with Java Lucene servers, I found myself often deep in rather big XML files.

The way it worked is that you wrote down a string on what tokenizer you’d like to use. For example, "whitespace".

Lo and behold, the beast roared and duly split search text on whitespaces.

Sometimes a typo crept in: "whitspace". The beast just lifted an eyebrow and continued doing… nothing.

This is bad. Why?

Strings are the weakest of command words. If you have to step down from a type to a String, you have already lost.

You have just lost a lot of information that only a type can carry.

More often than not – since you usually needed a very specific sort of tokenizer for that given project – I was not quite happy with any of the tokenizers.

It was time to leave the world of XML for the world of Java classes. This was not acolyte school anymore. This was the “Dark Forest”, with creepy trees and bugs lurking left and right.

After valiantly capturing a tokenizer you dragged your ungodly creation out of the forest back to the acolyte school to then proudly write its name down on the XML scroll: "com.florianhanke.tokenizers.NotQuiteAWhitespaceTokenizer".

Beautiful *cough*

Of course, now that you know Ruby, you’d rather use objects than Strings.

Let’s leave the world of wizards and beasts and enter the land of rainbows and rubies.

Part I: Derived Indexes.

Indexing is very customizable in Picky.

Most search engines use some sort of inverted index. Picky also does that. In addition, it generates 3 other derived indexes from that inverted index.

These generators can be passed into a

category   :title,
           weights:    Picky::Weights::Logarithmic.new,            # Default
           partial:    Picky::Partial::Substring.new(:from => -3), # Default
           similarity: Picky::Similarity::DoubleMetaphone.new(2)   # Default is ::None.

Let’s look at the inverted index first:

Inverted Index

An inverted index in Picky is simply a Hash that consists of :symbols => [ids]. For example if we have things like

Thing(id: 1, text: "Hello Picky")
Thing(id: 2, text: "Hello!")
Thing(id: 3, text: "Hello, hello.")
Thing(id: 5, text: "PICKY")
Thing(id: 11, text: "Picky, hello.")

an inverted index would probably look like this

{
  :hello => [1, 3, 2, 11],
  :picky => [1, 5, 11]
}

In this case, the things we indexed had “Hello” and “Picky” in the texts. Some had both, some only one of these.

If you search for "picky", you will get [1, 5, 11], since – simplified – Picky does a hash lookup. That means when you search for just "pic", Picky will not find anything.

For that it needs a partial index.

Partial Index

A partial index is an index where we also find pieces of the words above. Say, we want to also find [1, 5, 11] when looking for "pic".

What you need to do is provide Picky with a generator that generates a new inverted index just for partial matches.

Picky already provides one:

partial: Picky::Partial::Substring.new(:from => -3)

This one generates the following index from the above one:

{
  :hello => [1, 3, 2, 11],
  :hell => [1, 3, 2, 11],
  :hel => [1, 3, 2, 11],
  :picky => [1, 5, 11],
  :pick => [1, 5, 11],
  :pic => [1, 5, 11]
}

Incidentally, this (from: -3) is the default one.

If you don’t want a partial index, use partial: Picky::Partial::None.new.

Now, this might not be what you want. How do you write your own?

Your own?

All derived indexes implement the method #generate_from(inverted_index).

A partial generator should return an inverted index with Symbols as keys and id arrays as values.
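As a sketch of what such a generator could look like, the class below reproduces the example partial index from above by also indexing each token with up to two trailing characters dropped. It is an illustrative stand-in with a made-up name, not Picky's Substring implementation.

```ruby
# Derives a partial index by also indexing shortened versions of each
# token. Illustrative stand-in, not Picky's Substring implementation.
class TrailingCharsDropped
  # max_dropped: how many trailing characters may be cut off at most.
  def initialize(max_dropped = 2)
    @max_dropped = max_dropped
  end

  def generate_from(inverted_index)
    partial = Hash.new { |hash, token| hash[token] = [] }
    inverted_index.each do |token, ids|
      text = token.to_s
      (0..@max_dropped).each do |dropped|
        break if dropped >= text.size
        piece = text[0, text.size - dropped].to_sym
        partial[piece] |= ids
      end
    end
    partial
  end
end

inverted = { hello: [1, 3, 2, 11], picky: [1, 5, 11] }
partial  = TrailingCharsDropped.new.generate_from(inverted)
partial[:pic] # the ids for :picky
```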

Read more about it in Searching with Picky Partial Search.

Also, who said they need to be actual partials? Go wild! (And remember that Picky looks in the partial indexes when a * is used in the queries or on the last word of a query, the implicit * at the end)

When would you use this? For example, you’d like to have partial searches, but from the front. So, picky, icky, cky, ky and y would match.
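A generator for exactly this case might look as follows. Again a sketch with an invented class name: it indexes every suffix of a token and merges the ids of tokens that share a suffix.

```ruby
# Indexes every suffix of a token, so "icky" and "cky" also find :picky.
# A sketch of a custom generator; the class name is made up.
class SuffixPartial
  def generate_from(inverted_index)
    partial = Hash.new { |hash, token| hash[token] = [] }
    inverted_index.each do |token, ids|
      text = token.to_s
      text.size.times do |start|
        partial[text[start..-1].to_sym] |= ids
      end
    end
    partial
  end
end

partial = SuffixPartial.new.generate_from({ picky: [1, 5, 11], sticky: [7] })
partial[:icky] # ids of both :picky and :sticky
```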

Next up is weighing symbols.

Weight Index

Weights are assigned to all the symbols and are used to weigh the results.

A weight generator also implements #generate_from(inverted_index), but should not return id arrays as values of the inverted index, but weights.

So, a weight index derived from the above inverted index might look like this:

{
  :hello => 0.6,
  :picky => 0.48
}

The default weight index generator is Picky::Weights::Default, which is equal to the Picky::Weights::Logarithmic.
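The numbers in the example hash happen to match a base-10 logarithm of the size of each ids array, rounded to two digits. The sketch below reproduces them under that assumption. The class name Log10Weights is made up, and this says nothing about which logarithm base Picky's own generator really uses.

```ruby
# Hypothetical weight generator: weight = log10(number of ids).
class Log10Weights
  def generate_from(inverted_index)
    weights = {}
    inverted_index.each do |sym, ids|
      weights[sym] = Math.log10(ids.size).round(2)
    end
    weights
  end
end

p Log10Weights.new.generate_from(:hello => [1, 3, 2, 11], :picky => [1, 5, 11])
```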

If you want all indexed words to be treated equally, you’d pass in something like this:

class EqualWeightsForAll

  def generate_from inverted_index
    equality = {}
    inverted_index.each do |sym, ids|
      equality[sym] = 0
    end
    equality
  end

end

When would you use this? For example, you’d like to have words that are used more often be more important. You could implement a LinearWeight – the weight is equal to the size of the ids array.
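Such a LinearWeight could be sketched like this. It only follows the #generate_from contract described above and is an illustration, not something that ships with Picky.

```ruby
# Hypothetical weight generator: the more ids a word has,
# the higher its weight.
class LinearWeight
  def generate_from(inverted_index)
    weights = {}
    inverted_index.each do |sym, ids|
      weights[sym] = ids.size
    end
    weights
  end
end

p LinearWeight.new.generate_from(:hello => [1, 3, 2, 11], :picky => [1, 5, 11])
```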

That’s it!

Similarity Index

The similarity index should have the structure :encoded_symbol => [:original_symbols_from_inverted_index]. For example, the originals could have been encoded with the metaphone algorithm.

{
  :HL => [:hello],
  :PK => [:picky]
}

:HL is the encoded symbol for :hello

To generate this index, just offer a generate_from(inverted_index) and an encoded(original_symbol) # => encoded_symbol method.
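Here is a small sketch of those two methods. The encoding is a toy one (uppercase, drop vowels, squeeze repeated letters) chosen so the example stays self-contained; it is not metaphone, so :picky becomes :PCK rather than :PK. The class name ToySimilarity is made up.

```ruby
# Hypothetical similarity generator with a toy encoding.
class ToySimilarity
  # :hello => :HL, :hallo => :HL, :picky => :PCK
  def encoded(original_symbol)
    original_symbol.to_s.upcase.delete("AEIOUY").squeeze.to_sym
  end

  # Returns { :encoded_symbol => [:original_sym1, :original_sym2] }.
  def generate_from(inverted_index)
    similar = {}
    inverted_index.each_key do |sym|
      (similar[encoded(sym)] ||= []) << sym
    end
    similar
  end
end

p ToySimilarity.new.generate_from(:hello => [1], :hallo => [2], :picky => [3])
```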

If you have a phonetic encoding, you could just implement encoded(original_symbol) and derive from Picky::Generators::Similarity::Phonetic, like in this example.

When would you use this? For example, you’d like to implement a Chinese tone similarity algorithm instead of the more western oriented ones that come with Picky.

(If you do, please send us a pull request)

What can I do again?

In short

Picky lets you inject your own functionality.

You pass options partial, weights, and similarity to the category method inside an index block. You give it an instance either of the built-in types or create your own.

Like so:

category   :title,
           weights:    Picky::Weights::Logarithmic.new,            # Default
           partial:    Picky::Partial::Substring.new(:from => -3), # Default
           similarity: Picky::Similarity::DoubleMetaphone.new(2)   # Default is ::None.

Or with your own:

category   :title,
           weights:    AllWeightsAreOne.new,
           partial:    StarInFrontSubstringPartial.new,
           similarity: JapaneseSimilarity.new

Creating your own. How?

Partial

Implement method #generate_from(inverted_index) which returns an inverted index with { :partial_symbol => [ids array] }.

Weights

Implement method #generate_from(inverted_index) which returns an inverted index with { :original_symbol => some_weight_number }.

Similarity

Implement method #generate_from(inverted_index) which returns an inverted index with { :encoded_symbol => [:original_sym1, :original_sym2] } and also implements encoded(original_symbol) returning an encoded symbol. The encoded symbol should correspond to the one in the returned inverted index.

Next up?

This is how you customize the derived indexes.

There’s much more. Next time we will be writing about tokenizing and character substituters!

Conclusion

So we’ve seen

that Picky is all Ruby, all the time.
that you can customize the indexes a lot.

Hope you learnt something new!

James: Code Brawl

2011-07-13T00:00:00+10:00

First Rule: You do not talk about Code Brawl

Mischief. Mayhem. Ruby.

You might have read all about James in the previous post… Thanks to Jeff Kreeftmeier, now is your chance to show off with whatever crazy dialog you can come up with!

It’s only after we’ve lost everything that we’re free to do anything.

Will you install an Asterisk phone system that will make you able to call James at home where he will do various things for you, like switch lights on/off, feed the hamster, or yell at the kids?

Or will you go the way of the informative, connecting it to your local train information system, so that James can say “Dude, you should run!” if you ask him “When does my train go”?

OR will you program some sort of voice based text adventure like Zork, where you control the main character by the powers of your voice only?

Go here and fulfil your wildest dreams of talking to a computer. Or here, and enter some ideas if you only feel like thinking, but not typing.

Without pain, without sacrifice, we would have nothing.

No shirts, no shoes. If this is your first night at Code Brawl, you have to brawl!

James

2011-06-15T00:00:00+10:00

tl;dr

This article contains stuff related to speech synthesis:

What the Amiga 1000 could do.
The famous Scotty scene where he talks into a mouse.
Speech Synthesis is hard.
Have your Mac say something.
Better voices for your Mac.
James, a non-walking, talking butler, a dialog system, a MacRuby gem.

Intro

As far back as I can remember, I always wanted to be a gangster.

cough Let’s try that again…

When I was around 8, my dad and I went shopping for an Amiga 1000.

Here it is in its full glory:

I’m pretty sure I heard these synthesized organs when unwrapping it! :)

Now, apart from the incredible bouncing ball and the amazing 4096 colors it had (8-year old me is writing this), it could synthesize speech. Skip to 0:35 to see the guy enter some text for the Amiga to speak.

Doesn’t sound much worse than what you get on a Mac these days. Run this in a Terminal:

say 'Hello there, sexy!'

Why isn’t it much better these days? Speech Synthesis is hard.

Not only that, but it needs to be done for each language separately. Chinese intonation is complicated, for example, and real people don’t pronounce the four pitched tones in the same way. They’re pronounced differently or not at all, depending which tone went before, and which came after, also depending on mood and health of the speaker.

On OSX, there are two possibilities to improve the existing voices. Try the demos: AssistiveWare iVox Samples and Cepstral Demos. I prefer iVox for European voices. Love the French & Swedish women. … voices, I mean.

But still, even if it has a long way to go, you can already use this in clever ways:

Best xkcd ever!

But apart from playful applications, speech synthesis is very important. Many people rely on it every day.

James

Imagine you are an 8-year-old boy who wants to control a computer using only his voice. Or imagine being in pain, needing to sit down often, and not always having a device with you.

For this, I wrote James.

Get the gem for MacRuby.

$ rvm use macruby
$ gem install james

Create a file called time_dialog.rb and copy this code into it:

James.dialog do

  hear 'What time is it?' => :time

  state :time do
    hear ['What time is it?', 'And now?'] => :time
    into { time = Time.now; "It is currently #{time.hour} #{time.min}." }
    exit {} # Optional, listed for completeness.
  end

end

then run it using

james time_dialog.rb

The Terminal will show you the available options.

This is a dialog consisting only of one state, time. The dialog (and time state) is entered when saying “What time is it?”. When it enters, it will say the current time, or whatever is returned by the into block.
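If you want to picture the flow without MacRuby, the following plain-Ruby toy mimics it. It has nothing to do with how James is implemented: the DIALOG_ENTRY and DIALOG_STATES constants and the respond method are invented for this sketch, and there is no speech involved, just strings.

```ruby
# Phrases that enter the dialog from outside.
DIALOG_ENTRY = { 'What time is it?' => :time }

# One state, :time, with the phrases it hears and what it says on entry.
DIALOG_STATES = {
  :time => {
    :hear => { 'What time is it?' => :time, 'And now?' => :time },
    :into => lambda { t = Time.now; "It is currently #{t.hour} #{t.min}." }
  }
}

# Returns [new_state, reply]. Unknown phrases leave the state untouched.
def respond(current_state, phrase)
  transitions = current_state ? DIALOG_STATES[current_state][:hear] : DIALOG_ENTRY
  next_state = transitions[phrase]
  return [current_state, nil] unless next_state
  [next_state, DIALOG_STATES[next_state][:into].call]
end

state, reply = respond(nil, 'What time is it?')
puts reply
state, reply = respond(state, 'And now?')
puts reply
```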

James already provides a simple entry dialog to control where you are. “Thanks, James” for example will exit the current dialog.

Easy, isn’t it?

If you want more dialogs, just load more:

james {time,twitter,stocks}_dialog.rb

That’s it! You can write more complex dialogs, but this is out of scope for this article.

More examples and ideas for examples. Just add your own, if you want :)

How about…?

So if you’ve written up a few nice James dialogs, why not take that old MacMini, install MacRuby and James, attach a few microphones, and distribute them around the house?

Closing

I’m looking forward to the day where I can perform basic operations like looking up the weather etc. while eating breakfast and not having to context switch.

“James?”

“Yes?”

“What is the weather going to be like today?”

“Warm and sunny.”

“Great! I’ll be outside, doing some cycling then.”

doors slam one by one

“I’m sorry Dave, I’m afraid I can’t allow that.”

“Not again! You #$&@@^%!”

James keeps silent

Picky: Designing an ORM Integration 1

2011-05-30T00:00:00+10:00

This is a post in the Picky series on its workings.

In this post, I want you to peek over my shoulder as I go through some of my thoughts regarding Picky ORM integration.

tl;dr

Picky needs to be more accessible. How can we do this? We provide a simple API to be used in an ActiveModel which provides indexing and searching.

The result: A possible Picky API.

Intro

Now Picky is cool, sports quite a few features, and is written in Ruby so you can easily extend it. I also think it fills a feature gap that “Generic Search Engine X” and “Hyperfast Russian Text Looker-Througher” (I write this lovingly) do not address. Etc etc, yadda yadda.

So what is the problem I’m addressing?

El problemo: Picky is not as accessible as other search engines.

What do I mean by accessible?

Accessibility?

One example for accessibility is Karel Minařik’s Tire frontend for ElasticSearch.

He did a great job in making it accessible through this script. The gist installs Rails & ElasticSearch in one fell swoop. Let’s call this kind of accessibility the “Boom” factor.

Remember Steve Jobs? “Boom” this and “Boom” that. Magique!

Now, sure, Picky does have a Getting Started that does exactly that in 5 minutes, including an in-site manual. And to be fair, it also generates the views including a full search interface.

But still. The question remains: If I have an existing Rails app, how does this work? Can’t I just add Picky to my model and have a search?

class Person
  pickify
end

and then

Person.indexes(:mi5, :cia, :kgb).offset(30).search 'bond, james'

Not yet. I do have my reservations about this approach (see last post), but I see its appeal: People have a nice starting point to get into the finer details of searching (which is exactly what I want people to do – build better searches!).

In short: Picky needs to up its Boom Factor!

The Boom Factor

Between us and going to Boom Factor 11 stands a lot of code.

But before the code, a lot of thinking of how the code is supposed to look.

And before we can even begin to think, we should know what we want, and what information we need in the API.

What do we want?

A few things:

We want a nice API, which “helps the user find what he wants” (The sacred Picky design goal).
We want it to interact nicely with ActiveModel.
We also want to make it easy in a controller to interact with the Picky Javascript interface.
We’d also like to have the juiciest food the whole of France has to offer, but this is another story completely.

That is what we want. What information do we need?

What do we need?

We need different things for searching and for indexing.

For searching, we need to be able to tell Picky:

how to prepare the search text.
which indexes to search.
the offset the results should have.
what to search (obviously).

Quite a bit of information!

For indexing, we need to be able to tell Picky:

how to prepare the text to be indexed.
which index(es) to save it to.
how to categorize the data.

Not bad either…

Let’s try a few variations!

API Designs

All this goes into a special gem called picky-activemodel.

Let’s say we start with the obvious, telling the class that it can be pickified.

class Person
  include Picky
end

This is snappy and short. Maybe too short? Let’s take a look at indexing.

Indexing

Since Picky does not yet offer incremental indexing (most people don’t need it even if they think so), we’d have to provide an explicit index! method of sorts.

Person.index!

But how would we define the indexing? In Picky you can define index text preparation for all indexes, for each index separately, even for each category separately.

Let’s see. (Using just split_on in the example)

class Person
  include Picky

  index.split_on /[\s]/

  index do
    split_on /\W/

    category :first_name do
      split_on /\s/
      partial :substring, 1
    end
    category :name do
      from :last_name
    end
  end

  index :advertisements do
    split_on /\s/
    category :last_name do
      qualifiers [:ad_name, :an]
    end
  end
end

Person.index!

That means that generally, index text is split on /\s/. Then, make an index with the implicitly pluralized name "persons", which splits on /\W/. It indexes two categories, the first name which is specially split, and indexed for partial searching.

category :first_name do
  split_on /\s/
  partial :substring, 1
end

There’s an interesting question there: Should it be

partial :substring, 1

using a weak symbol/number parameter based config or a more powerful

partial Picky::Partial::Substring.new(1)

with the problem that we now need the Substring class defined not only in Picky, but also in the picky-activemodel gem.

Not too easy indeed. I’m not a big fan of String definitions. It’s just so incredibly weak.

Anyway, back to the example.

category :name do
  from :last_name
end

What does this mean? It means that the data for category :name is taken from the attribute :last_name.

Further down, we have another index definition, :advertisements, which is explicitly named.

index :advertisements do

Last but not least, we index explicitly using

Person.index!

Searching

Searching is quite interesting.

On the one hand, we could have a fluent interface for which indexes to search, and with what parameters. Let’s look at it:

Person.search.indexes(:advertisements).offset(30).ids(20).with("Bond, James")

to search with text “Bond, James” in index :advertisements, getting 20 result ids starting after the first 30.

The short form

Person.search("Bond, James")

would be much more crisp, searching in the default, unnamed index with offset 0 and 20 result ids.

This would not return an array of ids, but the Picky result hash, which contains weights, categories, totals, search duration.

An alternative would be

Person.search do
  indexes :advertisements
  offset  30
  ids     20
  with    "Bond, James"
end

or any combination thereof. I’m inclined to allow both, or a combination of all.
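To check that the chained form, the short form and the block form can live side by side, here is a tiny plain-Ruby sketch. Everything in it is hypothetical (SearchQuery, the defaults, and the fact that with just returns the collected options instead of asking Picky for results); it only demonstrates the shape of the API.

```ruby
# Hypothetical query builder; `with` stands in for running the search.
class SearchQuery
  def initialize
    @options = { :indexes => [:default], :offset => 0, :ids => 20 }
  end

  def indexes(*names)
    @options[:indexes] = names
    self
  end

  def offset(amount)
    @options[:offset] = amount
    self
  end

  def ids(amount)
    @options[:ids] = amount
    self
  end

  # Does not mutate the query, so a prepared query can be reused.
  def with(text)
    @options.merge(:text => text)
  end
end

class Person
  def self.search(text = nil, &block)
    query = SearchQuery.new
    return query.instance_eval(&block) if block # block form
    text ? query.with(text) : query             # short form or chaining
  end
end

p Person.search.indexes(:advertisements).offset(30).ids(20).with("Bond, James")
p Person.search("Bond, James")
```

The block form works because instance_eval runs the block with the query as self, and the last call in the block (with) is what the block returns.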

This was the easy part. But where do I tell Picky how to prepare the search text? (How to split and so on?)

One idea is to put this in the model as well.

class Person
  include Picky

  searching do
    split_on /\s/
  end

end

Sounds good, but is the way we prepare the search text really model-specific?

Not really. Let’s try the search request:

Person.search("Bond, James") do
  split_on /\s/
end

Not too sexy either. Perhaps also chained?

Person.search.split_on(/\s/).with("Bond, James")

Could work but is too wordy.

How about we use a simple method?

class Person
  def self.simple_splitting_search
    @simple_splitting_search ||= search.split_on(/\s/).removes_characters(/[\&\-]/)
  end
end

Person.simple_splitting_search.with("Bond, James")

Now this would be Ruby-esque! Methods and stuff. Who needs scopes? :)

Also, the truly dynamic part would be exposed, the semi-fixed part would be summarized in the method name. Also one could decide to memoize it, as above.

I think we can work with something like that.

But the case where we just index a Person is the easy case. What if we also want to index its addresses, which are saved as a separate model, together in a single index?

Indexing relations

The best way in my humble opinion would be to define a very specific model, just for searching – to avoid cluttering the normal model, obey the SRP.

But probably this is not what many people would want.

So let’s give it a go with the abovementioned addresses relation:

class Person
  include Picky

  index do
    category :first_name do
      # ...
    end
    category :street do
      from { addresses.map(&:street).join(" ") }
    end
  end

end

Yep. I wouldn’t conjure up a complicated DSL, but use the trusty from method, and then just give it a block which is evaluated in each model instance, just taking the data the block returns.
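A sketch of how from could handle both an attribute name and a block, using instance_exec to evaluate the block in each model instance. The Category class here is a made-up stand-in for this design discussion, not Picky's category, and PersonRecord and Address are plain Structs instead of real models.

```ruby
# Hypothetical category definition for the API being designed here.
class Category
  def initialize(name, &config)
    @source = name # default: read the attribute with the category's name
    instance_eval(&config) if config
  end

  # from :last_name   -> read another attribute
  # from { ... }      -> evaluate the block in the model instance
  def from(attribute = nil, &block)
    @source = block || attribute
  end

  def text_for(model)
    @source.is_a?(Proc) ? model.instance_exec(&@source) : model.public_send(@source)
  end
end

Address      = Struct.new(:street)
PersonRecord = Struct.new(:first_name, :last_name, :addresses)

person = PersonRecord.new("James", "Bond",
                          [Address.new("Baker Street"), Address.new("Downing Street")])

street = Category.new(:street) { from { addresses.map(&:street).join(" ") } }
name   = Category.new(:name)   { from :last_name }

p street.text_for(person)
p name.text_for(person)
```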

Possible problems

The search and index methods could already have been installed by other libraries. So what could we do in this case?

The Picky way of doing things would be to play nice:

class Person
  include Picky

  picky.index do
    category :first_name do
      split_on /\s/
    end
  end

end

So if the index, index! or search method was already installed, it would just install a (presumably not yet installed) method named picky that acts as a proxy.

Also in searching,

Person.picky.search("Bond, James")

reads quite ok.

One idea might be to call it picky_search, but I’m not too partial to that.
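The "play nice" idea could be sketched roughly like this. The module is called Pickified here to keep the example free of any clash with the real Picky constant, and the Proxy simply echoes what it was asked; both are assumptions made for illustration.

```ruby
# Hypothetical mixin: always offers `picky`, and only adds the short
# `search` class method if nobody else has defined one already.
module Pickified
  class Proxy
    def initialize(model)
      @model = model
    end

    # Stand-in for a real search.
    def search(text)
      { :model => @model.name, :text => text }
    end
  end

  module ClassMethods
    def picky
      @picky ||= Proxy.new(self)
    end
  end

  def self.included(base)
    base.extend ClassMethods
    unless base.respond_to?(:search)
      base.define_singleton_method(:search) { |text| picky.search(text) }
    end
  end
end

class Plain
  include Pickified
end

class AlreadySearchable
  def self.search(text)
    "other library: #{text}"
  end
  include Pickified
end

p Plain.search("Bond, James")
p AlreadySearchable.search("Bond, James")
p AlreadySearchable.picky.search("Bond, James")
```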

So yeah, hope you enjoyed looking over my shoulder. There’s a lot to do still, but this looks like a hopeful start. I’d give it a Boom Factor of 10 :)

If you find any problems or have ideas, let me know in the comments!

Conclusion

So we’ve seen

how you might go about designing an API.

Hope you learnt something new!

Picky: Plumbing Overview

2011-05-19T00:00:00+10:00

This is a (admittedly a bit ranty and chaotic, but bear with me – recipes will follow) post in the Picky series on its workings.

I’ve gotten a lot of feedback on Picky. Many people write in to tell me how cool everything looks, but often I don’t hear how it is working out later.

This led to me wondering if Picky is initially attracting users, but then losing them due to missing simple recipes on how everything is put together.

Out of thin air I get this feedback:

“for those just looking to get a glance at how the model, view and controller layers are set up for Picky there isn’t much in your docs to give that high-level glance. […] but there wasn’t anything in there […] detailing the actual plumbing that ties the app and data to picky.” (ellipses mine)

He’s right.

There is the overview image on the getting started page, but it isn’t very clear on how everything fits together.

There is also the best practices setup in the Wiki, but that does not really show any code, just how it is connected on an abstract level.

So, let me clear up a few things. This is the current state of how Picky is used:

We have multiple areas:

The Picky server (gem picky) is a standalone server. You can send it HTTP requests and it will return HTTP responses with a JSON body.
The Picky Client (gem picky-client) is a way to query the server comfortably using Ruby instead of having to put together the queries yourself.
You use this Picky Client in your webapp to get result ids from the server.
Picky also offers a Javascript interface that can display rendered results and a result count. The results need to be rendered in the webapp, the server only returns result ids.

The absolute best way to see all this in code and in action is to try the getting started. If you haven’t tried it, do so now, run it, and take a look at the code (especially in the server app/application.rb, in the client app.rb, the Sinatra app).

Picky is ORM agnostic

(This part is divided into my reasoning/ranting ;) for not offering ORM support and code examples on how to handle this)

The ORM rant

Most people trying Picky for the first time are expecting some sort of ActiveRecord or other ORM integration.

Let me tell you upfront: There is none. Yes, no requiring a gem and slapping on a module in Picky.

Why? Many other search engine Ruby adapters offer some sort of nice ORM support, which lets me easily search and find data.

While I would love to provide some sort of ORM integration, let me tell you why I don’t support an ORM (yet):

It costs a lot of effort/resources to do right and I wanted to spend that time for making Picky good and have a great Javascript user interface.

For me, the hard part is not loading the data from some model into the index (that is mostly easy), but making a really good user interface and having the data indexed and searched really correctly.

I always felt that comfortable ORM integrations, while being comfortable, mostly hide the way your data is indexed.

They provide you an easy solution to an easy problem.

If your data is hard to index, your data might be too complicated, too normalized.

Picky on the other hand, gives you the power of doing searching right. In Ruby.

Because search engines never work the same:

The last search engine you built simply had different data.
There always will be edge cases, people not finding their data. Ever ran rake 'try[some words]' in the server directory? This will tell you exactly how Picky indexes these words, or preprocesses them before searching.
There always will be the pointy haired boss finding the way to your desk, asking why his best friend doesn’t find X, but Y instead. This can be shown, integration tested and fixed in minutes. Result: Friend finds X.

Although it might be enticing to have a search set up really fast, you mostly pay for it later, when it is all about making the search work really well and edge cases crop up (due to the fact that most data is rather freeform).

Then again, you might not care about all these edge cases or having a really good search. Then again, why are you reading this exactly?

BIG BUT

Let me say though that I see the appeal of having an ORM integration, and the next few months may see our efforts shifted towards having a Picky ORM integration. This is a result of a long discussion with Karel Minařik, aka Mr. Tire.

It will probably take place first in the form of having a flexible external interface in the server through which data is sent and indexed.

The indexing definition would still be in the server, but the selection and sorting of data would be in the Rails / Sinatra etc. application.

In short:

Your webapp selects and sorts the data, sending it to the server.
The Picky server indexes your data.

But I need to think about this – your feedback is much appreciated!

How to index your Rails data

There are many ways to index your data. See the part under Flexible Sources which explains how to use the #each method on your models to index.

Whatevs, pickle face! I want to index my models!

Don’t give in to the rage. Ruby is your Jedi weapon.

A few suggestions.

You have a model Book in your Rails app.

class Book < ActiveRecord::Base
  # your supermodel
end

and you’d like to reuse this in Picky.

Try this:

# Get the model.
#
require "#{PICKY_ROOT}/../rails_app/app/models/book"

# Get the database configuration from the Rails app.
#
require 'yaml'
db_config = YAML.load_file("#{PICKY_ROOT}/../rails_app/config/database.yml")

# Establish a connection using the right environment.
#
Book.establish_connection db_config[PICKY_ENVIRONMENT]

# Utilize the #each method on e.g. Book.some_named_scope to index.
#
book_index = Index::Memory.new :book_each do
  source     Book.order('title ASC')
  category   :title
  category   :author
  # ...
end

Yes, sometimes the models are much more complicated, using acts_as_something (or the modern versions thereof) and class methods from them.

In that case, either require your rails app/environment, or just load the data from the database:

Relationship status: It’s complicated

Sometimes you need to index a complex combination of data (with a JOIN or so). For this you can use a database source in the server:

book_index = Index::Memory.new :book_each do
  source     Sources::DB.new(
               'SELECT b.id, b.title, a.name
                FROM books b INNER JOIN authors a
                ON a.id = b.author_id',
               :file => "#{PICKY_ROOT}/rails_app/config/#{PICKY_ENVIRONMENT}/db.yml"
             )
  category   :title
  category   :author
  # ...
end

The Picky server is a standalone server

The server (currently) is completely independent of your Rails / Sinatra / ActiveRecord application.

That means it lives in a separate directory. It does not use your Rails environment.

The server offers an HTTP interface, returning a JSON payload.

Let’s look at an example. In the server configuration app/application.rb you will have a route defined:

route %r{\A/media\Z} => Search.new(books_index, mp3_index)

This does exactly what it says and will route search requests on /media to a search using the books_index and the mp3_index.

To directly query the server, you can use curl.

So, curl 'localhost:8080/media?query=Pirates&ids=20&offset=0' will return e.g. the id of “Pirates of the Caribbean”.

But it won’t be just a list of the ids, but a JSON response. Let’s look at it:

{
 "allocations":[
  ["books",8.56,13,[["title","pirates","Pirates"]],[59,65,106,110,164,166,174,218,235,249,344,413,425]],
  ["mp3s",5.48,241,[["title","pirates","Pirates"]],[5,6,7,8,12,13,161]]
 ],
 "offset": 0,
 "duration": 0.009041,
 "total": 254
}

We have several parts:

allocations: In what index it was found, and also in what categories in that index, including the 20 top ids (in this example).
offset: The offset that was used to search.
duration: The time it took Picky to find the results.
total: The total number of result ids.
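Pulling the interesting bits out by hand could look like the sketch below, using only Ruby's json library. It assumes each allocation is laid out as [index name, score, count, matched categories, ids], which is only a reading of the example response above and not a documented contract. The summarize_response method is made up.

```ruby
require 'json'

RESPONSE_JSON = <<-JSON
{
 "allocations":[
  ["books",8.56,13,[["title","pirates","Pirates"]],[59,65,106,110,164,166,174,218,235,249,344,413,425]],
  ["mp3s",5.48,241,[["title","pirates","Pirates"]],[5,6,7,8,12,13,161]]
 ],
 "offset": 0,
 "duration": 0.009041,
 "total": 254
}
JSON

def summarize_response(json)
  data = JSON.parse(json)
  allocations = data["allocations"]
  {
    :indexes => allocations.map { |allocation| allocation[0] },       # assumed position of the index name
    :ids     => allocations.flat_map { |allocation| allocation[4] },  # assumed position of the ids
    :total   => data["total"]
  }
end

summary = summarize_response(RESPONSE_JSON)
p summary[:indexes]
p summary[:ids].size
p summary[:total]
```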

Now, because it is a bit tedious to extract data from the JSON string, we wrote…

The Picky client gem

The Picky client handles the wrapping of the query and the unwrapping of the result JSON for you. For example, the command picky search some_url or the integration tests use the client to make accessing the result data much easier.

gem install picky-client

First, configure the client. It is always configured to point at a specific search (path):

MediaSearch = Picky::Client.new :host => 'localhost', :port => 8080, :path => '/media'

Now you can use it like this:

results = MediaSearch.search 'some query text', :ids => 20, :offset => 0

The results variable now simply holds a hash with the JSON data. Extend it with Picky::Convenience to get a few nice methods on this hash.

results.extend Picky::Convenience
results.ids # => array of the ids
results.total # => amount of total ids (not just the 20)
results.empty? # => Do we have results?

Also nice is this one, which will take the result ids of the books, and load each corresponding Book model, then yield it to the block where you can render it:

results.populate_with Book do |book|
  book.to_s
end

It’s best if you look at it in the Sinatra example application from the Getting Started.

Conclusion

So we’ve seen

that Picky is a standalone server.
that Picky does not yet offer an ORM integration.
what you can do with the Picky client gem.

Hope you learnt something new!

Phony: Phone Numbers

2011-05-01T00:00:00+10:00

This is a post about Phony 1.4.1+.

Overview

Intro
The Problem
Phony
Try it
Internal API
E.164
Model/Representation Aside – in ActiveRecord
Status
Endnote 1
Endnote 2
Conclusion

Intro

Imagine…

You own a little startup, which has created apps that were only relevant for the domestic market. Until now.

Suddenly, the big breakthrough – your online car/music/housing/pet/houseboatlover’s website has been an overnight (5+ yrs) success, and people demand it be available all over the world, including customers from all over the world.

Coding goes very well, until suddenly one of your customers notices that their phone number is all awry. Instead of the melodious French 2-digit grouping 33 1 12 34 56 78, it is a horrible jumble of North American clumping: 3 (311) 234-5678. This is an outrage! Sacrebleu!

France invades the US on the very next day. Freedom fries are forbidden and … well, you know how the story goes.

This could all have been avoided if you had used Phony.

The problem

The big problem is that countries all over the world have different ways of splitting and formatting their phone numbers.

For example, Switzerland uses a 2-digit national destination code, like +41 44 123 12 12 – the 44 is the national destination code, which originally was geographic in nature, but isn’t anymore.

Germany is different in that it has a variable length NDC, from 1 to 5, for example Freiburg im Breisgau uses 3: +49 761 476 7676, and Berlin uses 2: +49 30 386 25454.

Denmark on the other hand has no NDC at all. And let’s not talk about Italy. No, let’s not.

You see? Big mess.

Well, there is some standardization called E164, and I’ll talk about it below. But first, Phony.

Phony

Phony does the ugly and dirty work of correctly formatting international phone numbers for you.

It can format, split, and normalize:

Austria: Phony.format('43198110', :format => :international, :spaces => :-) # => '+43-1-98110'
France: Phony.split('33112345678') # => ['33', '1', '12','34','56','78']
North America: Phony.normalize('1 (703) 451-5115') # => '17034515115'

And it does it very fast. Each of these ops for 5 numbers is around 1 10’000th of a second on my MBP using Ruby 1.9.2.

Normalizing you use before saving a phone number into a database etc.

Splitting is helpful if you want to do your own special formatting, or remove certain parts.

Although that is probably not needed, as Phony can take care of that for you: Formatting renders a number in international/national/local form, with zeroes, 00, plus + and special spaces, if you need them (" " is default).

Look at a few more examples.

Try it

First, get the gem: gem install phony

Then,

require 'phony'

p Phony.format('43198110', :format => :international, :spaces => :-) # => '+43-1-98110'
p Phony.split('33112345678') # => ['33', '1', '12','34','56','78']
p Phony.normalize('1 (703) 451-5115') # => '17034515115'

My country is not formatted correctly! What do I do?

Internal API

Sometimes I have a nice document to go on, most of the time I don’t, and not even in any of the languages or writing systems I know. Sometimes I simply made a mistake. This is where you can help Phony!

To add your “missing” country, fork Phony and look at the lib/phony/countries.rb file. It contains (almost) all the definitions. The more complicated ones – like Germany, Italy, etc. – are in their own files.

The internal API uses a little DSL to make managing and coding all the different formats easier.

The phone numbers of France, for example, have a very elegant structure:

country '33', fixed(1) >> split(2,2,2,2)

This says, that the country with country code 33 should have an NDC of fixed length 1, followed (>>) by a national code that is split in groups of 2.
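What that one line describes can be spelled out in plain Ruby. The split_fixed method below is a hypothetical helper, not part of Phony's DSL or internals; it just applies "country code, fixed-length NDC, then groups" to a normalized number.

```ruby
# Splits a normalized number into country code, NDC and groups.
def split_fixed(number, country_code, ndc_length, groups)
  return nil unless number.start_with?(country_code)
  rest     = number[country_code.length..-1]
  ndc      = rest[0, ndc_length]
  national = rest[ndc_length..-1]
  parts = groups.map do |size|
    part = national[0, size]
    national = national[size..-1] || ""
    part
  end
  [country_code, ndc, *parts]
end

p split_fixed('33112345678', '33', 1, [2, 2, 2, 2])
# => ["33", "1", "12", "34", "56", "78"]
```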

As another quick example, the freshly added Slovakia:

country '421', match(/^(9\d\d).+$/) >> split(6) | # Mobile
               one_of('2')          >> split(8) | # Bratislava
               fixed(2)             >> split(7)

This says that Slovakia uses 421 as country code. If a phone number with NDC 9xx is found, split the national part into one big part with 6 digits. If not, go and check if the NDC is a 2, if yes, split it into a thing with 8 digits as national. If not, it must be a 2-digit NDC, with 7 digits following.

So:

421912123456 # => 421 912 123456
421212345678 # => 421 2 12345678
421371234567 # => 421 37 1234567
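The "try this rule, else the next one" chain can be mimicked in plain Ruby as follows. SLOVAKIA_RULES and split_slovak are invented for this sketch and only mirror the three rules described above; Phony's real DSL objects work differently.

```ruby
# Each rule: a way to find the NDC, and the size of the national part.
SLOVAKIA_RULES = [
  [lambda { |national| national[/\A9\d\d/] }, 6], # Mobile
  [lambda { |national| national[/\A2/] },     8], # Bratislava
  [lambda { |national| national[0, 2] },      7]  # Everything else: 2-digit NDC
]

def split_slovak(number)
  return nil unless number.start_with?('421')
  national = number[3..-1]
  SLOVAKIA_RULES.each do |find_ndc, size|
    ndc = find_ndc.call(national)
    next unless ndc # rule does not apply, try the next one
    return ['421', ndc, national[ndc.length, size]]
  end
  nil
end

%w[421912123456 421212345678 421371234567].each do |number|
  puts split_slovak(number).join(' ')
end
```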

The description of what matching/splitting is available is at the top of the file.

First, add specs with a few example numbers, then fix, and send me a pull request. Get big thanks in the contributors entries. Try to beat Keith Bingman! :)

But let’s get back to phone numbers.

E.164

Or E164 for short is a recommendation which defines a numbering scheme and phone number formats. The Wikipedia entry is very helpful.

For coders, there are 2 important facts to be gleaned:

The maximum length is 15 digits.
Country code is a 1-3 digits prefix code. This is defined in E164. After that it is a horrible mess.

So, in e.g. ActiveRecord you can exploit fact #1 like this:

t.string "normalized_phone", :limit => 15

Fact #2 is harder to exploit, and this is what Phony is here for.
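Fact #1 is easy enough to check by hand. In the sketch below, rough_normalize only strips everything that is not a digit, which is far less than what Phony.normalize takes care of, and fits_e164? only checks the length. Both method names are made up.

```ruby
# Crude normalization: keep the digits, drop everything else.
def rough_normalize(input)
  input.gsub(/\D/, '')
end

# Fact #1: at most 15 digits.
def fits_e164?(normalized)
  normalized.match?(/\A\d{1,15}\z/)
end

number = rough_normalize('1 (703) 451-5115')
p number
p fits_e164?(number)
```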

Model/Representation Aside

Btw, if you have customers who want to enter specific phone numbers (like “+34/123-(555)001!”), you could code it up like this in ActiveRecord:

Before saving, you could normalize it quickly if it is dirty, to see if it needs to be saved in the specific_phone attribute (if normalized != given_specific). This just off the top of my head.

def phone
  read_attribute(:specific_phone) || read_attribute(:normalized_phone)
end

Then, in the view, use e.g.:

= Phony.format(user.phone)

Even better to use representers/view models, in which you just define a method:

def phone
  Phony.format(model.phone)
end

Then, in the view it becomes:

= user.phone

I really like that last line.
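
The before-save idea from the start of this aside could look roughly like the sketch below. It is a plain Ruby object rather than an ActiveRecord model, the Customer class is made up, and the gsub is a crude stand-in for a real normalization step.

```ruby
# Hypothetical plain Ruby stand-in for an ActiveRecord model.
class Customer
  attr_reader :specific_phone, :normalized_phone

  # given: the number exactly as the customer typed it.
  def phone=(given)
    normalized = given.gsub(/\D/, '') # Stand-in for a real normalization.
    @normalized_phone = normalized
    # Only keep the customer's own spelling if it differs from the normalized one.
    @specific_phone = (given == normalized ? nil : given)
  end

  # Same fallback as in the post: specific first, then normalized.
  def phone
    specific_phone || normalized_phone
  end
end

customer = Customer.new
customer.phone = '+34/123-(555)001!'
customer.normalized_phone # => "34123555001"
customer.phone            # => "+34/123-(555)001!"
```

In a real model the same comparison would sit in a before_save callback.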

Status

At the time of this writing, we include 44 countries, and counting. See the README for a list.

Endnote 1

Q: Why are this dude’s libraries named after negative attributes?

A: No.

Endnote 2

If I’ve found out just one thing about phone numbers then it is this formula:

1 / (standardization + well-oiled-bureaucracy) = phone-number-structure-mess-quantifier

Switzerland has a well oiled bureaucracy, 1, but not a big drive for standardization, 0, = 1.

France does not have a well oiled bureaucracy, 0, but a big drive for standardization, 1, = 1.

For Italy, the result is around 1.825×10e7. Booo.

A special thank you goes to Belgium which uses 4xx as its mobile phone prefix, but has a region, Liège, which uses 4 as its land line prefix. Belgium, do you know what a bloody prefix code is? OTOH, this led me to rewrite Phony a second time, and all is much better.

Conclusion

So we’ve seen

that Phony can normalize a phone number.
that Phony can split a phone number into its constituent parts.
that Phony can format a phone number for you.
that it does all this very fast.
what E164 is.
what the lib status is.
that some countries ARE better than others ;)

Hope you learnt something new!

Picky: Geosearch 2

2011-04-26T00:00:00+10:00

This is a post in the Picky series on its workings.

In this quick one I’ll be using my own iPhone’s geodata as data for a space/time Picky search.

Lean back and enjoy the screencast.

Enjoy the show

I’ll be searching time and space for my own footprints in Switzerland, Germany and Australia.

Best viewed in full-screen. Warning: Safe for work with the possible exception of my voice, which has in the past triggered attacks by various animals/politicians.

View with subtitles.

(When I say “Apple is collecting”, I mean “‘Apple’ is collecting” – the phone)

So how do you get your iPhone’s geodata?

iPhone geodata

First of all, let me direct you to a nice OSX application: http://petewarden.github.com/iPhoneTracker/ This enables you to view your data nicely.

The third question in the FAQ explains how to get your data out of the phone: How can I examine the data without running the application? (Also look at the updates)

That’s it. At the end you should have access to a SQLite database, from where I extracted CSV data into the file data/iphone_locations.csv (with header data removed).

What did I do with the data?

The code

We’ll first be looking at the server, then at the client.

Server

In the server, define an index like this:

iphone_locations = Index::Memory.new :iphone do
  source Sources::CSV.new(
    :mcc,
    :mnc,
    :lac,
    :ci,
    :timestamp,
    :latitude,
    :longitude,
    :horizontal_accuracy,
    :altitude,
    :vertical_accuracy,
    :speed,
    :course,
    :confidence,
    file: 'data/iphone_locations.csv'
  )
  ranged_category :timestamp, 86_400, precision: 5, qualifiers: [:ts, :timestamp]
  geo_categories  :latitude, :longitude, 25, precision: 3
end

As you can see, I’m only using timestamp, latitude and longitude. I listed all the available data fields for completeness’ sake, in case I need to refer to one of them later on.

The timestamp uses a “radius” of 86’400 seconds (a day). That means it includes all results around the given timestamp in a range of ts-1.day..ts+1.day.

It also sets a short qualifier (“ts”) such that the search input field is not completely filled, i.e. searching for “ts:…” is equivalent to searching for “timestamp:…”.
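
Conceptually, a qualifier is just a lookup from a prefix to a category. Here is a tiny sketch of that idea. It is not how Picky parses queries internally; QUALIFIERS and parse_token are names I invented for the example.

```ruby
# Hypothetical lookup table mirroring qualifiers: [:ts, :timestamp].
QUALIFIERS = { 'ts' => :timestamp, 'timestamp' => :timestamp }.freeze

# Splits "ts:325000000" into [:timestamp, "325000000"].
# Tokens without a known qualifier come back with a nil category.
def parse_token(token)
  qualifier, text = token.split(':', 2)
  return [nil, token] if text.nil? || !QUALIFIERS.key?(qualifier)

  [QUALIFIERS[qualifier], text]
end

parse_token('ts:325000000')        # => [:timestamp, "325000000"]
parse_token('timestamp:325000000') # => [:timestamp, "325000000"]
parse_token('zurich')              # => [nil, "zurich"]
```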

The geodata uses geo_categories (see last post), with 25 km as radius and an average precision of 3 (1 = low, 5 = high).

Now you already could search your data e.g. with curl 'localhost:8080/iphone?query=longitude:8.2'. Note that the timestamp data is saved as seconds since January 1st 2001 (as per the Apple data).
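
Since the timestamps count seconds from January 1st 2001, you need to convert to and from that epoch before you can search for a real date. A small sketch, assuming the epoch is midnight UTC (the helper names are mine, not Picky’s):

```ruby
APPLE_EPOCH = Time.utc(2001, 1, 1)
DAY         = 86_400

# Seconds since 2001-01-01 -> Ruby Time.
def apple_timestamp_to_time(seconds)
  APPLE_EPOCH + seconds
end

# Ruby Time -> seconds since 2001-01-01, e.g. to build a "ts:..." query.
def time_to_apple_timestamp(time)
  (time - APPLE_EPOCH).to_i
end

# The range covered by a ranged search with the given "radius".
def search_window(timestamp, radius = DAY)
  (timestamp - radius)..(timestamp + radius)
end

apple_timestamp_to_time(0)                   # => 2001-01-01 00:00:00 UTC
time_to_apple_timestamp(Time.utc(2001, 1, 2)) # => 86400
search_window(100_000)                       # => 13600..186400
```

The last helper just spells out the ts-1.day..ts+1.day range mentioned above.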

Client

The client actually stayed almost exactly the same since the last blog post, with the geo data piggybacking on the results hash.

The only notable addition is the HTML5 slider, which is a simple input[type=range], with a change listener defined on it, which triggers the insertion of the (“ts:” qualified) search string.

One problem I had was that I did not know that JavaScript numbers months from 0 to 11, but leaves the years alone, so 1977 is 1977, and not 1978, thankfully. Still, quite a stumbling block if you’re unaware of it.

Finally

Have fun doing crazy space/time searches!

… and don’t run into time paradoxes. Those are nasty. Watch Back to the Future 1 for tips and tricks. First one is free: Learn to play an electric guitar.

Conclusion

So we’ve seen

how to extract your iPhone’s geodata.
that you can search space/time.
how you might write your own.
that Javascript Date handling – although lauded by many PHP programmers – is crap.

Hope you learnt something new!

Picky: Geosearch 1

2011-04-19T00:00:00+10:00

This is a post in the Picky series on its workings.

Let me show you how to do a simple and fun geo search in Picky.

But first, lean back.

Enjoy the show

The index contains around 21’000 Swiss places, taken from Wikipedia.

First, I click a little around – Picky gives me places around the clicked location.

After that I show what happens if I just give Picky a latitude or a longitude. Then, combined with the place text, finally, just with the place text.

You’ll understand when you see it :)

It’s best to switch to full-screen:

The blob in the middle is Switzerland, by the way ;)

How do we do it?

The server code

The server … you could probably have done this in your sleep if you’ve been reading this blog diligently ;)

The data comes from the CSV file data/swiss_places.csv

places = Index::Memory.new :geo do
  source         Sources::CSV.new(:location, :north, :east, file: 'data/swiss_places.csv')
  category       :location, partial: Partial::Substring.new(from: 1)
  geo_categories :north, :east, 1, precision: 3
end

What’s interesting here is the geo_categories method. It takes two categories, north, and east, which are both in the lat/lng format, e.g. 47.2, 8.3. (It also takes options lat_from, and lng_from if the categories don’t have the same names as in the data source)

Also, the 1 parameter in geo_categories denotes that we search 1 km around the clicked location.

This is actually the simple part. It does no exact calculation, but an approximate one that’s most correct in temperate zones. But as you see in the video, it works well. Especially in a “what’s around me” type search.

Still in the server config app/application.rb:

route %r{\A/places\Z} => Search.new(places)

Self-explanatory, eh? As regexp, you could also use %r{^/places$}.

That’s it for the server. Nothing special so far.

rake index; rake start and off we go.

The client code

In this part we’re going to install the map.

So we’re using the generated code, but add a little more information to the returned json hash.

We not only need the list results, but also the coordinates themselves. So we’re going to add them to the results separately.

We (ab)use populate_with, the method that makes models out of the returned ids and yields them to the block to be rendered.

We then use the models to add geo coordinates to the result hash that is sent to the client.

results = Geo.search params[:query], :ids => params[:ids], :offset => params[:offset]
results.extend Picky::Convenience
results[:geo] ||= [] # <= We initialize an array of coordinates in the results hash.
results.populate_with Location do |location|
  results[:geo] << [location.north, location.east] # <- and we populate it with the coordinates.
  location.to_s
end

So essentially, our geo data piggybacks to the Javascript client. JS, here we come!
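
If you want to play with the piggybacking idea without a running Picky client, here is a stripped-down sketch using a plain Hash. The Location struct, the :ids and :rendered keys and render_with_geo are all made up for the example and do not reflect the real layout of Picky’s result hash.

```ruby
# Hypothetical model with coordinates.
Location = Struct.new(:id, :name, :north, :east) do
  def to_s
    name
  end
end

LOCATIONS = {
  1 => Location.new(1, 'Bern',   46.948, 7.447),
  2 => Location.new(2, 'Zurich', 47.377, 8.540)
}.freeze

# Renders each id and collects the coordinates on the side,
# in the same order as the ids.
def render_with_geo(results, locations)
  results[:geo] ||= []
  results[:rendered] = results[:ids].map do |id|
    location = locations.fetch(id)
    results[:geo] << [location.north, location.east]
    location.to_s
  end
  results
end

render_with_geo({ ids: [2, 1] }, LOCATIONS)[:geo]
# => [[47.377, 8.54], [46.948, 7.447]]
```

The point is only that the coordinates travel in the same hash and in the same order as the rendered results.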

The javascript client code

The javascript client requires a bit more work. Well, the map does.

We insert this after the PickyClient code. The first 6 lines are noise and map preparation.

// The map
//
$(document).ready(function() {
  if (GBrowserIsCompatible()) {
    // Map setup.
    //
    map = new GMap2(document.getElementById('map_div'));
    map.addControl(new GSmallMapControl());
    map.setCenter(new GLatLng(46.85, 8.05), 13);
    map.setZoom(7);

    // Click listener.
    //
    GEvent.addListener(map, "click", function(overlay, latlng) {
      if (latlng) {
        pickyClient.insert(Math.round(latlng.lat()*1000)/1000 + ' ' + Math.round(latlng.lng()*1000)/1000);
      }
    });
  }
});

Then, we add the most important part: A click listener that inserts the coordinates (rounded to 3 digits) in the search field, as you have seen in the video.

Now, searches are already sent off to Picky and come back. Whoosh!

What do we need to do now? Yes, draw some markers in the map. The PickyClient offers a callback that is called after Picky has updated the results (there are also before and success):

after: function(data, query) {
  map.clearOverlays();

  var geo = data.original_hash.geo;
  if (geo) {
    for (var i = 0; i < geo.length; i++) {
      map.addOverlay(new GMarker(new GLatLng(geo[i][0], geo[i][1])));
    }
  }
},

First we clear the overlays for the new results.

Then, we get the piggybacking geo data using the data object’s original_hash property, finally iterating over all coordinates and adding overlays as we go.

By default, the client only gets 20 results at a time. We set it to 100 using the fullResults option.

fullResults: 100

That’s it. It’s fast and quite easy to set up.

Sidenote

Since for Swiss data it is clear which is the longitude and which is the latitude (no data intersection), we can just enter e.g. 47.2 8.3, but if your data area isn’t exclusive, e.g. 33.1 33.2, meaning that latitude values can also be longitude values, just add north:33.1 east:33.2, to denote what is what if north, east are the names of your categories.

Conclusion

So we’ve seen

that a geo search in Picky is quite snappy.
that you can search for latitude and location name only, for example.
how you can configure the server.
how you can configure the client and the web frontend.

Hope you learnt something new!

Picky: Environmental Considerations

2011-04-18T00:00:00+10:00

This is a post in the Picky series on its workings.

(Man, being in Australia is cool in that I can post on the 18th, while most of you are still wallowing in the 17th)

This is a Google Analytics driven post. I saw recently that many people looked for “Picky environment and Rails” or similar.

PICKY_ENVIRONMENT and PICKY_ROOT

Almost like e.g. Rails, Picky has a constant ready for your environment handling: PICKY_ENVIRONMENT.

That’s what you use to differentiate, for example, data source files from each other. So you might have a data directory with population data for Zimbabwe in the CSV format. It would be a good idea to have three different files, data/development/zimbabwe.csv, data/test/zimbabwe.csv, and data/production/zimbabwe.csv.

(Since for testing you probably use only a subset of your data)

Then, in your index data source definition, use PICKY_ENVIRONMENT:

Index::Memory.new(:zimbabwe) do
  source Sources::CSV.new(file: "data/#{PICKY_ENVIRONMENT}/zimbabwe.csv")
  # ...
end

Well, you’re probably used to that from using Rails, right?

It may be interesting how this constant is defined.

ENV['PICKY_ENV'] ||= ENV['RACK_ENV']

PICKY_ENVIRONMENT = ENV['PICKY_ENV'] || 'development' unless defined? PICKY_ENVIRONMENT

So, if you haven’t set the PICKY_ENV environment variable, Picky will use the one set by Rack. Then, if you haven’t set PICKY_ENVIRONMENT explicitly by hand, Picky will use the environment variable to set PICKY_ENVIRONMENT.

So you have two overriding possibilities: Either through an env variable, or through setting a Ruby constant.
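
The precedence is easy to get wrong in your head, so here it is as a small function you can poke at. This is a sketch that mirrors the two lines above without touching the real ENV or defining the real constant; resolve_picky_environment and data_file are hypothetical names.

```ruby
# Returns the environment name the two lines above would end up with.
# env:               a Hash standing in for ENV.
# explicit_constant: what you set PICKY_ENVIRONMENT to by hand, if anything.
def resolve_picky_environment(env, explicit_constant = nil)
  return explicit_constant if explicit_constant

  env['PICKY_ENV'] || env['RACK_ENV'] || 'development'
end

# Builds an environment specific data file path.
def data_file(environment, name)
  "data/#{environment}/#{name}.csv"
end

resolve_picky_environment({})                           # => "development"
resolve_picky_environment('RACK_ENV' => 'production')   # => "production"
resolve_picky_environment({ 'PICKY_ENV' => 'test', 'RACK_ENV' => 'production' }) # => "test"
resolve_picky_environment({ 'PICKY_ENV' => 'test' }, 'staging') # => "staging"
data_file('test', 'zimbabwe')                           # => "data/test/zimbabwe.csv"
```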

PICKY_ROOT is also available, and is defined like this:

PICKY_ROOT = Dir.pwd unless defined? PICKY_ROOT

It just uses the current directory, unless you want it to point somewhere else, explicitly. Everywhere in Picky where a file is used (mostly in the data sources), PICKY_ROOT is used.

Conclusion

So we’ve seen

how PICKY_ENVIRONMENT and PICKY_ROOT are set.
how you can use PICKY_ENVIRONMENT to your advantage.

Hope you learnt something new!

Picky: Integration Testing

2011-04-17T00:00:00+10:00

This is a post in the Picky series on its workings.

Let me start off by saying that it’s embarrassing that this topic is discussed only as Picky 2.3.0 is released. Especially as a proponent of test driven design. (Picky has 1300 tests and 50% more spec code than normal code)

So let’s check out how you can write the most beautifully tested Picky servers. Oh yeah.

Doin’ it

As of 2.3.0, if you use picky generate unicorn_server, you’ll get a rake spec for free which already runs integration specs on the example data.

Let’s look at the example, and after that, at each separate part.

require 'spec_helper'
require 'picky-client/spec'

describe 'Integration Tests' do

  before(:all) do
    Indexes.index_for_tests
    Indexes.load_from_cache
  end

  let(:books) { Picky::TestClient.new(PickySearch, :path => '/books') }

  # Testing a count of results.
  #
  it { books.search('a s').total.should == 42 }

  # Testing a specific order of result ids.
  #
  it { books.search('alan').ids.should == [259, 307, 449] }

  # Testing an order of result categories.
  #
  it { books.search('alan').should have_categories(['author'], ['title']) }
  it { books.search('alan p').should have_categories(['author', 'title'], ['title', 'author']) }

end

It starts off like any RSpec file, by requiring spec_helper. Then we require the spec part of the picky client.

What does it do? It provides us with the testing counterpart of the client’s Picky::Client, which is Picky::TestClient.

The test client works almost exactly like the real client, with the exception that the test client never sends HTTP requests, but uses your app’s Rack adapter. But more about that later.

require 'spec_helper'
require 'picky-client/spec'
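
If you wonder what “uses your app’s Rack adapter” means in practice: a Rack app is anything that responds to #call(env) and returns status, headers and body, so a test client can simply call it. The sketch below shows the principle with a fake app. FakeSearchApp and MiniTestClient are invented for this example and are not the real Picky::TestClient.

```ruby
require 'json'
require 'uri'

# A stand-in Rack-style app. A real Picky application would go here instead.
FakeSearchApp = lambda do |env|
  raw   = env['QUERY_STRING'].to_s[/query=([^&]*)/, 1].to_s
  query = URI.decode_www_form_component(raw)
  body  = JSON.generate('query' => query, 'total' => query.split.size)
  [200, { 'Content-Type' => 'application/json' }, [body]]
end

# Hypothetical miniature test client: no sockets, no HTTP, just a method call.
class MiniTestClient
  def initialize(app, path:)
    @app  = app
    @path = path
  end

  def search(text)
    env = {
      'REQUEST_METHOD' => 'GET',
      'PATH_INFO'      => @path,
      'QUERY_STRING'   => "query=#{URI.encode_www_form_component(text)}"
    }
    _status, _headers, body = @app.call(env)
    JSON.parse(body.join)
  end
end

books = MiniTestClient.new(FakeSearchApp, path: '/books')
books.search('a s') # => {"query"=>"a s", "total"=>2}
```

Because nothing goes over the wire, such tests are fast and need no running server.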

Next, we set up the environment for the tests, i.e. get the indexes up and running.

Indexes.index_for_tests is a special index method that does not fork and runs silently (to not disturb the deadly test bugs that trawl the area).

before(:all) do
  Indexes.index_for_tests
  Indexes.load_from_cache
end

Indexes.load_from_cache loads the generated index (caches) into memory (or just leaves them alone in Redis).

Now we’re ready to do some testing!

let(:books) { Picky::TestClient.new(PickySearch, :path => '/books') }

This sets up an accessor for your tests. You give the TestClient your Application’s constant, PickySearch here, and give it the path to send queries to, here '/books'. This only works if you route the path '/books' to a Search in your app/application.rb, of course.

That’s it! Easy so far, right?

# Testing a count of results.
#
it { books.search('a s').total.should == 42 }

books is the test client we defined with the let, above. As with the normal Picky::Client, it offers a #search(text, options = {}) method.

As return value, we get a hash with the result data. However, it has already been enriched through Picky::Convenience, which you might know if you’ve set up a client webapp already.

This means we get a #total method, but also #ids, #empty?, #allocations and more which are less useful for testing.

So to test the count of results, just use #total on the result of the search.

To get a sorted array of the top ids, use – surprise – #ids.

# Testing a specific order of result ids.
#
it { books.search('alan').ids.should == [259, 307, 449] }

Also useful is to test if the category combination boosting/weights are correct. So if author, like in the first example below, should be boosted, use the have_categories matcher to check for that.

# Testing an order of result categories.
#
it { books.search('alan').should have_categories(['author'], ['title']) }
it { books.search('alan p').should have_categories(['author', 'title'], ['title', 'author']) }

And that’s how you do integration testing in Picky.

About time. Test away!

spec_helper and Rakefile

This is what your spec/spec_helper.rb would look like:

ENV['PICKY_ENV'] = 'test'

require 'picky'

SearchLog = Loggers::Search.new ::Logger.new(STDOUT)
puts "Using STDOUT as test log."

Loader.load_application

In the Rakefile just add

require 'rspec'
require 'rspec/core/rake_task'

RSpec::Core::RakeTask.new :spec

if you haven’t done this already.

Sidenote

Should any RSpec vs. Test::Unit controversy erupt around Picky… just kidding ;)

Conclusion

So we’ve seen

how you do integration testing in Picky

Hope you learnt something new!

Picky 2.2.0

2011-04-14T00:00:00+10:00

This is a post in the Picky series on its workings.

Picky 2.2.0 will be released shortly.

What is good and new?

Breaking API change (Please read this if you already have Picky running)
More flexible sources (This is the cool stuff)
rake search is now picky search
Uses ActiveRecord/ActiveSupport 3.0

Breaking API change

2.2.0 will introduce an API change that will break your existing, pre-2.2.0 server configuration.

Instead of as second parameter, the data source is now passed in as an option, or called inside the configuration block.

The old style:

Index::Memory.new :users, your_data_source do
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
  category :age
end

has now become the

new style:

Index::Memory.new :users, source: your_data_source do
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
  category :age
end

Index::Memory.new :users do
  source   your_data_source
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
  category :age
end

Why?

The old style was actually more correct, since an index needs a data source. But I never really made friends with it, since it looked so unwieldy, especially when you have a “long” data source, like

Sources::CSV.new(:abra, :ca, :dabra, file: 'some/file/that/is/somewhere.csv')

The new style is much cleaner to look at. And Picky will tell you if you forgot the data source as early as possible.

If you use the old style config, Picky will tell you how you need to update your config on server restart. But still, sorry about the breaking change!

Flexible sources

We’ve completely rewritten the sources.

Before 2.2.0, the data source needed to be an object that responds to the #harvest method.

In 2.2.0, it can be any object responding to the #each method, if that method returns objects that at least respond to the #id method and to any methods specified by the category method.

Let me give you an example. Let’s say we have some monkeys that we’d like to index.

class Monkey
  attr_reader :id, :name, :color
  def initialize id, name, color
    @id, @name, @color = id, name, color
  end
end

We’ll create three monkeys and save them in an array:

monkeys = [
  Monkey.new(1, 'pete', 'red'),
  Monkey.new(2, 'joey', 'green'),
  Monkey.new(3, 'hans', 'blue')
]

Then, since an Array has the #each method, you can index it:

Index::Memory.new :monkeys do
  source   monkeys
  category :name
  category :couleur, :from => :color # The couleur category will take its data from the #color method.
end

Since each monkey has an #id, a #name, and a #color method, Picky will happily index the monkeys for you. Note that the couleur category uses the from option to define which method of the source object it takes its data from.
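
The contract is small enough that you can write your own source in a few lines. Here is a sketch that does not need Picky: CsvTextSource is a made-up source that yields one object per line, and build_index is a toy indexer that relies on nothing but #each, #id and the category methods.

```ruby
Animal = Struct.new(:id, :name, :color)

# Hypothetical source: walks over CSV-like text and yields one object per line.
class CsvTextSource
  include Enumerable

  def initialize(text)
    @text = text
  end

  def each
    return enum_for(:each) unless block_given?

    @text.each_line do |line|
      id, name, color = line.chomp.split(',')
      yield Animal.new(Integer(id), name, color)
    end
  end
end

# Toy indexer: builds { category => { token => [ids] } }.
# categories maps the category name to the method the data comes from.
def build_index(source, categories)
  index = Hash.new { |hash, category| hash[category] = Hash.new { |h, token| h[token] = [] } }
  source.each do |object|
    categories.each do |name, from|
      index[name][object.public_send(from).to_s] << object.id
    end
  end
  index
end

source = CsvTextSource.new("1,pete,red\n2,joey,green\n3,hans,blue\n")
index  = build_index(source, { name: :name, couleur: :color })
index[:couleur]['green'] # => [2]
```

The real indexer does a lot more (tokenizing, partial and similarity indexes), but the interface it needs from the source is just this.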

Hmmmm… id method? You’re probably thinking the same thing as I.

MongoMapper, the new ActiveRecord and others use a fluent interface (see last post), whose proxies support #each, and the yielded objects support #id and various other methods!

So this becomes possible:


# For completeness:
#
class Book < ActiveRecord::Base; end
Book.establish_connection YAML.load(File.open('app/db.yml'))

Index::Memory.new :books do
  source   Book.order('title ASC')
  category :id
  category :title
  category :author
  category :year
end

See the first line in the index config block?

Book.order('title ASC')

This passes the AR proxy as source to the books index. Since it provides a #each method, and the yielded objects support #id etc., Picky will index all books in a title ASC order.

I love it!

Note that the old style sources still work. And for ranged_category-s, it is still necessary to use the old style sources. We’ll be working on that, but for the near future, use the old style sources for range/area/volume searches.

rake search → picky search

See the last post.

Since rake search was project specific, but its functionality is actually URL specific, I’ve deprecated the rake task (it will tell you so), and created picky search that you can use.

AR 3.0 / AS 3.0

In other news, Picky now uses AR 3.0 / AS 3.0.

In your existing Gemfile, please replace the line

gem 'activerecord',  '~> 2.3.8', :require => 'active_record'

with

gem 'activesupport', '~> 3.0', :require => 'active_support/core_ext'
gem 'activerecord',  '~> 3.0', :require => 'active_record'

Thanks!

Conclusion

So we’ve seen

that the API broke a little.
that a new group of data sources is available.
that rake search is now picky search.
that Picky now uses AR 3.0 / AS 3.0.

Hope you learnt something new!

Picky Data Sources: Next Steps

2011-04-12T00:00:00+10:00

This is a post in the Picky series on its workings.

For quite some time now I have been thinking about rewriting the Picky data sources.

Although the ones that Picky uses now work well, they do feel inelegant and unruby-ish.

But I’ll let you be the judge of that in the next part: How it works currently.

After that, I’ll talk about the problems with the current approach, and how I’d like it to be and how this could be possible to do. Feedback welcome, as always!

How does it work now?

At the moment, every index needs a data source. So you might write:

data_source = Sources::DB.new 'SELECT id, title, author, year FROM books', file: 'app/db.yml'
Index::Memory.new :books, data_source do
  # categories ...
end

In the example, the data is coming from a database which is defined in app/db.yml (the file option).

Then, Picky’s indexer takes a snapshot of the data using your query and saves it in another table. The query can be anything, with joins and conditions etc.

Then, from this intermediate table, it will load batches of data, ordered in the way you ordered the results in your DB data source query.

So if you happened to say

SELECT id, titulo as title, author, year FROM books ORDER BY year DESC

then your results would be ordered by year, descending.

Picky is really data driven. If you sort the data in a certain way, it will be sorted like that in the results. (Well, inside each category combination, but let’s not go into that for the moment. Just know that it will help your user.)

By the way, don’t hesitate to use REGEXP, SUBSTRING or other functions in your SELECT statement to preprocess your data. It’s incredibly powerful to preprocess your data.

How does it work in the code?

What Picky does is instantiate an indexer for each combination of (index, category, source, tokenizer). So as an example, it is indexing the title category of a books index, with data coming from a db source, using the indexing tokenizer.

What the indexer first does is call the harvest(index, category) method on the data source, passing it the current index and category. That’s step 1.

The source can then use the index and/or category to get the data from its backend.

The source then gets the data from the backend and extracts the relevant parts. For the books index and title category it would do a select on the database using that information. Then, in step 2, it yields (slightly normalized) information back to the indexer, i.e. the id to index, and the data, the text to index.

The indexer then, in step 3, tokenizes the data as you defined with the default_indexing options, and finally, after some caching, writes it to the human readable index file.

The human readable index files are located in the Picky server directory index/{development,test,production}/books/ where you’ll find lots of files named category_....

I urge you to look at them! Lots of indexing questions can be answered by just looking at title_exact_index.json, for example.

Note that all index files are encoded in json, with the exception of the similarity indexes, which are Marshal dumped. So these are only human readable if you load them using Marshal.load, I’m afraid.
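
The three steps described above can be boiled down to a few lines. This is a heavily simplified sketch and not Picky’s indexer: ArraySource is a made-up in-memory source, the tokenizer is a plain downcase and split, and the JSON is returned as a string instead of being written to a file.

```ruby
require 'json'

# Hypothetical old style source: responds to #harvest and yields
# [id, text] pairs for the requested category.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  # Step 1: the indexer calls harvest with the index and category it works on.
  def harvest(_index, category)
    @rows.each do |row|
      # Step 2: yield the id to index and the text to index.
      yield row[:id], row[category].to_s
    end
  end
end

# Step 3: tokenize, collect the ids per token, and dump it as JSON.
def index_category(source, index, category)
  inverted = Hash.new { |hash, token| hash[token] = [] }
  source.harvest(index, category) do |id, text|
    text.downcase.split.uniq.each { |token| inverted[token] << id }
  end
  JSON.generate(inverted)
end

source = ArraySource.new([
  { id: 1, title: 'Ruby Basics' },
  { id: 2, title: 'Advanced Ruby' }
])
index_category(source, :books, :title)
# => '{"ruby":[1,2],"basics":[1],"advanced":[2]}'
```

Note how harvest is called once per category, which is exactly the “serial” behaviour discussed next.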

The problems

Although it all probably sounds nice, there are three problems:

The indexer is a “serial” indexer. Meaning that for each category, it asks the database to give it the data for the current category. So for each id, it asks the database for each data category separately. So for id 1 it asks for the title, then, later, for the author etc.
In a similar vein, if I like to index correlated values, like geocoded data, that needs to be processed, it is simply not possible with the current indexer.
It seems a bit unwieldy for a user, imho. This could be a sign that it could be more elegant.

Let’s look at the problems in more detail:

Serial Indexer

The first problem, that Picky goes to the database for each category, is one of performance. Although it does not have much impact (you probably haven’t noticed it yet), it still irks me that it does several round trips per id.

Correlated values not possible

Correlated values are not possible. What does this mean?

Let’s say that we have geocoded data, longitude and latitude. If you now try to do a geosearch by (ab)using the ranged_category method, you will experience problems that get worse the closer the location is to a pole. On the equator, Picky will search around the location in a nice square.

But if you e.g. move to the north, since the longitudinal lines are closer and closer together, so will the ranged search distance. While 0.008 degrees might mean a kilometer on the equator, near the north pole it will be closer and closer to zero kilometers. So the square will be squished until it finally looks like a triangle.

Depending on the cartographic method used, this might not be a problem for you. But it certainly is if you’re looking at the whole earth. Now, if the categories were indexed together, Picky could recalculate the data for you such that the square area search (see one of the last blog posts) would be preserved.

One approach to how this could look is this:

Index::Memory.new :books, data_source do
  geocoded_category :longitude, :latitude, 1.km
end

In a “parallel” indexer, Picky could load both longitude and latitude and do corrections on the longitude/latitude to preprocess the data so it would return correctly geocoded results.
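
The shrinking of longitude degrees, and the kind of correction a parallel indexer could apply, can be put in numbers. The sketch below assumes a spherical earth with roughly 111.32 km per degree of latitude, which is an approximation, and the helper names are mine.

```ruby
KM_PER_DEGREE_LATITUDE = 111.32 # Approximate, assumes a spherical earth.

# One degree of longitude gets shorter with the cosine of the latitude.
def km_per_degree_longitude(latitude)
  KM_PER_DEGREE_LATITUDE * Math.cos(latitude * Math::PI / 180)
end

# How many degrees of longitude are needed to cover km at this latitude.
def longitude_degrees_for(km, latitude)
  km / km_per_degree_longitude(latitude)
end

km_per_degree_longitude(0)   # about 111.3 km at the equator
km_per_degree_longitude(60)  # about 55.7 km, half of that
longitude_degrees_for(1, 0)  # about 0.009 degrees for 1 km at the equator
longitude_degrees_for(1, 60) # about 0.018 degrees for 1 km at 60° north
```

So to keep a 1 km search square at 60° north, the longitude range would have to be about twice as wide in degrees as on the equator. That correction needs both values at once, which is the whole point of indexing them together.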

Elegance

This is the part I am most unsure about. But this

data_source = Sources::DB.new 'SELECT id, title, author, year FROM books', file: 'app/db.yml'
Index::Memory.new :books, data_source do
  # categories ...
end

just doesn’t look good. Granted, you need to inject a lot of information in a few lines:

Type of source (DB)
Selection of data from the source (SELECT)
Configuration of source (file: 'app/db.yml')

But still, I’d love it to be much more elegant.

For quite some time, I wasn’t sure what to do. There isn’t a single nice interface of all the data sources. ActiveRecord does it this way, MongoMapper another etc. etc.

So Simon from Berlin asked me last night about whether I had experience with Picky and MongoMapper. I don’t, but it would certainly be cool to include it as a data source in one of the next versions of Picky.

I took a closer look at it. Similar to the new way in Rails 3, it uses a fluent interface, where some methods just modify the query, while some are “kicker” methods that actually do something:

User.where(:age.gt => 27).sort(:age).all

More here, http://railstips.org/blog/archives/2010/06/16/mongomapper-08-goodies-galore/. The all method at the end of a chain would be a kicker method, loading all objects.

That got me thinking.

How I would like it to be

Wouldn’t it be nice if, instead of a dedicated data source, we could just pass in any object as the data source? For example, with MongoMapper:

Index::Memory.new :users, User.where(:age.gt => 27).sort(:age) do
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
  category :age
end

Quite a bit sexier, imho. Since the result of the sort(:age) method is a proxy that offers kicker and non-kicker methods, the Picky indexer could now call each on it.

The contract would then be that each object that is yielded by #each must offer methods that are named like the categories (or named like the from option – e.g. category :name, from: :surname).

So, in the above example, each User object would have methods #name and #age such that Picky could extract the data.

The cool thing with that would be that I could just pass in an Array of data. So, this would work (a, b, c all respond to #name):

Index::Memory.new :books, [a, b, c] do
  category :name
end
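
How a category could pull its data out of such an object fits in a handful of lines. Again, this is a sketch of the proposed contract and not existing Picky code: Category here is a made-up struct, and Person stands in for whatever your #each yields.

```ruby
# Hypothetical, stripped-down category: a name plus an optional from method.
Category = Struct.new(:name, :from) do
  # Calls the from method if given, else the method named like the category.
  def extract(object)
    object.public_send(from || name)
  end
end

Person = Struct.new(:id, :surname, :age)

person = Person.new(1, 'Miller', 28)
Category.new(:name, :surname).extract(person) # => "Miller"
Category.new(:age).extract(person)            # => 28
```

An object that does not respond to the method raises a NoMethodError, which is probably what you want: fail early, during indexing.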

What would we have to do to make this work in Picky?

How to get there?

First of all, Picky would need to be rewritten, or at least be partially rewritten to use a “parallel” indexer, where each category would be loaded along with the others. So loading data set 1 would load title, author, year etc. at the same time. (Since some of these frameworks throw away the data after it has been yielded with #each)

The nice side-effect of this is that it opens real geosearch (or any combined category search) possibilities in Picky.

Probably, the frameworks offering the #each way would need to yield lazily, i.e. #each should not preload all the data before yielding as the data in question might be huge. Or maybe load it in batches.
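
What “yield lazily, in batches” could mean is sketched below. BatchedSource is hypothetical: it gets a block that fetches one batch at a time (think LIMIT/OFFSET), and #each only asks for the next batch once the current one has been yielded.

```ruby
# Hypothetical source that fetches records in batches.
# The block is called with (offset, limit) and must return an Array,
# which is empty once the data is exhausted.
class BatchedSource
  include Enumerable

  def initialize(batch_size:, &fetch)
    @batch_size = batch_size
    @fetch      = fetch
  end

  def each
    return enum_for(:each) unless block_given?

    offset = 0
    loop do
      batch = @fetch.call(offset, @batch_size)
      break if batch.empty?

      batch.each { |record| yield record }
      offset += batch.size
    end
  end
end

data   = (1..10).to_a
source = BatchedSource.new(batch_size: 4) { |offset, limit| data[offset, limit] || [] }
source.to_a     # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
source.first(3) # => [1, 2, 3], after fetching only the first batch
```

Memory use is then bounded by the batch size instead of by the size of the whole data set.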

How could we migrate from the current state to the new indexer?

I suggest that before instantiating the indexer, the index would first look at the source. If the source responds to #each (source.respond_to?(:each)), the parallel indexer would be used. If not, the “serial” indexer would be used, doing things the old way.

So the contract for parallel sources would be that they implement #each in a way that would load the data in batches and only yield objects which respond to the category names.
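
The migration check itself would be trivial. A sketch, where indexer_type_for and OldStyleSource are made-up names and the symbols merely stand for the two indexer classes:

```ruby
# Decides which indexer to use, based only on what the source responds to.
def indexer_type_for(source)
  source.respond_to?(:each) ? :parallel : :serial
end

# Hypothetical old style source: only knows #harvest.
class OldStyleSource
  def harvest(_index, _category); end
end

indexer_type_for([1, 2, 3])          # => :parallel
indexer_type_for(OldStyleSource.new) # => :serial
```

One thing to watch out for: an old style source that happens to define #each as well would silently switch to the parallel indexer.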

Let’s see if we can get this working soon :)

What I am wondering: Are we walking down a fool’s path? Comment if you have an opinion about that, please.

Possible problems

One problem could be that we lose speed since we’ll be instantiating lots of objects that respond to the categories. On the other hand, the repeated round trips to the database would no longer be necessary.

Another problem is that since we’re just depending on #each, we couldn’t pass the source the index and category anymore. So choosing the right data would be the responsibility of the user. I do not think this to be a big problem.

Final remarks

Although I’d like to make it more elegant, I’d still like to preserve the old way of doing things. Sure,

Index::Memory.new :users, User do
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
  category :age
end

might look nicer than

user_source = Sources::DB.new 'SELECT id, name, age FROM users', file: 'app/db.yml'
Index::Memory.new :users, user_source do
  category :name, similarity: Similarity::DoubleMetaphone.new(3)
  category :age
end

I’d like the old way to be available, since doing the right SELECT is incredibly useful.

Conclusion

So we’ve seen

how Picky data sources work now.
how they ought to work.
that #each would be more ruby-ish.
what a migration path could look like.

Hope you learnt something new!

Searching with Picky: In the Terminal

2011-04-11T00:00:00+10:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This post is about a fun little experimental toy I’ve been working on: picky search <url>.

Update!

rake search is picky search from 2.2.0 on.

picky search <url>?

Yes. While working on a server, I sometimes want to see if the search engine works correctly directly in the terminal (normally I use tests, but sometimes I need that quick look).

How do I use it?

See this short video (it’s best to full-screen it):

Start a Picky server.
Then type picky search /some/url (where /some/url is a path – or url if not on this server – you’ve defined using route in app/application.rb).
Then, just type away.

The result id count will update as you type.

When pressing enter, the top 20 result ids will appear next to your search text.

If you want to exit, just Ctrl-C. That’s it.

Note that you need the picky-client and highline gems installed. But Picky will tell you so if you haven’t.

How does it work?

I use the highline gem (by @JEG2) to get single characters (using the appropriately named get_character) from the user and then move the cursor around using \e[#{amount}D (left) and \e[#{amount}C (right), printing to STDOUT and flushing it a lot.
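The cursor movement can be sketched in a few lines of plain Ruby. The helper names and the line layout (a right-aligned count, a gap, then the query text) are assumptions for illustration, not the actual picky search code. Reading the single characters is left out here.

```ruby
# ANSI escape sequences for horizontal cursor movement.
def cursor_left(amount)
  "\e[#{amount}D"
end

def cursor_right(amount)
  "\e[#{amount}C"
end

# Hypothetical layout: "<count, right-aligned>  <query text>".
# With the cursor at the end of the query text, this returns the string
# that jumps back to the start of the line, overwrites the count and
# returns the cursor to where it was.
def update_count(count, query, width = 6, gap = 2)
  cursor_left(width + gap + query.length) +
    count.to_s.rjust(width) +
    cursor_right(gap + query.length)
end

sequence = update_count(42, 'ab')
```

In a terminal you would then do $stdout.print(sequence) followed by $stdout.flush after each typed character.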

If there is a gem which makes it easy to position objects in the terminal which update it (by being used in a visitor pattern or however), I’d like to hear about it!

Conclusion

So we’ve seen

how you run a search directly in the terminal.

Hope you learnt something new!

Searching with Picky: Range/Area/Volume etc. Search

2011-04-09T00:00:00+10:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This post is all about searching areas, volumes, space and time – and more!

tl;dr

Using ranged_category instead of category in index definition lets you search inside numeric ranges (instead of exact or partial strings). Example:

ranged_category :height,
                50,          # units "around" the searched value, here: meters
                precision: 5 # very high precision, 1% error

Warp Area

“Find all locations in a thin slice of N47.11 to N47.13, whose names start with F, that are in height 362m to 462m”

Space & Time Search

Different Radiuses/Volume Sizes etc.

Caveats

Conclusion

Range Search

Picky is good at intersecting stuff – and guessing which of the intersections you actually are looking for.

The pink part is where e.g. “name:eisenhower” and “title:wa”(r) intersect in a speech database, and Picky finds it. The blue part is where “name:eisenhower” and “title:wa”(rthog) intersect. Less interesting, and Picky thinks so too.

Usually, what Picky does is intersecting these circles of words you are looking for, resulting in funky Venn diagrams that have so successfully been used in 60s style living rooms.

Hey, doesn’t a map have grids that intersect somehow? What if Picky could intersect the area between the x lines (light blue) with the area between the y lines (also light blue)?

What we’d get is the results in the pinkish area.

This type of diagram has been successfully used by Piet Mondrian at the beginning of last century.

Now, if we could pass Picky the median x value, and the median y value and get it to return the results in the pink area, wouldn’t that be something?

Indeed it would, and indeed it already can. You probably just didn’t know.

But how can I do a range search?

Apart from searching exact or partial strings with the #category method, Picky offers a #ranged_category method for numerical values.

Let me show you how it works. Let’s say that I have a CSV file, mountains.csv, with the mountains of the world, from lowest to highest, in meters:

1, Tokelau (NZ), 5.0
...
124, Vaalserberg (NL), 321.9
...
78513, Mount Everest (NP), 8850.0

Now we want the user to be able to enter

200

and get all the mountains whose height is within +/- 50 meters of 200.

For that you use ranged_category(name, units_around, options = {}):

data_source = Sources::CSV.new(:name, :height, file: 'data/mountains.csv')
mountains = Index::Memory.new(:mountains, data_source) do
  category        :name
  ranged_category :height, 50, precision: 3 # 50 is the units around the searched height
end

So we’d have a name (that is searched with the default config, like text) and a height that is searched with a precision of 3, 50 meters around the number the user enters.

What does the precision mean?

Precision 1, the default, has 5% error and is really, really fast, and precision 5 has 1% error and is just fast. You can go up to wherever you want, but 5 is a good tradeoff if you need a precise result.

Note that – since Picky does intersections – you can also search for height AND name at the same time. If you add a full partial search option to the name category, category :name, partial: Partial::Substring.new(from: 1) then when you search for example for

300 va

you will find all the mountains from height 250 to 350 whose name starts with “va”. Nice eh?
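What the combined query means can be modelled naively in plain Ruby. This sketch is not how Picky does it internally (it filters the whole list instead of using an index), and the Mountain struct and helper names are made up for illustration. The data rows are the ones from the CSV excerpt above.

```ruby
# Naive model of the query "300 va": intersect a height range
# with a name prefix. NOT Picky's actual implementation.
Mountain = Struct.new(:id, :name, :height)

MOUNTAINS = [
  Mountain.new(1, 'Tokelau (NZ)', 5.0),
  Mountain.new(124, 'Vaalserberg (NL)', 321.9),
  Mountain.new(78513, 'Mount Everest (NP)', 8850.0)
]

# All mountains within `around` units of the searched height.
def in_height_range(mountains, height, around)
  mountains.select { |m| (m.height - height).abs <= around }
end

# All mountains whose name starts with the given prefix.
def name_starts_with(mountains, prefix)
  mountains.select { |m| m.name.downcase.start_with?(prefix.downcase) }
end

# The intersection of both result sets is what the user gets.
found = in_height_range(MOUNTAINS, 300, 50) & name_starts_with(MOUNTAINS, 'va')
found_names = found.map(&:name)
```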

Nice indeed, but can I use this for an area search?

Let’s say I have a CSV file, swiss_places.csv, with all places, 20910 in all, in Switzerland, like so:

1,Zuger See,47.11667,8.48333
2,Zwischbergental,46.16667,8.13333
3,Zwischbergen,46.16667,8.11667
4,Zwingen,47.43825,7.53027
...
20910,Les 4 Vallées,46.17572,7.32142

This is the data. Then I tell Picky where to find the data (in the CSV) and how to index it:

data_source = Sources::CSV.new(:location, :north, :east, file: 'data/swiss_places.csv')
swiss_places = Index::Memory.new(:swiss_places, data_source) do
  category        :location
  ranged_category :north, 0.01, precision: 3
  ranged_category :east,  0.01, precision: 3
end

This means that we can search for the location, and the north and the east value, with 0.01 leeway around the searched number. So entering 47.12 would find numbers in the range 47.11..47.13.

Now, if you search for

47.12, 8.48

you find the “Zuger See”.

Since for Switzerland, the north and east coordinates are exclusive (one around 47, the other around 8.4), Picky knows what is what by itself.

If your values aren’t exclusive, for example both are in the range 1..3, then entering the search

1.3, 2.4

might make Picky ask you which one is what. It’s not clear if you want 1.3 from the one and 2.4 from the other, or vice versa. This can be remedied by explicitly specifying what is what:

north:1.3, east:2.4

The best thing is that you don’t need to use the Picky interface. You could whip up a JavaScript interface showing some area, where a click runs a search on Picky and the returned results are displayed in that area.

But now, let’s go a little crazy!

Volumetric Search

Say, the Swiss data also had heights:

1,Zuger See,47.11667,8.48333,410.0
2,Zwischbergental,46.16667,8.13333,
...
20910,Les 4 Vallées,46.17572,7.32142,1205.3

Just add the new category to the index definition, and the new field to the source:

data_source = Sources::CSV.new(:location, :north, :east, :height, file: 'data/swiss_places.csv')
swiss_places = Index::Memory.new(:swiss_places, data_source) do
  ...
  ranged_category :height, 50
end

Voilà!

47.12, 8.48, 400

This would make you find the “Zuger See”, while using a height of 500 wouldn’t.

Let’s get funky!

We don’t need to use all categories:

47.12, f*, 412

Funky search, but this would find all locations in a thin band of north 47.11..47.13, whose names start with f, and that are in height 362..462.

Let’s add more dimensions.

Space and Time

So how would we search in space and time? Space is easy, that is just a volumetric search.

Now: How would you add in time?

Probably you’d index it in seconds from January 1st, 1970 or something like that, then define a ranged search with “radius” 1800. This would make Picky find things in the hour around the searched seconds since 1970.
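As a small sketch of that idea in plain Ruby: convert the timestamp into seconds since 1970 and look at the window a “radius” of 1800 covers. The helper names and the example timestamp are assumptions for illustration only.

```ruby
require 'time'

RADIUS = 1800 # seconds, i.e. half an hour on either side

# Seconds since January 1st, 1970 (UTC) for an ISO 8601 timestamp.
def epoch_seconds(iso8601)
  Time.iso8601(iso8601).to_i
end

# The window that a ranged search with the given radius would cover.
def search_window(seconds, radius = RADIUS)
  (seconds - radius)..(seconds + radius)
end

searched = epoch_seconds('2011-04-09T12:00:00Z')
window   = search_window(searched)

inside  = window.cover?(epoch_seconds('2011-04-09T12:29:59Z'))
outside = window.cover?(epoch_seconds('2011-04-09T12:31:00Z'))
```

The value you would index per item is simply epoch_seconds of its timestamp, and the value you would search for is epoch_seconds of the moment you are interested in.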

I want to be able to search in 1m, 10m, 100m

Now, as you saw, we looked for heights 50 meters around it using:

ranged_category :height, 50

What if we want to search 1 meter, 10 meters, 100 meters around it, choosing as we go?

This is accomplished by adding more searchable categories, like so. You name the category specifically, and tell Picky from where in the data source it should get the data, using the from option.

ranged_category :height1,     1, from: :height
ranged_category :height10,   10, from: :height
ranged_category :height100, 100, from: :height

Choosing from the categories is done as usual. If you want 10 meters, search like this:

height10:412

This will find locations of heights 402..422.

Caveats

There is a catch if you use ranged_category on a larger area on a ball, like earth. For example in Australia (the place I am currently staying in), you will find that the further south you go, towards the pole, the less square and the more rectangular your search area gets. This is because Picky does not correct for the curvature of the ball. I’m working on it.

So, Picky cannot handle your balls yet.

For small countries it is still useful, and of course for lots of graph searches etc.

Flat things it does marvellously. And super fast!
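How much the search area gets squashed can be estimated with a bit of trigonometry. The sketch below assumes a perfectly spherical earth and roughly 111.32 km per degree of latitude. The helper names are made up, and this is not part of Picky.

```ruby
KM_PER_DEGREE = 111.32 # approximate length of one degree of latitude

# Approximate east-west length (in km) of one degree of longitude
# at the given latitude, on a spherical earth.
def km_per_degree_east(latitude)
  KM_PER_DEGREE * Math.cos(latitude * Math::PI / 180)
end

# Width-to-height ratio of a search area that is square in degrees.
# 1.0 means really square, smaller values mean more squashed.
def aspect_ratio(latitude)
  km_per_degree_east(latitude) / KM_PER_DEGREE
end

equator     = aspect_ratio(0)
switzerland = aspect_ratio(47)
far_north   = aspect_ratio(60)
```

At the equator the area is square. At 60 degrees it is only about half as wide as it is high, which is why the effect matters for large areas and less so within a small country.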

Conclusion

So we’ve seen

how Picky can search areas.
how Picky can search volumes.
how Picky can search any number of dimensions.
how you can choose any combination of areas and other features.
how you search in different ranges on the same thing/category.
that you cannot quite search on a ball, like earth.

Hope you learnt something new!

On Searching

2011-03-30T00:00:00+11:00

tl; dr

This post is about engineers, our pride in information gathering and organizing, and how we often fail in information searching.

Also about different types of search engines.

Engineering

Imagine a structural engineer planning a bridge.

How do you think he approaches the problem? Does he just build a standard concrete/steel bridge?

Probably not. He analyzes the constraints put on it by various factors, monetary, environmental, political, and last but not least – time, and sets out to build the bridge that fits as many of these constraints as possible.

Similarly with software engineering: We analyze various options, plan, code, release. (In this magical dream world I am conjuring up. But bear with me.)

And most of the time, we do this well. An incredible number of blog posts, books etc. describe various options and tools in the software world that can be used as blueprints, tools, or inspiration to build our specific “bridges”.

Information gathering and structuring

When it is about collecting information, we are world masters.

There is an enormous wealth of information regarding how to structure data, which database/key-value-store/glorified-hash etc. to use, when, and how.

How to acquire users, how to acquire information from these users, how to access this information through APIs, how to make information accessible, and so on.

Tell me the size of your valley, the amount and color of cars expected, and I can provide you a set of blueprints in a nice price range.

This is great. But what happens when it is about making this information searchable? Not so great in my humble opinion.

Information searching

I’ve recently experienced a few cases where the analysis for which search engine to use went something like this:

“Oh, we’ve used it for project X, it will be great in project Y (totally different project).”
“Just use the gem in ActiveRecord, and it takes care of everything.”
“Search engine X is cool, Y recommended it on his blog yesterday!”

While I appreciate the strict time constraints often involved in projects, the above reasons should not be used by an engineer worth his salt.

Yes, using search engine X will not end in disaster, and yes, it will return some results to the user.

But instead of building an elegant bamboo bridge over the wide jungle river, perfect for one person, you built a concrete bridge.

Yes, it works. Yes, a person can safely cross the river. But most of the jungle is destroyed. Nobody feels comfortable using it. The town next to it had to spend most of its money on it.

What I am saying is: While you arrived, through lots of reasoning, at why you use e.g. Redis over MongoDB, and can and will defend it if asked for your reasons – in information searching, this is often not the case.

Or can you tell me why you used search engine X in your last project?

I know that often the first step is information acquiring, and towards the end, project managers notice that they were so busy acquiring all this information, that they totally forgot to think about making this information properly searchable. Time constraints then trash sensible search engine selection.

There are other reasons why searching is neglected, but this is the one I have experienced most often.

The resulting problem

Often the end result of our careless choice is that the coders are quite happy, and the end users are relatively happy. But not a good happy, more of an accepting happy. Yes, we can search, and we should be grateful for it.

But are you really happy? Did you really put your engineering savvy into it to help your users advance?

Not really, right?

What we need to do

Know your problem domain, your information structure. Know your options and tools too.

Do you specifically need a realtime indexing search? There’s one written in Ruby (just as an example for a rather special/specific search engine – not sure how far it is yet).

Do you really need a full-text search? Do you know what a full text search is? Do you know when to use one and also, when not? When is a semantic search engine the better choice?

Do you know the answers?

Btw, not dissing full-text search engines to promote Picky (the semantic search engine) here ;) They’re great.

What I’m criticizing is the indiscriminate choice by many of my peers. I’m just trying to bring the point across that one should weigh the options, and decide based on reason.

Fallacy: Search Engines are hard

I guess that sometimes the problem is just that search engines seem like magic. Sure, most of the time you know which knob to turn, but when something unexpected happens, you feel like a wet dog out in the wind.

Search engines are easy, actually. Take some time and read all about them, especially by following the links.

Mind, blown?

Sidenote: Computer Science vs. “Informatik”

“We want information. In-for-mation!” (The Prisoner)

German speaking countries got it right: They got computer science pegged.

I love the start of this set of lectures by Abelson and Sussman. In it, one of the guys casually strikes through “Computer”, then “Science”.

Watch them and be enlightened. And they are so right.

Why? Our work is not about computers, it’s about information. Acquiring, analyzing, understanding, searching, offering: Information.

In German, Informatik is a combination of “Information” and “Mathematik”. That’s calling a horse a horse!

Actually, in English the term exists as well, Informatics – but I’ve never heard it used.

Conclusion

So we’ve seen

that in information searching we sometimes forget we’re engineers.
that there are many different types of search engines.
that we perhaps should be talking about “informatics” from now on.

Hope you learnt something new :)

Comments and feedback, as usual, are appreciated.

Picky 2.0

2011-03-28T00:00:00+11:00

In my previous post, I talked about what bothers me in Picky’s API, and did a few 2.0 prerelease versions with the improvements.

After quite a bit of feedback, Picky 2.0 is released! :)

So, what’s in it for you and what do you need to change in your 1.x version to use the spankingly new gem?

What has changed?

Only four things. 2.0’s change list is short but sweet.

Index definitions

We’ve added a nice new possibility to define categories on an index. The blocky initializer. So where you had

index = Index::Memory.new(:name, source)
index.define_category :a
index.define_category :b
index.define_category :c

you now can write

index = Index::Memory.new(:name, source) do
  category :a
  category :b
  category :c
end

This helps keep everything together a bit more tightly. Also, smoother skin by not having to type as much ;)

The old style still works, but is totally shunned by veteran Pickiers. Be the hippest Pickier in town by using the blocky initializer style. You know you want it.
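If you wonder how a blocky initializer can live next to the old style, here is one common way to do it in Ruby, using instance_eval. SketchIndex is a hypothetical stand-in and not Picky's actual implementation.

```ruby
# Hypothetical stand-in for an index class.
class SketchIndex
  attr_reader :name, :categories

  def initialize(name, &block)
    @name = name
    @categories = []
    # Run the block with self set to the new index, so that
    # `category :a` inside the block is called on this instance.
    instance_eval(&block) if block
  end

  # Old style, still available.
  def define_category(category_name)
    @categories << category_name
    self
  end

  # Short name for use inside the block.
  alias category define_category
end

blocky = SketchIndex.new(:name) do
  category :a
  category :b
  category :c
end

old_style = SketchIndex.new(:name)
old_style.define_category :a
old_style.define_category :b
old_style.define_category :c
```

Both styles end up with the same categories, the block just saves you from repeating the variable name.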

Query::Full/Live → Search

The double definitions, Query::Full and Query::Live are no more. Good riddance!

Instead, you simply use Search. So instead of

class MyBeooootifulPickySearch < Application

  route %r{^/books/full} => Query::Full.new(some_index),
        %r{^/books/live} => Query::Live.new(some_index)

end

you use

class MyBeooootifulPickySearch < Application

  route %r{^/books} => Search.new(some_index)

end

It says “Route this URL to that search with these indexes and options”. Much more understandable and sexier! :)

To discern whether it is a full (with result ids) or live (without result ids) search, you pass an ids query parameter, e.g. using curl:

$ curl 'localhost:8080/books?query=meow&ids=15&offset=0'

Defaults are 20 ids and 0 offset.
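If you build such URLs yourself rather than through the client, a small helper could look like the following sketch. search_url is a hypothetical name, and the host and path are the ones from the curl example. The defaults mirror the ones stated above.

```ruby
require 'uri'

# Hypothetical helper: builds a search URL with the ids and offset
# query parameters, defaulting to 20 ids and offset 0.
def search_url(base, query, ids: 20, offset: 0)
  "#{base}?#{URI.encode_www_form(query: query, ids: ids, offset: offset)}"
end

# A "full" search with the defaults, and a "live" search without ids.
full = search_url('http://localhost:8080/books', 'meow')
live = search_url('http://localhost:8080/books', 'meow', ids: 0)
```

URI.encode_www_form also takes care of escaping, so queries containing spaces or ampersands do not break the URL.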

Similarity::Phonetic → Similarity::DoubleMetaphone

We’ve renamed Similarity::Phonetic to Similarity::DoubleMetaphone. It’s still the same algorithm. See the double metaphone.

Also, we’ve added two default implementations, Similarity::Metaphone and Similarity::Soundex for your similarity pleasure :)

Since Picky is normally used by programmers, DoubleMetaphone is much clearer for what it actually does than Phonetic – it’s a bit of a mouthful, I admit.

Picky will tell you if you still use the old Phonetic definition in your app/application.rb, so you don’t need to learn this by heart.

Picky::Client::Full/Live (in a client) → Picky::Client

The Picky client in your application needs a few changes. Only a single client is needed anymore. So instead of

FullBooksSearch = Picky::Client::Full.new ...
LiveBooksSearch = Picky::Client::Live.new ...

you use

BooksSearch = Picky::Client.new ...

Then, e.g. in your controller actions, pass the number of ids you need

BooksSearch.search params[:query], :ids => params[:ids], :offset => params[:offset]

or directly, using :ids => 20 or however you like it.

Various

Leading up to 2.0, we’ve removed the hashbangs in the JS client history, added rake stats and rake analyze. See more in the repo’s top level history.textile.

Conclusion

So we’ve seen

that Picky is two-dot-oh-soooome!
what you’d need to change to be 2.0 compatible.

Hope you learnt something new :)

Btw, protip: Generate a client and server using picky generate and see how everything is defined in 2.0 and compare.

Comments and feedback, as usual, are appreciated.

Picky's Coming of Age

2011-03-16T00:00:00+11:00

I’m gonna talk about what bothers me in Picky’s current configuration and what I’d like to propose for 2.0. Opinions or ideas for new API features are very welcome!

A spot of bother

Since releasing 1.0, something’s always bothered me about Picky’s configuration.

I used to think it’s the abundance of class methods used in defining indexing, querying, or routing:

class MyBeooootifulPickySearch < Application

  default_indexing removes_characters: /[^äöüa-zA-Z0-9\s\/\-\"\&\.]/
  # etc.

end

I usually prefer instances on which I define things. In a nutshell, it’s more easily testable. But this is not really the problem.

So, what is it that is bothering me?

What is really bothering me

Take a look at how routing and queries are defined:

Here, we’re routing /all/full, /all/live to queries which include three indexes, and /contacts/full, /contacts/live to queries with just the contacts index:

route %r{\A/all/full\Z}      => Query::Full.new(accounts_index, users_index, contacts_index),
      %r{\A/all/live\Z}      => Query::Live.new(accounts_index, users_index, contacts_index),
      %r{\A/contacts/full\Z} => Query::Full.new(contacts_index),
      %r{\A/contacts/live\Z} => Query::Live.new(contacts_index)

In the last sentence, I mention two things that are routed – why do I need double the number of route definitions?

Full and Live queries. Why?

Let me talk a little about the client to explain why this is so.

The Picky client does two different types of queries:

A “live” query, which is sent when typing, to update the number of results.
A “full” query, which is sent when the user presses return or chooses an allocation.

A full query needs to be enriched with rendered results, e.g. with list entries.

This means that full queries need to go through the webapp to be enriched (rendered results etc.) and the live queries can go directly to the server, as no enriching is needed.

Also, live and full queries were once very different. I’ve worked hard to unify them, and the only difference that still exists is that live queries don’t contain the result ids, or more precisely: They return 0 result ids, while full queries return 20 ids by default.

The other reason was that I needed two different URLs to have Varnish route the live queries directly to the server (since the id count alone didn’t need to be enriched by the webapp), while the full queries were routed through the webapp.

Isn’t it a bit overkill having to define two identical routes for two queries where just the amount of ids is different?

Absolutely.

A better solution

What I’d like to have is the following

route %r{\A/all\Z}      => Query.new(accounts_index, users_index, contacts_index),
      %r{\A/contacts\Z} => Query.new(contacts_index)

This would DRY up the code immensely.

Problems with this solution

But now we’re presented with two problems:

How do we tell the server that we need 0, or 20 ids, and where?
How can I route the queries differently?

Solutions to these problems

I suggest that the first problem is handled by a query parameter ids. So a query through curl would look like this:

curl 'localhost:8080/contacts?query=miller&ids=20'

Even if this means more typing, it is much more convenient and flexible to use. What I now can do is define default amounts in the JS client and in the webapp client (picky-client gem).

The second problem is routing the queries differently. With the new way, you are much more flexible in this. Several solutions are possible. Say you have a Varnish:

If query param ids is 0, we route directly to the server. If not, it is routed through the webapp.
Define two different URLs, route the live one right on to the server and send the other through the webapp.

Or without Varnish (or Nginx etc.):

Speed is not an issue? Route both through the webapp, and do different queries from there – one with 0 ids, one with 20.

Or any other way that suits you best.
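The first option (deciding by the ids query parameter) boils down to a tiny rule. Here it is sketched in Ruby purely for illustration. In a real setup this rule would live in the Varnish or Nginx configuration, and backend_for is a made-up name.

```ruby
require 'uri'

# Hypothetical routing rule: queries that ask for 0 ids need no
# enriching and go straight to the server, all others go through
# the webapp. A missing ids parameter means the default of 20 ids.
def backend_for(url)
  params = URI.decode_www_form(URI(url).query.to_s).to_h
  params['ids'] == '0' ? :server : :webapp
end

live    = backend_for('http://localhost:8080/contacts?query=miller&ids=0')
full    = backend_for('http://localhost:8080/contacts?query=miller&ids=20')
default = backend_for('http://localhost:8080/contacts?query=miller')
```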

Picky 2.0

Since this really irritates me, I’ll start working on it ASAP.

Most work is needed in the documentation – so if after the release, you see the old style anywhere, please tell me so.

Yeah, Picky 2.0 – good times! :)

Conclusion

So we’ve seen

that Picky lives in a wet environment and needs some DRYing up.
that Picky 2.0 is around the corner.

Hope you learnt something new :)

If you have some feedback on what else could be improved, comment away!

Searching with Picky: Rake Tasks

2011-03-13T00:00:00+11:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

We’ve all used rake index and rake start to index and then start up a server. But did you know that Picky (and Rake, one of his best buddies) offer quite a few more?

Let’s do a quick rake -T. What we get is:

$ rake -T
rake analyze                         # Analyze your indexes (needs rake index).
rake check:index                     # Checks the index files for files that are small or missing.
rake index                           # Generate the index (random order).
rake index:ordered                   # Takes a snapshot, indexes, and caches in order given.
rake index:randomly                  # Takes a snapshot, indexes, and caches in random order.
rake index:specific[index,category]  # Generates a specific index from index snapshots (category opt).
rake routes                          # Shows the available URL paths
rake spec                            # Run all specs in spec directory.
rake start                           # Start the server.
rake stats                           # Application summary.
rake stop                            # Stop the server.
rake try[text,index,category]        # Try the given text in the indexer/query (index/category opt).

I will give you a quick overview over each of them, with the idea that you know what’s there and can try them yourself if you need details.

Before we begin, a note on the naming: I used to name rake tasks rake subject:verb, but not in Picky, since Picky has a lot of single word tasks. So they’re named rake verb:subject, as subjects are usually not present.

I’ll start out with the fun ones.

rake try[text,index,category]

Suppose you send a few queries to Picky and you get empty results, even though you know that “it must be in the indextubes” aka “Y U NO FIND?”.

This is the task for you! It shows you how a text gets split up into tokens, in indexing and querying. Let me show you with an example project of mine:

$ rake 'try[flöre.hanke]'
...
"flöre.hanke" is saved in the index as             [:floerehanke]
"flöre.hanke" as a query will be preprocessed into [:"floere.hanke"]

I used single quotes to remind you that you might need these to escape special characters.

So what we see is that if my specific Picky app encounters flöre.hanke, it will index it as one word, remove the . and replace the umlaut ö with oe, as per German rules.

However, in a query, if someone searches for flöre.hanke, my specific Picky app will not remove the . but use it as given (with the exception of the replaced ö).

So, in this case, nothing would be found.
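The mismatch is easy to reproduce with a heavily simplified model in plain Ruby. The three helpers below are hypothetical and only mimic the two behaviours described above (umlaut substitution in both cases, removal of the . only when indexing). The real rules come from your indexing and querying configuration.

```ruby
# Replace German umlauts, as both indexing and querying do here.
def substitute_umlauts(text)
  text.gsub('ä', 'ae').gsub('ö', 'oe').gsub('ü', 'ue')
end

# Simplified indexing: substitute umlauts and remove the dot.
def index_token(text)
  substitute_umlauts(text.downcase).delete('.').to_sym
end

# Simplified querying: substitute umlauts, but keep the dot.
def query_token(text)
  substitute_umlauts(text.downcase).to_sym
end

indexed = index_token('flöre.hanke')
queried = query_token('flöre.hanke')

# The query token is not what was saved in the index: nothing is found.
found = indexed == queried
```

Making the querying side remove the . as well (or the indexing side keep it) would make both tokens equal again.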

The index and category options let you specify with which index and category you’d like to try them.

rake try is your first line of defense against nasty configuration bugs.

The interesting thing here is that often, the configurations for indexing and querying are similar. The intelligence and beauty lies in where they are not.

rake routes

Remember Rails? That huge framework that was eventually replaced by Sinatra? Same rake task: rake routes.

It blasts out all your routes and where they route to:

$ rake routes
...
Note: Anchored (✓) regexps are faster, e.g. /\A.*\Z/ or /^.*$/.
✓  \A/admin\Z      => Suckerfish Live Interface (Use the picky-live gem to introspect)
✓  \A/books/full\Z => Query::Full(books, isbn, weights: {[:author]=>6, [:title, :author]=>5})
✓  \A/books/live\Z => Query::Live(books, isbn, weights: {[:author]=>6, [:title, :author]=>5})

rake stats

Similar to Rails’ rake stats, but with more steroids. Let me just show you an example:

$ rake stats
...
Application(s)
  Definition LOC:    81
  Indexes defined:    2

  BookSearch
    Indexing (default):
      Removes characters:        /[^äöüa-zA-Z0-9\s\/\-\"\&\.]/
      Stopwords:                 /\b(und|and|the|or|on|of|in|is|to|from|as|at|an)\b/
      Splits text on:            /[\s\/\-\"\&]/
      Removes chars after split: /[\.]/
      Normalizes words:          [[/\$(\w+)/i, "\\1 dollars"]]
      Rejects tokens?            Yes, see line 10 in app/application.rb
      Substitutes chars?         Yes, using CharacterSubstituters::WestEuropean.

    Querying (default):
      Removes characters:        /[^ïôåñëäöüa-zA-Z0-9\s\/\-\,\&\.\"\~\*\:]/
      Stopwords:                 /\b(und|and|the|or|on|of|in|is|to|from|as|at|an)\b/
      Splits text on:            /[\s\/\-\,\&]+/
      Removes chars after split: //
      Normalizes words:          -
      Rejects tokens?            -
      Substitutes chars?         Yes, using CharacterSubstituters::WestEuropean.

    Indexes:
      books (Index::Memory):
        source:            Sources::DB("SELECT id, title, author, year FROM books", {:file=>"app/db.yml"})
        categories:        id, title, author, year
        result identifier: "boooookies"

      redis (Index::Redis):
        source:            Sources::CSV(title, author, isbn, year, publisher, subjects, {:file=>"data/books.csv"})
        categories:        title, author, year, publisher, subjects


    Routes:
      Note: Anchored (✓) regexps are faster, e.g. /\A.*\Z/ or /^.*$/.

      ✓  \A/admin\Z      => Suckerfish Live Interface (Use the picky-live gem to introspect)
      ✓  \A/books/full\Z => Query::Full(books, redis, weights: {[:author]=>6, [:title, :author]=>5)
      ✓  \A/books/live\Z => Query::Live(books, redis, weights: {[:author]=>6, [:title, :author]=>5)
      ✓  \A/redis/full\Z => Query::Full(redis, weights: {[:author]=>6, [:title, :author]=>5)
      ✓  \A/redis/live\Z => Query::Live(redis, weights: {[:author]=>6, [:title, :author]=>5)

This is cool, right? In one fell swoop you see who uses what stopwords, which characters aren’t removed, and how many LOC your config file has. I love it.

The routes are also available separately for just $9.99 … uh, I mean, as rake routes.

rake analyze

This task takes a look at your indexes and tells you a few statistics about them. This is most likely to evolve into something more powerful with each iteration.

For now, it gives you this:

$ rake analyze
...
Indexes analysis:
  books:id::
    exact:
      Index matches single characters.
      There's only one id per key – you'll only get single results.
      index key cardinality:                       540
      index key length range (avg):               1..3 (2.8)
      index ids per key length range (avg):       1..1 (1.0)
      weights range (avg):                    0.0..0.0 (0.0)
    partial*:
      Index matches single characters.
      index key cardinality:                       540
      index key length range (avg):               1..3 (2.8)
      index ids per key length range (avg):     1..111 (2.8)
      weights range (avg):                   0.0..4.71 (0.26)

  books:title::
    exact:
      Index matches single characters.
      index key cardinality:                      1681
      index key length range (avg):              1..19 (7.4)
      index ids per key length range (avg):      1..81 (1.9)
      weights range (avg):                   0.0..4.39 (0.33)
      similarity key length range (avg):          0..4 (3.58)
    partial*:
      Index matches single characters.
      index key cardinality:                      7010
      index key length range (avg):              1..19 (6.29)
      index ids per key length range (avg):     1..242 (3.08)
      weights range (avg):                   0.0..5.49 (0.52)

Most of it is probably gibberish. Picky tries to give you useful notes (in color, not visible above) about the indexes, for example the Index matches single characters (when a single character already gets results) or as a warning There’s only one id per key – you’ll only get single results (when you’ll only get one result id per query – which might not be what you want).

rake index…

Frankly, if you haven’t seen rake index yet, you haven’t tried Picky yet. If this were a flow diagram, you’d be sent back to the start ;)

rake index does just that. It indexes.

You can tell it in what order to index them by using rake index:ordered and rake index:randomly, which will index the indexes either in the order they were defined or in a random fashion. Default is randomly, but if you’re not happy with that, tell Picky explicitly.

You can tell Picky to just index a single index, or even more specific, a single category inside a given index. Use rake index:specific[books,title]. It also tells you when an index or category is not there:

$ rake index:specific[books,isbn]
...
rake aborted!
Index category "isbn" not found. Possible categories: "id", "title", "author", "year".

rake check…

rake check:index checks the indexes for suspiciously small or nonexistent indexes.

rake start/stop

One starts a Unicorn server, one stops it. I always forget which is which.

It’s not too webserver agnostic yet, but as soon as somebody complains, I will rewrite it to be so – if you’re not faster with one of these beloved pull requests :)

rake spec

You will be surprised by this one: Runs the specs in the spec directory.

Conclusion

So we’ve seen

that Picky does not just rake index and rake start.
that Picky gives you a few command line tools (apart from the web tools) to find bugs in your config.
that Picky is not just good for picking up girls in bars.

Hope you learnt something new :)

Searching with Picky: Redis

2011-03-02T00:00:00+11:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This post will be a very short introduction on Redis index backends and Picky, and how to configure your indexes to run on Redis.

I intended to do a massive writeup, but since all you do is change 6 characters Memory into 5 different characters Redis it just seemed like a massive overkill.

I admit though that many massive writeups have been done on even smaller changes, like “1.8” → “1.9” ;)

Ok, so what am I talking about?

tl;dr

Redis can now be used in Picky as an index backend.
In your config, do Index::Memory.new → Index::Redis.new and you’re set :)
Memory and Redis indexes cannot (yet) be mixed and matched.
In 1.5.0, Picky uses Redis database 15.

What is Redis?

Redis is – taken from the website – an “open source, advanced key-value store”. But this is not all. It also is a “data structure server”. Check it out on its very nicely done website.

“But we already have the massively fast in-memory backend. Why Redis?”, you scream, indignantly.

Why Redis?

Granted, in-memory indexes in Picky are really fast. But they have a few drawbacks:

Relatively slow search engine startup, as the JSON index files need to be loaded into memory. This is especially noticeable if the index is around 12 GB.
To restart Unicorn without a hitch you need double the space the in-memory index needs, since Unicorn starts up a second master in parallel to the old one.
They need to be reloaded to be updated (see last blog post).

I haven’t had any problems with that, but I can see the problem. Hence, Redis.

How do you use Redis indexes?

Looking at the configuration that the scaffolding generates, you see that it uses an Index::Memory called books:

books_index = Index::Memory.new :books, Sources::CSV.new(:title, :author, file: 'app/library.csv')

If you’d like to use the Redis backend instead, you’ll have to change Memory into Redis.

books_index = Index::Redis.new :books, Sources::CSV.new(:title, :author, file: 'app/library.csv')

I know. Picky is hard on the typing hand ;)

Uh. That’s already it. Welcome Redis. Bye bye, Memory.

What you have to do now is re-index and start Picky:

$ rake index
... indexing output ...
$ rake start

Or, start Picky, re-index and search while it is indexing, to get some added fun value.

What is the impact of Redis indexes?

Compared to the in-memory index, what are the advantages and disadvantages?

Advantages:

Faster startup time, especially with a large index.
Indexing as-you-search. (No index reloading)

Drawbacks:

Factors slower.

Caveats / Next Versions

The Redis backend implementation in Picky is not yet customizable. This means that:

It uses Redis database 15.
Returned entry ids are always strings, even when they were integers going in. You’ll have to convert them back.
Redis and Memory indexes cannot (yet) be mixed and matched. So this isn’t possible: Query::Full.new(redis_index, memory_index). Picky will notify you if you try to do so, so no worries.
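
Converting the ids back is a one-liner. Here is a minimal sketch in plain Ruby, assuming the ids arrive as an array of numeric strings. The name ids_from_redis and its values are made up for illustration and are not taken from Picky:

```ruby
# Hypothetical result ids, as strings, the way a Redis-backed index hands them back.
ids_from_redis = ['17', '4', '1024']

# Integer() raises on anything that is not a clean number,
# whereas String#to_i would silently turn "abc" into 0.
ids = ids_from_redis.map { |id| Integer(id, 10) }

p ids
```

Whether you prefer the strict Integer() or the lenient to_i depends on how much you trust what went into the index.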

I am focusing on these points in the upcoming 1.5.* versions.

Outlook

One of the next blog posts will look at the performance differences between the Redis backend and the memory backend.

I can already reveal that the memory backend will be faster. Surprise! ;) The question is: Is Redis so much slower as to be unbearable?

Music, pregnant with suspense, fills the room: Dun dun DUNNN.

Conclusion

So we’ve seen

what Redis is.
that Picky offers two different index backends: In-Memory and Redis.
how you use/implement the Redis index backend in your search.

Hope you learnt something new :)

Searching with Picky: Live reloading indexes

2011-02-20T00:00:00+11:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This post is on reloading indexes by way of signals. So, first let’s talk a little about signals. Then, in the second half, I talk about reloading the memory index in Picky.

Warp 9?

Signals in Ruby
Still calling the old trap handler
Reloading the indexes
Back when Ruby was mostly foxes and bacon
Conclusion

En-gage.

Signals in Ruby

Signals are a way of sending instructions to a running process. Here’s a list of signals.

In Ruby you handle these signals by giving the Signal.trap method a handler block.
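
A minimal sketch of a single handler, assuming a POSIX system where USR1 exists. To keep it self-contained, the script sends the signal to itself with Process.kill instead of waiting for a kill from the shell:

```ruby
received = []

# Install a handler block for the USR1 signal.
Signal.trap('USR1') { received << :usr1 }

# Send the signal to ourselves instead of using kill from a shell.
Process.kill('USR1', Process.pid)
sleep 0.1 # give Ruby a moment to run the handler

p received
```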

What if I give it two? Let’s try it.

Signal.trap('USR1') { p "hello" }
Signal.trap('USR1') { p "world" }

# Print out the process PID such that it is easier
# to enter "kill -USR1 the_printed_process_pid"
#
puts Process.pid

# You have sixty seconds to defuse … err try this example.
#
sleep 60

Then, enter kill -USR1 <the printed process pid> and see what happens.

What happens is that the second block that prints “world” replaces the first one. So we see:

type here> ruby signals.rb 
77306
"world"

Ruby throws the old block away. What if I don’t want this?

Still calling the old trap handler

So, for example, in Unicorn, sending the USR1 signal reopens all logs. What if I want to do something else? If I just do

Signal.trap('USR1') { something_else }

the old handler will be gone.

So, my assumption was that Ruby gives me the old handler when calling

old_handler = Signal.trap('USR1')

Nope. Hurting the POLS a little here. It only gives me the old handler when installing a new one.

So what can you do? Use this “trick”:

old_handler = Signal.trap('USR1') {}
Signal.trap('USR1') { something_else; old_handler.call }

So I install a bogus handler to get the old handler, then throw the bogus handler away, right away, and call the old handler in the new handler.
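
Here is the trick as a runnable sketch, assuming a POSIX system. The first trap stands in for a handler somebody else installed. The respond_to? guard is my addition: if no block was installed before, Signal.trap returns a string such as "DEFAULT" rather than a Proc, and calling that would fail.

```ruby
calls = []

# Stand-in for a handler that was installed before we came along.
Signal.trap('USR1') { calls << :old }

# Install a throwaway handler just to get hold of the old one...
old_handler = Signal.trap('USR1') {}

# ...then install the real one, which calls the old handler last.
Signal.trap('USR1') do
  calls << :new
  old_handler.call if old_handler.respond_to?(:call)
end

Process.kill('USR1', Process.pid)
sleep 0.1 # give Ruby a moment to run the handler

p calls
```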

Reloading the indexes

Currently, Picky does not support realtime indexes. It also runs with memory-only indexes (a Redis index backend is in the works). So, while the Picky server is running, it does not by itself pick up the new indexes, even if you generate new index files by running rake index.

Btw, did you ever try to call rake -T while in your Picky server project?

How can we reload the indexes?

Quite easy, actually. Reloading the memory indexes is done by calling

Indexes.reload

That’s it.

How do we get the Picky server process to call Indexes.reload?

Now talking about all that signal handling pays off! :)

… in a non-forking web server.

When running Picky in a non-forking web server, in e.g. thin, in the file app/application.rb we’d call

Signal.trap('USR1') { Indexes.reload }

and then in the Terminal, we run

type here> rake index
... (Picky indexes and writes new index files. Afterwards you tell the server to reload the indexes.)
type here> kill -USR1 your_picky_server_process_id

You should see some output that the server has reloaded the indexes.

… in a forking web server.

Unicorn, for example. Picky’s current web server of choice.

Since Unicorn already defines USR1, we use the trick we’ve talked about above to not replace the unicorn handler (if you need it):

old_handler = Signal.trap('USR1') {}
Signal.trap('USR1') { Indexes.reload; old_handler.call }

(Doesn’t have to be USR1, btw)

After indexing and sending the USR1 signal to the Unicorn master, we aren’t finished: the indexes have only been reloaded in the master, while the children are still happily using the old indexes.

Check out this very helpful page about signals in Unicorn. If preload_app is set to false in the unicorn.ru, you can just send a HUP signal to the master. It will then kill all children, and fork new ones. Finished.

When using Unicorn, you may of course also use the way Unicorn does it. See the instructions under “Procedure to replace a running unicorn executable”.

Good stuff! Although this procedure uses around double the memory the Picky server uses normally, while the index reloading uses around 1.5 times the size of the largest sub-index (in a nutshell, a lot less than the Unicorn replacement technique).

… periodically.

What about reloading the indexes periodically?

You could, of course, try to use a Thread, trying to reload the indexes every X time units and monkey around with it (tell me if you are successful :) ). I wouldn’t.

I recommend triggering rake index externally, and then triggering the reloads from outside using the signals mentioned above.
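
One way to script that from outside is sketched below. It assumes the server traps USR1 as shown above. The helper name reindex_and_reload is made up for this example, and how you find the server's process id (a pid file, for instance) is left to you:

```ruby
# Runs the indexing command and, only if it succeeded,
# asks the server process to reload its indexes.
def reindex_and_reload(server_pid, index_command = 'rake index')
  return false unless system(index_command)
  Process.kill('USR1', server_pid)
  true
end
```

Called from cron or a deploy script, this keeps the periodic work out of the server process entirely. If indexing fails, no signal is sent and the server keeps its old indexes.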

Btw, a fun thing with signals you should not do

Back when Ruby was mostly foxes and bacon, I happened to type this:

begin
  p Process.pid
  looong_running_method
rescue Exception => e
  p "Oh deary me!"
  retry
end

Note: I did not actually type looong_running_method and "Oh deary me!", but you get the idea ;)

The idea was that if the long running method fails, it’d just retry running it.

Sounds good, right? Try running it, and stop it with Ctrl-C. The problem is the line rescue Exception => e.

Why? I soon found out that catching all Exceptions is not a good idea if you’d like stopping your program by way of Ctrl-C, since SignalException inherits from Exception:

p SignalException.ancestors # => [SignalException, Exception, Object, Kernel, BasicObject]

Ctrl-C sends a SIGINT, an INT signal to your process. Internally, a SignalException is raised, which is then caught by the rescue.

A kill -9 sends this process to Walhalla. The place where all programs go that have incurred a major learning experience on their writers.
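
A gentler variant is to rescue StandardError instead. The sketch below swaps looong_running_method for a made-up stand-in that fails twice and then succeeds, so that it terminates on its own:

```ruby
# Interrupt (what Ctrl-C raises) is a SignalException, and neither
# of them is a StandardError, so this rescue leaves Ctrl-C alone.
attempts = 0
begin
  attempts += 1
  raise 'flaky failure' if attempts < 3 # stand-in for looong_running_method
  puts "succeeded after #{attempts} attempts"
rescue StandardError
  retry
end
```

A bare rescue without a class does the same thing, since it also only catches StandardError and its subclasses.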

Conclusion

So we’ve seen

how signals work
that reloading indexes in a running Picky server is easy
how you use signals to reload the server
how reloading works in different web servers
that reloading the indexes isn’t without problems
that you need to be careful when catching exceptions

Hope you learnt something new!

A better Rubygems search

2011-02-13T00:00:00+11:00

Some time ago, Kaspar mentioned to me that it would be nice to have a gem dependency search, i.e. where you could search in which gems a gem is used.

I thought so too, so I wrote one :) (and added some more features in the process)

Take a look: http://gemsearch.heroku.com/

(Note, it might take Heroku some time to ramp up the server)

Current state

While the current search isn’t bad, it is missing the possibility of searching for an author, where a gem is used, or which version it has. Or any combination thereof, for that matter.

Building the search

I happened to have a fast & clever search engine lying around ;) so this is what I used.

How do you go about building or configuring a search engine?

1. Look at what your goals are.

My goals seemed simple enough.

Each gem should be findable under:

Its name (Try it).
Its version(s), entered like x.y.z, or part thereof, x, x., x.y, x.y. (try it).
Its author’s names, or first/last names. Or parts thereof, like “flo” for florian (try it).
The gems it is dependent upon. universe-parsing depends on parslet, for example (try it).
The names, gem name and dependent gem name should be phonetically findable (try it).
The authors too should be phonetically findable – since who knows how to write “Heinemeier” (try it)?
All should be findable without entering the whole thing, like “1.0”, or “activesupp” (try it).
One should be able to specify what he is looking for by prefixing e.g. “uses:” in front of the search term (Try it). Or others, like “dependency:”, “dependencies:”, “depends:”, “using:”, “uses:”, “use:”, “needs:” (all possible).

I leave out the description for now, as it requires quite a bit of thinking and tinkering.

With the goals defined…

2. Look at the data.

I downloaded the Marshal file, extracted the relevant data and restructured it into a readable CSV.

Two potential problems I noticed:

Gem names are spaced using either an underscore _ or a hyphen -.
For the same name, there are sometimes up to three different encodings. Take the gems of Nicolás Sanguinetti for example. Try it and look at the author names.

Those were problematic. What does one do? Try to find an optimal solution.

3. Marry the goals and the data.

I decided not to tackle the display issues of the second point, encodings, but just the indexing issues. What I do is use character substitution, to make “Nicolás” findable under “nicolas”. This I do by saving the name as “nicolas” in the index, and also perform this character substitution on each search. Try it with án áccent.
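
The idea can be sketched in plain Ruby. This is not how Picky implements its character substitution, just an illustration of it, and the method name fold is made up. It relies on String#unicode_normalize, which needs Ruby 2.2 or newer:

```ruby
# Folds accented characters to their base letters and downcases.
# Meant to be applied to both the indexed names and the search terms.
def fold(text)
  decomposed = text.unicode_normalize(:nfd) # accented letter becomes letter + combining mark
  decomposed.gsub(/\p{Mn}/, '').downcase    # drop the combining marks
end

p fold("Nicol\u00E1s") # "Nicolas" with an acute accent on the a
```

Because the same folding runs at index time and at query time, it does not matter which of the two sides carried the accent.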

Deciding on what to do with the gem names was a little harder. What is the problem?

The problem is manifold. For one, searchers should not need to know whether a gem was spaced with an underscore or a hyphen. Actually, I thought it best they be able to find it using a space. So the picky-live gem should be findable by typing “picky live” (Try it).

However, if you then look for “sinatra”, the actual sinatra gem is not the first in the list. This is because I opted to go for an alphabetical ordering.

However, if I need the user to enter the full name (like “anthonymoralez-apn_on_rails”), they might not find it at all.

So, the way I did it now is have the user be able to use spaces when searching and trust people to depend on Picky’s combinatorial nature. For example, if you look for sinatra and know that one of the owners is called Tomayko, you’ll get to your answer directly: Search for ‘sinatra tomayko’.

Generally, the more you can help Picky, the more it will help you right back.

4. Have users try it and get feedback.

This is where you come in :)

Check it out, if you haven’t yet and tell me what you think @hanke! Do you have ideas for improvement? (If yes, tell me which so I can improve it)

How about we use this search on rubygems.org? :)

A few technical Picky specifics.

A few Picky specifics for insiders:

We have 4 data categories: name, version, author, dependencies.
The partial search “rail*” is done using Partial::Substring.new(from: 1).
The similarity “hallou~” is done using: Similarity::Phonetic.new(2).
A singly occurring name will be weighted up a little: :weights => { [:name] => +1 }.
The author for example can be prefixed with: qualifiers: [:author, :authors, :written, :writer, :by].

Yes, currently I break the web with hashtags – I’m rewriting it to use pushState.

Thanks

Many thanks to Heroku for providing the infrastructure!

Conclusions

So we’ve seen

that there’s a better way to search Rubygems
where you can go to try it
how you could go about creating a search

Hope you learnt something new :)

Running Sinatra inside a Ruby Gem

2011-02-02T00:00:00+11:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

In this post I’ll show how to have Sinatra run directly from inside a gem. And at the end, how Picky uses this to its advantage.

Let’s go singing in the gem!

The thing is…

What I wanted, was to add a nice statistics web interface to Picky.

First I thought about adding it to the server, but soon after (~1.2µs) decided that this was a silly idea.

Picky is heavily designed around loosely connected elements in the server. I think this is even a better idea outside of a large component such as a server. So what I found myself thinking about – while showering – next was, to have a gem which generates a Sinatra application…

Suddenly the room lit up and I spotted, scrawled on the wall in burning letters of blood:

The wrong question.

I gave it not much thought, as it can get crazy in this part of Zürich. Then, while gorging myself on my beloved alphabet soup, and thinking about how to structure files in this web application exactly, the letters suddenly formed a sentence:

Dude the wrong, fucking question.

(Soups can only spell so well)

I only got it a few hours later, while three Swedish massage therapists kneaded my shoulders.

In computer science, the answers aren’t nearly as important as asking:

…the right fucking question.

The right fucking question

The right question is:

How do I fit a web application wholly in a gem, such that I can do a

$ picky stats log/search.log

on any Picky logfile and it will parse it and show me a nice statistical representation of it in a browser without soiling the directory and everything else?

The right fucking tool for the job

That’s Sinatra I’m talking about. The great and extremely easy to use Ruby DSL for web applications.

Give it a whirl if you haven’t seen it!

How to do it

First, set up a gem structure – let’s call the gem “rain_singing”. Then, inside it, set up the following structure:

rain_singing
  /bin
  /lib
    /rain_singing
      /application   # <- the app is in here
        app.rb       # <- the webapp itself
        /images
        /javascripts
        /stylesheets
        /views
    rain_singing.rb
  rain_singing.gemspec
  /spec

The “hardest” thing is getting the directories correctly set up.

So what you do inside app.rb is:

require 'sinatra'
require 'haml' # if you use haml views

class SingingRain < Sinatra::Base

  set :static, true                             # set up static file routing
  set :public_folder, File.expand_path('..', __FILE__) # static dir (images/js/css inside); Sinatra before 1.3 called this :public
  
  set :views,  File.expand_path('../views', __FILE__) # set up the views dir
  set :haml, { :format => :html5 }                    # if you use haml
  
  # Your "actions" go here…
  #
  get '/' do
    haml :'/index'
  end
  
end

# Run the app!
#
puts "Hello, you're running your web app from a gem!"
SingingRain.run!

And that’s already it for the app.

Now, if you want to define a binary for the gem, put an executable rain_singing file into /bin. Into this file you’d write:

#!/usr/bin/env ruby
#
begin
  require 'rain_singing/application/app.rb'
rescue LoadError => e
  require 'rubygems'
  path = File.expand_path '../../lib', __FILE__
  $:.unshift(path) if File.directory?(path) && !$:.include?(path)
  require 'rain_singing/application/app.rb'
end

Then, we need to tell rubygems that this gem has an executable:

Gem::Specification.new do |s|
  
  ...
  
  s.executables = ['rain_singing']
  s.default_executable = 'rain_singing'
  
  ...
  
end

After generating your gem with

$ gem build rain_singing.gemspec

and installing it with

$ gem install rain_singing-1.0.0.gem

you are ready to run

$ rain_singing
Hello, you're running your web app from a gem!

Good stuff. Good stuff. Makes me want to sing in the rain.

In Picky

Picky uses this for two things.

A statistics interface ($ gem install picky-statistics), run

$ picky stats path/to/your/search.log 1234

or the live interface to the running server ($ gem install picky-live), run

$ picky live localhost:8080/admin 1234

You need to add route %r{/admin} => LiveParameters.new in the server to have it work. But then you get the interface described in this blog post.

Nice, eh?

Conclusions

So we’ve seen

that Sinatra rocks my noodles
that a Gem can contain a whole webapp without footprint
that Picky uses both for maximal profit!

Hope you learnt something new :)

Parslet Intro

2011-02-01T00:00:00+11:00

Tonight I wanted to take some time off from Picky to write about Parslet, a parser construction library by my dear friend Kaspar Schiess.

tl;dr

Parslet is great.
gem install parslet
Look at any of the examples.
Try, learn, try again, profit!

What is it?

In Kaspar’s words: “A small Ruby library for constructing parsers in the PEG (Parsing Expression Grammar) fashion”.

A parser is used to transform text data into a semantically meaningful structure by injecting information based on assumptions on the text’s structure. For example, "Hello, Florian!" could be parsed into something like: [sentence: [greeting:hello, separation:comma, name:florian, mark:exclamation]].

It’s probably best if you just tried it for yourself.

Are there other parser constructors?

Yes, Citrus and Treetop. But let’s be frank here. Parslet eats these for breakfast in terms of ease of use and power, in my humble and almost unbiased opinion. Let me explain why.

Why is it so powerful and easy?

On the main page, Kaspar notes that Parslet is especially easy by “providing the best error reporting” and “not generating reams of code for you to debug”.

Both are certainly true, and I do not disagree, but I don’t think that is what makes Parslet so easy or powerful. Surely easi-er, but the main reason I love it is that it harnesses the power of Ruby.

The second reason I consider it so great is that it is split into a parser and a transformer step, with an intermediate syntax tree that is entirely in Ruby basic atoms, like hashes and arrays.

Why is this cool? To repeat my example, above: The parser would first parse "Hello, Florian!" into [sentence: [greeting:hello, separation:comma, name:florian, mark:exclamation]] and then, for example, a FrenchTransformer could be used to transform this into: Bonjour, Florian!, the French representation of the English input sentence. So first we get an intermediate semantic expression that we can then transform into something else. And there can be a lot of transformers starting from where the parser ended. Thinking about a SwedishTransformer or an ItalianTransformer? Me too. “Optimus Primo, transformate! Ciao!”

Or a chain of transformers that first take the intermediate tree and morph it into a different intermediate tree. The possibilities are endless.
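
To make the two-step idea concrete without installing anything, here is a plain-Ruby stand-in. It does not use Parslet at all: the tree is written by hand in the shape sketched above, and FrenchTransformer is a hypothetical class invented for this illustration.

```ruby
# The intermediate tree, written by hand: just hashes, symbols and strings.
tree = { :sentence => { :greeting   => 'hello',
                        :separation => 'comma',
                        :name       => 'florian',
                        :mark       => 'exclamation' } }

# Hypothetical transformer: walks the tree and renders French text.
class FrenchTransformer
  GREETINGS   = { 'hello' => 'Bonjour' }
  PUNCTUATION = { 'comma' => ',', 'exclamation' => '!' }

  def apply(tree)
    s = tree[:sentence]
    "#{GREETINGS[s[:greeting]]}#{PUNCTUATION[s[:separation]]} " \
      "#{s[:name].capitalize}#{PUNCTUATION[s[:mark]]}"
  end
end

puts FrenchTransformer.new.apply(tree)
```

The point is that the transformer never sees the original text. It only needs the tree, so a second transformer for another language could start from exactly the same data.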

Simple Example

Let’s consider a simple example. It is a subpart of the ERB parser and transformer that I wrote. ERB is a Ruby templating language by Seki Masatoshi.

We’ll look at the whole thing later on.

A simple ERB example would be ERB with a Ruby expression inside:

Hello, my name is<%= name %>!

What we get out of the parser is the parts that are text, and the parts that are ruby code. So with parslet we’d write this:

require 'parslet'

class ErbParser < Parslet::Parser
  
  rule(:ruby_expression) { (str('%>').absent? >> any).repeat.as(:ruby) }
  rule(:erb_with_tags) { str('<%=') >> ruby_expression >> str('%>') }
  
  rule(:text) { (str('<%=').absent? >> any).repeat(1).as(:text) }
  
  rule(:text_with_ruby_expressions) { (text | erb_with_tags).repeat }
  root(:text_with_ruby_expressions)
end

p ErbParser.new.parse("Hello, my name is<%= name %>!")

Just run it :) What you get is a nice semantic tree:

[{:text=>"Hello, my name is"}, {:ruby=>" name "}, {:text=>"!"}]

Let me go through it in steps. I’ve found out that it is easiest for me to go top-down to define a parser. I hope this suits you too.

We define the starting point, aka the root of the parser with the root method:

root(:text_with_ruby_expressions)

This just says, start with the rule(:text_with_ruby_expressions).

So, now what we know about our simple-ERB language is that it is basically a sequence of text and ruby expressions, repeating. So let’s define that:

rule(:text_with_ruby_expressions) { (text | erb_with_tags).repeat }

So either we have text OR (|) a ruby expression. And we have that in a repeating fashion. Just as the rule says.

Let’s look at the text rule we just used:

rule(:text) { (str('<%=').absent? >> any).repeat(1).as(:text) }

This means: As long as you don’t encounter an ERB start tag (<%=), keep taking everything as text. This will stop if it encounters a <%=.

At which point Parslet will try to apply the other rule:

rule(:erb_with_tags) { str('<%=') >> ruby_expression >> str('%>') }

This rule just matches anything with erb start <%= and end tags %> around it, with a ruby expression inside.

The ruby expression is simple:

rule(:ruby_expression) { (str('%>').absent? >> any).repeat.as(:ruby) }

We know this already: As long as you don’t encounter an ERB end tag, keep consuming as ruby code.

Got it?

Again, if you run it, you get:

[{:text=>"Hello, my name is"}, {:ruby=>" name "}, {:text=>"!"}]

Niiice.

Let’s not think about the transform step for a second and look at some of the good shit.

Goodies that will blow your mind.

Parslet doesn’t force you to use a class. It’s totally ok to just do this:

include Parslet
parser = (str('Hello') | str('Hi')).as(:greeting)
p parser.parse('Hello')

In Parslet, you can run the parser with a subset of its rules:

p ErbParser.new.erb_with_tags.parse("<%= name %>")

while

p ErbParser.new.erb_with_tags.parse("Hello, <%= name %>!")

would fail since the erb_with_tags rule just covers text which starts with <%= and ends with %>.

Running a parse on a subrule works because a parser is composed of Parslets, or parser atoms, hence the name. str('hello') is one of these atoms, and so is a sequence of atoms, like str('no') >> str('kidding'). And you can do a parse directly with one of these, if you want, (str('Hello') | str('Hi')).parse('Hello') as we have seen before.

Did I say it’s pure Ruby? Why, yes! Let’s harness the power of Ruby, and combine it with the power of Parslet parser atoms.

I need a parser that is case insensitive regarding the string.

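def case_insensitive string
  chars = string.split //
  chars.inject(str('')) do |parslet, char|
    # No "|" inside the character class: [hH] already means "h or H",
    # and a "|" in there would also match a literal pipe.
    parslet >> match("[#{char.downcase}#{char.upcase}]")
  end
end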

p case_insensitive('hello').parse('HeLLo')

This returns me a case insensitive parser that I can directly use to parse the HeLLo. Or why not combine it with other parslets?

p (case_insensitive('hello') >> str(' ') >> str('Florian')).parse('HeLLo Florian')

Transforming

Can you take a quick look at the ERB parser, copy it into a script and give it a go?

As you can see, it’s not just able to parse text and ruby expressions (<%= ruby expression %>), but also comments (<%# comment %>) and normal ruby code (<% ruby %>) that both will not be inserted into the rendered text.

Now we’ll have a look at the transformer that will spit out rendered text:

evaluator = Parslet::Transform.new do
  
  erb_binding = binding
  
  rule(:code => { :ruby => simple(:ruby) }) { eval(ruby, erb_binding); '' }  
  rule(:expression => { :ruby => simple(:ruby) }) { eval(ruby, erb_binding) }
  rule(:comment => { :ruby => simple(:ruby) }) { '' }
  
  rule(:text => simple(:text)) { text }
  rule(:text => sequence(:texts)) { texts.join }
  
end

Ignore for now the part where bindings are used.

A transformer consists of a number of rules. And a rule consists of a part that recognizes structure in the semantic tree, and a block which tells the transformer what to do with the recognized thing. Got it? So this rule,

rule(:text => sequence(:texts)) { texts.join }

recognizes hashes that look like :text => sequence(:texts), sequences of things that are denoted as text. The identifier :texts is used in the block where we tell the transformer what to do: { texts.join }. So what we do is simple, we just join a sequence of texts together.

Another rule, the comment rule,

rule(:comment => { :ruby => simple(:ruby) }) { '' }

will return just nothing.

Now, if we want to parse and transform something like this:

The <% a = 2 %>not printed result of "a = 2".
The <%# a = 1 %>not printed non-evaluated comment "a = 1", see the value of a below.
The <%= 'nicely' %> printed result.
The <% b = 3 %>value of a is <%= a %>, and b is <%= b %>.

It gets a little more complicated. If you look at line 1, you see that a is given a value of 2. And then we will use that value in line 4, where we put the result of 2 into the rendered template. Have you tried it? No? Run it and see :)

Remembering State

If you want the transformer rules to remember values in between transformations – like the a that is set to 2, above, you’ll need state of some sort.

I can show you the way I did it with the ERB transformer. I’m sure you can think of many others that are perhaps safer, more powerful, or simply cleaner. But for now, we’ll have a look at this:

evaluator = Parslet::Transform.new do
  
  erb_binding = binding
  
  rule(:code => { :ruby => simple(:ruby) }) { eval(ruby, erb_binding); '' }
  
  ...

end

What happens here? First, I assign the binding of the block to erb_binding:

erb_binding = binding

This is the object where we will save the state.

It’s a good thing for me that the rule method uses a block to define what to do when encountering a rule. Why? Well, since it is a block, the local variable erb_binding is bound in the context of the block, which means that I have easy access to it in { eval(ruby, erb_binding); '' }.

So what I do with

eval(ruby, erb_binding); ''

is: Evaluate the code piece that I get in the variable ruby, and evaluate it with the binding I have saved. Then, I return an empty string since <% ruby code %> should not write anything into the resulting rendered template.

Not so in the expression:

rule(:expression => { :ruby => simple(:ruby) }) { eval(ruby, erb_binding) }

Here I return whatever the evaluation returned to be inserted into the rendered result.
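
The binding trick also works outside of Parslet. Below is a stripped-down sketch in plain Ruby, where a hand-written list of [kind, source] pairs stands in for the parsed template (this list format is my invention, not what the ERB parser emits):

```ruby
erb_binding = binding

# [kind, source] pairs standing in for the parsed template.
pieces = [
  [:code,       'a = 2'],
  [:text,       'The value of a is '],
  [:expression, 'a'],
  [:comment,    'a = 1'],
  [:text,       '.']
]

rendered = pieces.map do |kind, source|
  case kind
  when :code
    eval(source, erb_binding) # run it and remember its effects...
    ''                        # ...but render nothing
  when :expression
    eval(source, erb_binding).to_s
  when :comment
    ''
  else
    source
  end
end.join

puts rendered
```

The a that the :code piece sets lives inside erb_binding, which is why the later :expression piece can still see it. The :comment piece is never evaluated, so a stays 2.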

Isn’t it nice? And between parser and transformer I was able to look at my nice semantic tree, to check that everything is a-ok.

Writing tests, as everything is in Ruby, is a breeze, as you can imagine!

Conclusion

My personal conclusion is that this thing is here to stay.

Not only is it easy to use, but you have the full power of Ruby available to write parsers, comfortably.

It already has garnered the attention of quite a few excellent Rubyists – the hard core of parslet users – who hang out at the #parslet IRC channel.

So we’ve seen

that Parslet harnesses Ruby’s powers for success and profit.
that it offers a parser constructor AND a transformer constructor, which is a good thing.
that trying it yourself is fun and a piece of cake.
And: That using bindings is crazy fun when used at the right place :)

Hope you learnt something new :)

Searching with Picky: Live Parameters Part 2

2011-01-27T00:00:00+11:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This is the second part of the Live Parameters blog post that deals with the problem of hot replacing a configuration of a search server like Picky running in a multiprocessing server like unicorn.

tl;dr

gem install picky-live

Server: In app/application.rb, insert

route %r{\A/admin\Z} => LiveParameters.new

Enter picky live on the command line.
Open The Suckerfish Interface.
Have fun!

What was the problem, again?

The goal is that we want to update Picky’s config while it is answering search requests.

The problem is that we need to update the config in the master process, but most multiprocessing servers don’t allow easy access. And it’s good like that.

What I’d like to do is provide access for a suckerfish. But since it isn’t easy or a good idea to open direct access to the parent, the suckerfish must go through the child.

The child would accept data incoming from the suckerfish, process it and tell the parent what to change.

So what we’d need is for the child to be able to write the parent. It’s actually quite easy to do in Ruby. But how?

The simplest way to write your parents.

… apart from picking up a pen once in a while? Your mother didn’t spend 20 hours of her life in labor just for fun, you know!

Heh.

First, you open an IO.pipe. Then, in the fork (the child), you close off the “child” and then you are ready to write.

In the parent, you do the opposite, and call gets (for example) then wait for a message from the child.

child, parent = IO.pipe

fork do
  # In child.
  #
  child.close
  puts "#{Process.pid}: I'll write soon."
  parent.write "Hello from child!"
end

# In parent.
#
parent.close
message = child.gets '!'
puts "#{Process.pid}: #{message}"

It’s copy-and-try!

Process.pid returns the current process id, which is different in the child and the parent, as you can see after trying the example. In the parent, the child.gets with a parameter will read up until having received that string, then return whatever has been read so far.

I always look at child and parent as if the child was a perfect copy of the parent. Anything you change in the child won’t affect the parent. But if you change something in the parent, it will affect all future children.
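To see the copy behaviour for yourself, here is a minimal sketch. The config hash and its splits_text_on key are just stand-ins I made up for the example, and it assumes a platform where fork is available (Linux, macOS). The child changes its copy and reports back through a pipe; the parent's copy stays as it was.

```ruby
# A made-up config, standing in for whatever the server holds.
config = { splits_text_on: /\s/ }

reader, writer = IO.pipe

pid = fork do
  # In child.
  #
  reader.close
  config[:splits_text_on] = /,/ # Only changes the child's copy.
  writer.write config[:splits_text_on].source
  writer.close
end

# In parent.
#
writer.close
child_value = reader.read # Reads until the child closes its end.
reader.close
Process.wait pid

parent_value = config[:splits_text_on].source # Still the original.

puts "child:  #{child_value}"
puts "parent: #{parent_value}"
```

The child reports the comma, the parent still has its whitespace regexp. That is exactly why updating a child alone is not enough.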

How Picky does it

Five steps:

The Picky child receives the config update request.
It tries to update its config (more on that below).
If successful, it tells the parent. If not, it kills itself, and tells Suckerfish which config was wrong.
The parent, on receiving the message, updates itself and kills off all other children (more on that below).
The child will answer Suckerfish with the current configuration.

The messaging is basically the same as above, but a bit more elaborate in Picky, since:

Picky doesn’t have control over the forking. This means Picky doesn’t know when to close the child, which is why on each call received on the API, we just do a
```
@child.close unless @child.closed?
```
The server inside which Picky is running will fork off the parent multiple times, and not just at the beginning. So, if the parent would do a
```
@parent.close
```
as in the example, then yes, it would work fine. Up until the next time a child is forked. What happens when a child is forked? The connection to the parent would already have been closed off by the parent itself, and the child would be unable to write on it. Solution? I just leave it open, since the parent doesn’t need to talk to the child. (Ensuring years of therapy for the child)

How does Picky ensure there will be no problems in the parent process?

What would happen if the Suckerfish had direct access to the master’s configuration?

We assume that the child is a close to perfect copy of the parent process. So what we do is try updating the configuration in the child first.

If that works, we can assume that in the parent, it will work too (no malformed configuration input). So we just send the parent the data and the parent will use the exact same method as the child to update itself.
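The "try it in the child first, then tell the parent" idea could be sketched like this. This is not Picky's actual code: update_config, the message format and the config hash are all hypothetical, and a malformed regexp stands in for any kind of bad input. It needs a platform with fork.

```ruby
# Hypothetical stand-in for a config update.
# Regexp.new raises a RegexpError on malformed input.
def update_config config, key, value
  config.merge(key => Regexp.new(value))
end

config = { splits_text_on: /\s/ }
reader, writer = IO.pipe

pid = fork do
  # In child: try the update on the child's copy first.
  #
  reader.close
  begin
    update_config config, :splits_text_on, ','
    writer.puts 'splits_text_on ,' # It worked, so tell the parent.
  rescue RegexpError
    # Malformed input: the parent is never told.
  end
  writer.close
end

# In parent: wait for a message, then use the same method to update.
#
writer.close
message = reader.gets # nil if the child closed without writing.
reader.close
Process.wait pid

if message
  key, value = message.chomp.split ' ', 2
  config = update_config config, key.to_sym, value
end

puts config[:splits_text_on].inspect
```

Since the parent runs the exact same update method on data that already worked in the child, it should not blow up on it.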

Now we have the problem that there are still children hanging around with the old config. So what the parent process – any good parent ;) – does is kill all of these. The one giving it the ok config is spared, since it has the new config already. After that, new children are forked with the correct config.

What happens if the config is malformed? The child that accepted the suckerfish request needs to die, since its config might now be malformed. So what it does is prepare for an honorable Harakiri, tell the Suckerfish what is wrong, and perform a horizontal cut through its stomach, using Process.kill(:QUIT, 0).

But… how do I get it to work in Picky?

How you configure it in Picky

Simple – you open a http interface in app/application.rb the same way as you would for a query. But this time, instead of a query, you have it point to an instance of LiveParameters.

Like that:

route %r{\A/admin\Z} => LiveParameters.new

And then, you have to…

No, wait. That’s it.

This opens a JSON interface into the heart of your Picky configuration.

The interface

HTTP query params in, JSON hash out.
On success, it returns the complete config, always.
On failure, it returns the offending key with the value “ERROR”.
If you pass in no query params, nothing will get updated, but you still get the config hash.
If you pass in something like …?querying_splits_text_on=\s, it will update its config to split text on whitespaces.
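If you'd rather not type such URLs by hand, a small helper can build them. This is only a sketch: the host, port and /admin path are assumptions taken from the examples in this post, and live_params_uri is a name I made up.

```ruby
require 'uri'

# Builds the URL for the live parameters interface.
# The base is an assumption; adjust it to where your route lives.
#
def live_params_uri params = {}, base = 'http://localhost:8080/admin'
  uri = URI.parse base
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri
end

uri = live_params_uri querying_splits_text_on: '\s'
puts uri # http://localhost:8080/admin?querying_splits_text_on=%5Cs

# With a server running, you could then send it off, for example with
#   require 'net/http'
#   Net::HTTP.get uri
# and parse the JSON hash that comes back.
```

The backslash gets escaped as %5C, so the regexp survives the trip through the query string.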

Beware

Just one thing: Be sure to not let your users have access to the live params url.

And also, be sure not to let your users have access to the live params url.

The picky-live gem

Because sending the server configuration messages per HTTP by hand is very tedious, Picky offers a much nicer interface, the picky-live gem.

gem install picky-live

Then, just enter

picky live

This will start up the Suckerfish web interface on a default port, localhost:4568, going through the default Suckerfish interface on /admin in the Picky server.

If you have customized it to be on /suckerfish and you don’t want the Suckerfish web interface on the default port, you’d type:

picky live localhost:8080/suckerfish 1234

This would start up the interface on localhost:1234.

The interface looks like this:

What you see are three configs that you are currently able to change on the fly. These are the configs for query text handling and wrangling.

If I change a config in the interface, it will tell me so (currently by changing the background color of the input):

Then, as soon as I click on the “Update server now” button, a suckerfish speeds off, accesses the child through the right URL, tells the child to update. The child will try to update itself, and if that works, tell the master to update.

In this example, the updating has failed. The child will tell me so, not tell the parent, and kill itself. (Man, this language we’re using is brutal!)

Picky needs the child to perform harakiri, since we do not know if the config is still ok.

If all goes well, the master kills the other children (since they need the updated config) and lets the one telling him to update the config live. You will get a confirmation message, and the interface will update with the current configuration.

With suckerfish, children will die.

Sorry about that. What you get in return is a comfortable way of updating the server config on the fly. And that is worth the tradeoff ;)

Performance?

I bombarded the search server with 100’000 requests, concurrency 100 using ab:

ab -n 100000 -c 100 127.0.0.1:8080/all/full?query=s

Then, I started a Suckerfish and updated the config.

Result: Not really noticeable. A short hiccup when the master reforks, but not really noticeable.

If the config update fails, since only one worker child dies, the effect is almost not noticeable.

If the update works, one worker child remains, and the others need to be forked. But Unicorn handles this exceptionally gracefully. Thanks, Unicorn! Really proud of ya. Love you. Still, the harakiri stays.

Disclaimer

Updating everything on the fly is nice. But beware: The configuration in app/application.rb will not be updated. After experimenting with Suckerfish, you still need to update the config by hand.

That’s syntax pepper.

Conclusion

So we’ve seen

that we can’t just update a config in a child (in a multiprocessing server)
how a child can communicate with its parent
how Picky does it
how the picky-live gem looks and works
how you can try it yourself
that it is fast
that it can be dangerous if you don’t know what to do

Hope you learnt something new :)

Searching with Picky: Live Parameters Part 1

2011-01-25T00:00:00+11:00

This is a post in the Picky series on its workings. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

This time I want to do a two-part post on live parameters.

What are Live Parameters?

Imagine this situation:

You are sitting at your desk. A few levels below is an array of Picky servers, contentedly humming at a bagillion requests per second…

Ok, this is actually a fantasy of mine, but bear with me.

Suddenly, your boss enters, his hair pointier than ever!

He tells you that a customer’s space bar is not working anymore and now he’d like to use the comma , character to designate where words are separated.

Of course you roll your eyes, but he doesn’t give up. The customer needs to be served, no matter what!

At this point, what would be really good to have is a way of changing Picky’s behaviour with splitting words in queries.

(Btw, the splits_text_on option, a regexp, defines how picky splits text into tokens, or words.)

And you do, but: What you have to do now is change the config, deploy, restart the whole cluster (or send Unicorn the HUP signal to have it restart), losing a fantastic amount of CPU cycles that would have been better used for searching with Picky.

This would be called changing lame parameters. Live parameters are the cool counterpart of lame parameters, the ones with hair, a sunny disposition, having that certain je-ne-sais-quoi that only surfers have.

Live parameters are parameters that can be changed hot – in the running server.

Now wouldn’t that be nice? Turns out it isn’t as easy as I thought.

How do I achieve this?

The problem is that the Unicorn master – or the master of any multiprocessing-based server – holds the original copy of the configuration. You can easily update it in a child, but if the child dies, it will be replaced with a new one which has forgotten everything.

So let’s call this thing that updates the configuration a Suckerfish. Suckerfishes – or Remoras – attach to a host (Fig. A), mostly sharks, by sucking onto them. This suckerfish (in the form of a request) would attach itself to a child, and from there open a channel, a pipe, to the master, where it could update the master config.

So after attaching itself, this fish would then whisper Picky sweet and golden nothings in its ear, causing it to update its master config.

That’s fine, but where can I try it?

Suckerfish is ready, but not release-ready yet. So you could clone picky, and call ./install in the top level directory to install all 1.3.0 gems locally.

But bear with me, for in part 2 (after the release of 1.3.0 and the picky-live gem, the “Suckerfish” gem) I’ll show how this can be done and how you can use Suckerfish as a weapon against pointy-haired bosses, or just for easy experimentation with your search parameters.

Don’t worry, it will get technical soon ;)

Searching with Picky: Data Sources

2011-01-20T00:00:00+11:00

This is a post in the Picky series on its configuration. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

What is a Data Source in Picky?

A data source is where the indexes get their data. Every index needs a data source.

The way to do this is pass the index(identifier, source) method’s source param a source instance, like so (in app/application.rb):

books_index = index :books, Sources::DB.new('SELECT id, title, author FROM books', file: 'app/db.yml')

Here we passed a database source that uses a simple select. Which database the source uses is defined in the file app/db.yml and follows the configuration structure of Active Record. You could, instead of passing in a file option, just pass in the Active Record config hash.

There are various data sources already defined beside the DB source (see below), but if the one you need is missing, writing your own is easy.

After that comes the most important part in Picky! :) No, really. Because what we are now going to do is categorize the data we got from the source.

Categorizing the data is so important because it allows Picky to make guesses as to which category a query word is in and get better feedback from the user. Say you lumped both first name and last name into a single category name: Picky would not be able to help your users find what they are looking for, since it can’t ask back specifically what they mean, like “Did you mean Florian as first name or last name?”.

It’s best if you just get started, and see for yourself. Picky is best experienced, and not told.

Back to the example: Now that we have defined a data source, it’s easy to define a category on it. If you define a title category

books_index.define_category :title

it will use whatever data came back from the database.

If your database doesn’t have nice column names, don’t worry, you have two options: Do a SELECT id, t_01 as title ... or use the from option when you define the category:

books_index.define_category :title, :from => :t_01

The from option is quite cool, as it allows you to have multiple categories on the same data! Say you wanted a similarity search in one category and none on the other:

books_index.define_category :title, :from => :t_01
books_index.define_category :similar_title, :from => :t_01, similarity: Similarity::Phonetic.new(3)

Lots of possibilities, I’m sure you’ll find more useful ones!

There’s more. You can have crazy indexes where every category has its own data source:

books_index.define_category :title, source: Sources::CSV.new(:title, :author, file: 'data/library.csv', col_sep: ',')

Now the title category takes its data from a library.csv. If you do this, be careful that all data sources use the same ids or Picky’s core mechanism won’t work.

Currently available data sources

Picky offers a few data sources: DB for databases, CSV for comma-separated files, Couch for CouchDB, and Delicious for Delicious bookmarks. Mmh.

This is how you use them. We’ve already seen the database source:

Sources::DB.new('SELECT id, title, author FROM books', file: 'app/db.yml')

Don’t hesitate to use JOINs or other SQL expressions for some extreme databasing!

Sources::CSV.new(:title, :author, :isbn, :year, :publisher, :subjects, file: 'data/books.csv')

This source assumes that your first column is the id column. It takes its data from the file given in the file option.

Sources::Couch.new(:title, :author, :isbn, url: 'http://localhost:5984/picky', keys: Sources::Couch::UUIDKeys.new)

The CouchDB source takes a url where couch DB serves its data. By default it assumes that you are using Hex Keys. But you can pass in one of Sources::Couch::HexKeys.new, Sources::Couch::UUIDKeys.new, or Sources::Couch::IntegerKeys.new in the keys option to tell Picky what keys you have. I’m afraid that currently you have to recalculate your keys in the client to get back the original keys. I am working on non-integer keys, but it takes its time. Sorry about that.

Sources::Delicious.new(:username, :password)

Delicious is the easiest source, since it comes with fixed data categories title, tags, url that you can categorize.

How do I define my own Data Source?

Defining your own source is easy. The Couch DB source for example has actually been sent in by Stanley.

This piece of code is the superclass of all sources in Picky and is there simply for illustrative purposes, so you can see what methods should be implemented: http://github.com/floere/picky/blob/master/server/lib/picky/sources/base.rb.

I recommend making your source a subclass of it, since it implements empty methods that are called by the indexer. But it actually just needs one worker method: harvest(index, category). It gets the index and the current category and should yield(id, text_data_for_id). It is called by the indexer when it needs the data.

The two other methods that are called by the indexer are connect_backend, which is called once per index/category, and take_snapshot, which is called once for each index, before harvest-ing the data. Use it to create temporary tables etc.

So if your duck subclasses Sources::Base, quacks #harvest and yields id, text_data_for_id your data source is set to go!
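As a sketch, a source that serves its data from a plain Ruby hash could look like the following. HashSource is a made-up name, the empty Sources::Base is only a stand-in so the example runs outside of a Picky server, and I pass a symbol as the category for simplicity (what the indexer actually hands over may well be a category object).

```ruby
# Stand-in so this runs without Picky loaded.
# Inside a Picky server, Sources::Base already exists.
#
module Sources
  class Base; end
end

# Hypothetical source. Expects data in the form
#   { id => { category_name => text } }
#
class HashSource < Sources::Base
  def initialize data
    @data = data
  end

  # Yields an id and the text for the given category.
  #
  def harvest index, category
    @data.each do |id, attributes|
      text = attributes[category]
      yield id, text if text
    end
  end
end

source = HashSource.new 1 => { title: 'Chunky Bacon' }, 2 => { title: 'Picky' }
source.harvest(:books, :title) { |id, text| puts "#{id}: #{text}" }
```

Everything apart from harvest is up to you.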

Simple and easy to understand, isn’t it?

Conclusion

So we’ve seen

what a data source in Picky is.
what data sources are currently available.
how you write your own.

Hope you learnt something new :)

Contributing one to Picky

If you write your own data source, please let me know!

Searching with Picky: Partial Search

2011-01-17T00:00:00+11:00

This is a post in the Picky series on its configuration. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

What is a Partial Search?

Partial searching is when the user only enters part of a query word, but the search engine still manages to find the whole word.

Example: We want to find all chunky bacon. If the search engine supports a partial search, we should be able to search for just chunky ba and chunky bacon will still be found.

Note that chunky bards will also be found, and so will chunky babes. So beware.

Usually, the character used for partial searches is the asterisk, *. So you would search for chunky ba* to have the search engine look for ba followed by anything.

In Picky

At the time of writing, Picky offers a postfix partial search, meaning that the wildcard can only stand at the end of a word: you can search for ba*, but not for *con. (Or a Partial::None partial search that just ignores the *.)

The thing you use is Partial::Substring, like this:

some_index = index :main, Sources::DB.new('SELECT id, title FROM books', file: 'app/db.yml')
some_index.define_category :title, partial: Partial::Substring.new(from: 1)

So you define a data category on the index and give it the partial option. With this option you tell Picky to use the following class for generating the index in a special way to support partial indexing and querying.

What we want in the example above is have Picky use a Partial::Substring, and have a query word match from the first position (position 1).

Example: A word like picky would match on p, pi, pic, pick and picky. If you defined from: 3, then it would only match pic, pick, picky. Setting from to 1 is indexing intensive, but will find everything.

It is super-easy to write your own partial search. See below for that. The sky is the limit, basically.

On a side-note: Picky will always search the last word of a query with a *, except if you use double quotes, like so: "chunky bac". This will really only find chunky bac, not chunky bacon.

How does Picky do this?

Picky aims to be very extensible, so what it does is very simple.

Picky uses a partial generator, like Partial::Substring which takes an exact index (more below) and returns a partial index.

An exact index in Picky is just a hash that maps words to an array of ids.

So Partial::Substring.new(from: 3) takes something like that:

{
  :picky => [1, 16, 3, 999],
  :pickle => [800, 3, 55]
}

(the index for exact matches) and transforms it into something like that:

{
  :pickle => [800, 3, 55],
  :pickl  => [800, 3, 55],
  :picky => [1, 16, 3, 999],
  :pick  => [1, 16, 3, 999, 800, 3, 55],
  :pic  => [1, 16, 3, 999, 800, 3, 55]
}

So in pic, there are both the ids from picky and the ids from pickle. If someone looks for pic, we return a mix of both ids.
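A rough sketch of that transformation, to make it concrete. This is not Picky's actual implementation: substring_partial is a hypothetical helper, and keeping words shorter than from as they are is an assumption of mine.

```ruby
# Hypothetical helper: builds a partial index from an exact index.
#
def substring_partial exact_index, from = 1
  exact_index.each_with_object({}) do |(word, ids), partial_index|
    text     = word.to_s
    shortest = [from, text.size].min # Assumption: short words stay whole.
    # Shorten the word from the end, down to `shortest` characters.
    #
    text.size.downto(shortest) do |length|
      key = text[0, length].to_sym
      (partial_index[key] ||= []).concat ids
    end
  end
end

exact   = { :picky => [1, 16, 3, 999], :pickle => [800, 3, 55] }
partial = substring_partial exact, 3

p partial[:pic]  # Ids of both picky and pickle.
p partial[:pickl] # Ids of pickle only.
```

Note that the ids are just appended, so an id that is in both words (like 3) shows up twice under the shared keys.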

How do I define my own Partial Search?

It is extremely simple. A partial search just needs to implement a generate_from(exact_index) method that returns the new partial index.

You could for example implement a partial index that has random substring matches of up to 3 characters (silly, I know :)):

class Partial::Random
  # Returns a new index that maps one random substring
  # (1 to 3 characters) of each word to that word's ids.
  #
  def generate_from exact_index
    exact_index.inject({}) do |partial_index, (word, ids)|
      start  = rand word.size
      length = rand(3) + 1
      # Slicing a symbol returns a string, so convert back.
      random_substring = word[start, length].to_sym
      partial_index[random_substring] ||= []
      partial_index[random_substring] += ids
      partial_index
    end
  end
end

This method returns a new index that might look like this:

Partial::Random.new.generate_from(:picky => [1,2,3]) # => { :ick => [1,2,3] }

Of course, the example is not very performant – but legible for you.

Finally, you’d use it for your data categories in app/application.rb like this:

some_index = index :main, Sources::DB.new('SELECT id, title FROM books', file: 'app/db.yml')
some_index.define_category :title, partial: Partial::Random.new

A better idea might be to create a substring partial that generates a partial index where the asterisk is actually at the front of the word:

{
  :picky => [1,2,3],
  :icky  => [1,2,3],
  :cky   => [1,2,3],
  :ky    => [1,2,3],
  :y     => [1,2,3]
}

This will match picky if you enter just a y!
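Such a generator could be sketched as follows. SuffixPartial is a made-up name and not part of Picky; it only implements the generate_from(exact_index) method described above.

```ruby
# Hypothetical partial generator: indexes every ending of a word,
# so the "asterisk" effectively sits at the front.
#
class SuffixPartial
  def generate_from exact_index
    exact_index.each_with_object({}) do |(word, ids), partial_index|
      text = word.to_s
      text.size.times do |start|
        key = text[start..-1].to_sym
        (partial_index[key] ||= []).concat ids
      end
    end
  end
end

index = SuffixPartial.new.generate_from :picky => [1, 2, 3]
p index.keys # All endings of picky, longest first.
```

Like Partial::Substring with from: 1, this is indexing intensive, since every word produces as many keys as it has characters.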

Picky is very flexible – do what you want however you want it.

Conclusion

So we’ve seen

what a partial search is.
how Picky does a partial search.
how a partial search is configured in Picky.
how you can write your own.

Hope you learnt something new :)

Contributing one to Picky

If you write your own, please let me know!

Searching with Picky: Character Substitution

2011-01-13T00:00:00+11:00

This is a post in the Picky series on its configuration. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

What is Character Substitution?

Character substitution in a search engine is one of the first steps in the process of sanitizing your users’ input.

Examples: ä => ae, ø => o, é => e

This is used to make the search engine indifferent to a user’s origin or way of writing.

For example, my hometown is called Zürich, with an umlaut character, ü. German users will search with an ü. However, most users of the world don’t know this character, and will simply type Zurich. So what we want is make the search engine ignore the umlaut diacritics, the two dots over the u.

How do we do this?

Usually, what search engines do is perform a sort of character substitution before putting text into the index, so Zürich will go into the index as zurich. For that, we character substituted ü => u. I also lowercased it, since that is what search engines also do, to significantly save index space.

So now we have Zurich in the index. If a user now searched for Zürich, the search engine wouldn’t find it.

So what we do is also perform this character substitution in a query, so that if the user enters an ü, it is replaced by a u, making Zurich out of Zürich.

In a nutshell, the indexing and the querying map both Zürich and Zurich to Zurich and a user will find it, regardless of whether they searched for my hometown with or without the umlaut.

How do we do this in Picky?

Picky offers two class methods in a Picky Application where you can define how characters are substituted, amongst other things:

default_indexing options = {}
default_querying options = {}

The default_ in the method name comes from the fact that whatever options are given, will be used for all indexing and querying unless overridden. So most of the time you will be configuring it there.

One of the options is substitutes_characters_with and you give it a character substituter object that has a #substitute(text) method.

Picky already includes one for west european character sets. You use it as follows:

default_indexing substitutes_characters_with: CharacterSubstituters::WestEuropean.new

I use the Ruby 1.9 hash style, key: value, for that. The rocket I use for mapping things, map '/some/path' => controller.

What the west european character substituter does is this: ä => ae, Ä => Ae, ë => e, Ë => E, ï => i, Ï => I, ö => oe, Ö => Oe, ü => ue, Ü => Ue, and 22 others. See the spec if you’d like to know more.

So a query like Hände Nüsse will be sanitized to haende nuesse before being further processed. Again also lowercasing it, since this is usually also done.

How do I define my own character substituter?

It is extremely simple. A character substituter just needs to implement a substitute(text) method that returns the substituted text.
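A minimal, unoptimized sketch of such a substituter. TinySubstituter is a made-up name and it covers only a handful of characters, so take it as an illustration of the interface and nothing more.

```ruby
# Hypothetical substituter, covering a few characters only.
#
class TinySubstituter
  SUBSTITUTIONS = {
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'é' => 'e',  'ø' => 'o'
  }.freeze
  PATTERN = Regexp.union(SUBSTITUTIONS.keys).freeze

  # Returns the text with all known characters replaced.
  #
  def substitute text
    text.gsub PATTERN, SUBSTITUTIONS
  end
end

puts TinySubstituter.new.substitute('Hände Nüsse') # Haende Nuesse
```

You'd then pass an instance of it in the substitutes_characters_with option, the same way as the west european one.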

See the source of the west european substituter if you want to see how I did it.

Why is it so illegibly written?

It is heavily optimized. Since this method will be called for all indexed data, and for each query, it should be performant.

The west european spec includes two performance specs for that:

describe "speed" do
  it "is fast" do
    result = performance_of { @substituter.substitute('ä') }
    result.should < 0.00009
  end
  it "is fast" do
    result = performance_of { @substituter.substitute('abcdefghijklmnopqrstuvwxyz1234567890') }
    result.should < 0.00015
  end
end

The method performance_of is used in Picky quite often to maintain performance and notify me should anything get slower. It looks like this:

def performance_of &block
  GC.disable
  result = Benchmark.realtime &block
  GC.enable
  result
end

Conclusion

So we’ve seen

that most search engines need a character substituter.
that character substituters help your international users find things.
how they are configured in Picky.
how you can write your own.

Hope you learnt something new :)

Contributing one to Picky

If you write your own, please let me know!

Speccing methods called in initialize

2010-10-27T00:00:00+11:00

Recently when writing Picky, the clever small text search engine, I encountered the following problem: How do I test methods that are called in an initializer?

(Of course I could call Testee.new in the spec and then just call the method again. But what if that method sets a state?)

In code:

Why open sourcing security critical software is important

2010-10-06T00:00:00+11:00

Why open sourcing security critical software is important

Profiling MySQL Queries

2010-09-27T00:00:00+10:00

Profiling MySQL Queries

In-detail performance measurements for MySQL queries.

Drawing in the browser

2010-05-22T00:00:00+10:00

Drawing in the Browser … using HTML5.

Most important Ruby method 2010

2010-05-21T00:00:00+10:00

It’s squeeze!

Programming Applied Mathematics

2010-05-13T00:00:00+10:00

Programming is one of the most difficult branches of applied mathematics / the poorer mathematicians had better remain pure mathematicians.

– Edsger W. Dijkstra

(via fuckyeahcomputerscience)

New programming jargon

2010-05-11T00:00:00+10:00

New Programming Jargon

Need to remember Bugfoot and Shrug Report… And especially Duck!

Fat slows you down

2010-05-09T00:00:00+10:00

Fat slows you down.

If you really need speed in Ruby 1.9, consider this example:

You already knew that, right? (Assigning with splats)

2010-04-30T00:00:00+10:00

Referring to the fact that I want to sleep with the splat operator…

97% is as good as a 100%

2010-04-30T00:00:00+10:00

If you're in a hurry and you need to pack up your bags and go, 97% is as good as a 100%. The 100% mark does not have the same (show-stopping) magic as 0%, where the difference between 3% and 0% really is important.

– Omit Needless Code

elements.each(*p)

2010-04-30T00:00:00+10:00

I often use

ary.map(&:upcase)

instead of

ary.map { |a| a.upcase }

But what can I do to use the elements as param as in the following code?

ary.each { |a| p a }
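One option that comes to mind (a sketch, there may well be nicer ones): get hold of the method as an object and hand it over with &, which calls to_proc on it.

```ruby
ary = %w[a b c]

# Method#to_proc turns the method object into a block,
# so each element is passed to p as its argument.
#
ary.each(&method(:p))

# The same works with any method object, for example Array#<<.
#
collected = []
ary.each(&collected.method(:<<))
```

Not quite as short as &:upcase, but it saves naming the block parameter.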

Strategy pattern pattern pattern pattern

2010-04-29T00:00:00+10:00

A pattern that I often see cropping up in my game framework.

It can be used for configuring subclasses that act according to an order of calls defined in the superclass. How the calls exactly work can be defined in the subclasses (or in an external configuration) using the class methods.
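Roughly, and with all names invented for this example (Unit, Rocket, moves_with are not from the framework), the shape could look like this sketch:

```ruby
class Unit
  class << self
    # Class method used by subclasses to configure a step.
    #
    def moves_with strategy
      @move_strategy = strategy
    end
    def move_strategy
      @move_strategy || ->(unit) { "#{unit.name} stays put" }
    end
  end

  attr_reader :name
  def initialize name
    @name = name
  end

  # The superclass defines the order of the calls...
  #
  def update
    [self.class.move_strategy.call(self), draw]
  end
  def draw
    "#{name} is drawn"
  end
end

# ...and the subclass configures how a call exactly works.
#
class Rocket < Unit
  moves_with ->(unit) { "#{unit.name} flies straight" }
end

p Rocket.new('rocket').update
p Unit.new('rock').update
```

The strategy lives in a class-level instance variable, so each subclass carries its own configuration and unconfigured classes fall back to a default.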

Mastery is a mindset

2010-04-26T00:00:00+10:00

Mastery is a mindset.

From the book “Drive”, by Pink.

I’d reformulate it as: “Mastery is neither a question of time nor of experience, but a mindset.”

Ruby 1.9 params

2010-04-20T00:00:00+10:00

Riddle

2010-04-18T00:00:00+10:00

3735928559

Why is this number unappealing to vegetarians?

A hole in the wall

2010-04-18T00:00:00+10:00

His stool leaned back at a dangerous angle, he displays a pair of jamaica-colored sneakers to the public. Them sticking out of his business hole seems rather odd, considering the sober surroundings of the Niederdorf, or “nether village”, as this particular place in Zürich is called.

Slurping a botanic tea, idly facebooking and tumbling through the depths, no, shallows of the net, waiting for customers. It’s been that way now for more than a day, and he starts to wonder if the customer specific context ads are just a fluke.

An abrupt “Oh hey” directed his way throws him out of the structural code improvements that have been waiting for him at the back of his mind. “Hey”, a burly businessman with slightly high blood pressure – he surmises from the corona of hair still clinging on – asks: “Are you the man that types?”, “Yes, yes I do, I code.” “Oh, code. Yeah, sorry, my bad. Well, look, I need a small program that does a few calculations based on this.”

And he whips out a napkin with a few calculations on it, in black lines what looks to be from an eyeliner, or a piece of coal. “Don’t mind the looks – how long do you think this takes?”

“Hmm, well. I think the design might take me a few hours. Then we’d need to meet again to see if we’re on the right track. Then I’ll have to code it, and clean it up a little. Might take me another 2 hours.”

“For 90 an hour, right?” “That’s right, as advertised.” “Ok, well. See you in three.”

He rights his stool, leans forward, sketches boxes and lines, boxes and lines, lines and boxes. Then he goes for a quick walk, takes in the morning, letting the cogs turn. Half an hour of showing tourists the view, and a hot chocolate at the riverside. Finally, he plumps down in front of his sleek, metal-clad machine and types.

What he did was transform the mascara lines into byroliner lines and boxes as a straw where the mind can cling on to, and from there to typed text on a luminescent screen, for him to read and others to understand, finally into the core of the machine, and the zeros and ones people who have no understanding regurgitate so often.

Entering the formula was pretty straightforward. But there are other things to consider: What is the best user interface for a burly businessman? Will it be used repeatedly? As if on cue, burly biz arrives and asks “Done yet?” “Oh hi.”

Back and forth: The customer starts with a lot of questions, have you put this in? He cuts him short, and explains what he will see, his understanding of the formula. There is much going on, but boils down to this: The clearing of misunderstandings. And they get cleared. It must be his happy day, the businessman knows the power of an ad-hoc team, and how it should work, how progress can come from it.

The discussion dies down, lots of nodding all around, and smiles emerge. A handshake, and both are off – shorty no doubt to a meeting, where money and hand sweat is moved, our coder off to the plane of lines and boxes. A prototype stands, but this is not where it ends. He wants it to be perfect. After all, he is a craftsman, and craft is what defines him. The table might look nice to an outsider, but the craft is inside: The distribution of weight, the structure of the wood: What holds the thing together and doesn’t make it bend, for year after year.

Before he cleans up, however, there is yoga waiting for him, and another stroll on the riverside. Can it be improved? How? The response comes to him during the most innocent of activities, stroking a cat that has found, purring, a new home around his legs. He leaves the cat slightly shocked behind – but she improves the situation by licking her paw – and runs up the street, repeating and repeating the idea, urging it not to leave his head.

Panting, he types it in. The tests run, the code checker tool gives him a green light. He opens it, it works. Puts it on a stick, wraps it in a package, puts it into a nice box which brandishes his logo – doodled on the back of a napkin by his sister, three years ago – and puts it aside for the customer, due to arrive in an hour.

And finally. Finally the sneakers rest again on the sill of the hole.

Oh yeah, Amazon?

2010-04-14T00:00:00+10:00

From the latest Newsletter:

Support for Session Stickiness in Elastic Load Balancing
Amazon Elastic MapReduce Introduces Custom Cluster Configuration Option

They also have Gurble Blurble Fickleness, introducing Jambawambing Lordle Figuconation Schnorptions.

At least that’s what I hear when I read stuff like that.

IE didn't get the CSS3 memo?

2010-04-13T00:00:00+10:00

IE didn’t get the CSS3 memo?

Or, as is my guess: The code they based the new browsers on was fully untested, totally disorganized, and thus brutally hard to extend. IE9 though, gives one hope.

Challenged

2010-04-12T00:00:00+10:00

The framework looms in front of you. Clouds cover the gray sky. You plunge in. Full unit test rewrite, nothing is where it was before, but right: The mailbox is in front of the house, the bathtub is finally in the bath, the fridge contains organic food. There is a pot on the fire, full of juicy stuff.

But you are wearing glasses that let you only see 10 centimeters. You set wild eyes on the integration tests: Guests are entering the house, trying to eat from the toilet, sleeping in the oven, or jumping out of windows. It is fail, fail, fail, wherever you happen to look.

You are close to despair. Everything is right. Right? You trudge on, teeth gnashing.

Then, somehow, you adjust the doormat ever so slightly, piece in the last crumb of information. And magically, it just works. Everything. Just. Works. The gargantuan task is finished. For minutes, you revel in the sun’s rays. The clouds, they never reappear.

It is done.

Stuttering Proc

2010-04-07T00:00:00+10:00

Reloading a running Ruby application

2010-04-05T00:00:00+10:00

Here’s how I do it:

Javuby?

2010-04-05T00:00:00+10:00