CocoaPods Search Design


You have probably heard of CocoaPods, a dependency manager for Objective-C libraries. The project was initiated by Eloy Durán.

Let me tell you it’s good stuff!

Intro

This post is about designing a search engine for CocoaPods. I’m using Picky for it, with moderate modifications.

Chances are you know RubyGems. CocoaPods uses a slightly different approach, one I personally find very elegant: after creating a podspec (similar to a gemspec), you ask for it to be included in the central repository via a pull request. If it is accepted, you get commit rights from then on and can push other pods yourself.

Since I think the rubygems search is too slow, and not very impressive, I tried to make the CocoaPods search an example of how such a search should be designed. Try it! :)

(Note: I’m not just criticizing, but also putting code where my mouth is regarding the rubygems search – try my alternative take on it and read about it here)

Many ideas for the CocoaPods search come from the old gem search alternative, but a few features are new, compiled in the…

Highlights

Automagic index updates via GitHub post-receive hooks

The challenge was to have Picky automatically update the search index without restarting, and without polling.

The fact that the CocoaPods specs live in their own repository is fantastic – it means that we have the full power of GitHub’s repo features at our disposal.

The feature we use is post-receive hooks. Every time someone pushes a new spec, or updates an existing one, the search engine's Sinatra app is notified via a garbled URL, as follows:

post "/my_example_hook_url/#{ENV['GARBLED_HOOK_PATH']}" do
  # index updating code here
end

Every time this URL is called, Picky downloads the zip file from GitHub, unzips it, and indexes the loaded specs. All while running. That’s it.

Hold on, you say, why don’t you just do a git pull? I wish I could. But Heroku currently doesn’t allow git pull, tar, or gunzip, so the search engine always downloads the zip file.

Making composite names searchable

Pod names do not use spaces but are camel-cased, e.g. “BlocksKit”. Like most search engines, Picky would index this as one word.

Another issue with pod names is that authors sometimes prepend their initials to them. So, for example, “Mocky” would actually be called “LRMocky”.

However, getting back to the “BlocksKit” example, we want people to be able to find it when they type blocks kit, or just kit.

In Picky lingo: If the data contains "BlocksKit", how do we index it as "BlocksKit Blocks Kit"?

Turns out there is a snazzy Ruby regexp for that:

"BlocksKit".split(/([A-Z]?[a-z]+)/) # => ["", "Blocks", "", "Kit"]

Nice, eh? As a bonus, it works fine with numbers :)

The Pod model offers a prepared_name method, using the above split, returning "BlocksKit Blocks Kit", which Picky uses for the name category and consequently indexes all three words.

category :name,
         similarity: Similarity::DoubleMetaphone.new(2),
         partial: Partial::Substring.new(from: 1),
         qualifiers: [:name, :pod],
         from: :prepared_name # <= :from indicates which (data) method to call in the source object
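To make the prepared_name idea a bit more concrete, here is a minimal sketch of how such a method could look. The Pod class below is a hypothetical stand-in that only holds a name (the real model surely does more), and dropping empty strings and repeated parts is my assumption about the cleanup, not necessarily what the actual code does:

```ruby
# Hypothetical stand-in for the Pod model; only the name handling is sketched.
class Pod
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Returns the full name followed by its camel-case parts,
  # so that each part ends up in the index as its own word.
  def prepared_name
    # The capture group makes split keep the matched parts;
    # the empty strings in between are dropped.
    parts = name.split(/([A-Z]?[a-z]+)/).reject(&:empty?)
    ([name] + parts).uniq.join(' ')
  end
end

Pod.new("BlocksKit").prepared_name # => "BlocksKit Blocks Kit"
Pod.new("LRMocky").prepared_name   # => "LRMocky LR Mocky"
```

Note how the prepended initials in "LRMocky" are split off as well, so a search for mocky alone still finds the pod.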

Try it with dynamic delegate! :)

Filtering by OS

This is a more advanced Picky trick, which might only be interesting to pros.

Like Ruby gems, pods can run on multiple OSs: on iOS and/or on OS X.

We always want to filter by one of three options: both (AND), iOS only, or OS X only. This means we always prepend the platform filter to the query like so: "on:some_platform rest of the query".

This is problematic since it uses a lot of input field space, and also confuses the user.

We would rather not show the OS in the search field, but use the value from the iOS-style radio buttons instead.

Picky helps us by offering multiple JS callbacks. If you copy a search link like http://cocoapods.org/?q=on:osx%20Kiwi into the URL bar, Picky runs a few JS callbacks, in the following order:

  1. beforeInsert(query) // Before inserting the query into the search field.
  2. before(query, params) // Before sending the query back to the server.
  3. after(data, query) // After receiving the query back, before rendering.
  4. success(data, query) // After the view/results have been updated.

(data is the JS PickyData object)

We need both beforeInsert and before.

In beforeInsert, we remove the OS part before the query is inserted into the search field. In before, just before the query is sent to the backend, we add the OS back into the query, taken from the radio button value.

In code (the Picky JS search client options), it looks like this:

// Before a query is inserted into the search field
// we clean it of any platform terms.
//
beforeInsert: function(query) {
  return query.replace(platformRemoverRegexp, '');
}

The regexp to remove the platform search term looks like this:

// Matches one or more consecutive "platform:xyz" or "on:xyz" terms.
var platformRemoverRegexp = /((?:platform|on):\w+\s?)+/;

Before sending the search request to the backend, Picky calls the before callback. There we again remove any OS parts and prepend the selected one (the iOS-style radio buttons have the values "on:ios on:osx", "on:ios", and "on:osx").

before: function(query, params) {
  query = query.replace(platformRemoverRegexp, ''); // Clean the query.
  var platformModifier = platformSelect.find("input:checked").val(); // Get the selected OS.
  return platformModifier + ' ' + query; // Prepend it to the query.
}

However, the complete query, including the OS, is still inserted into the URL, ready for you to copy and send to friends.

5 lines of nicely customizable code :)

Removing duplicates from results

This is another more advanced Picky trick, which might only be interesting to pros.

I often get asked how to remove duplicates from search results.

Why are there duplicates in Picky’s search results anyway?

Picky returns categorized search results. For example, it might deem the combination of categories "first_name", "last_name" more important, and list its results before those found in the categories "street", "last_name". But this also means that the same entry can be contained in both combinations of categories!

Many Picky users just use results.ids to extract a list of ids. To get the list of ids, Picky goes through the results in each combination of categories and extracts the ids. This means that Picky may well return [1,3,1,2,3], with results 1 and 3 occurring twice.
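To see where the duplicates come from, here is a tiny sketch in plain Ruby. The nested arrays are made-up data standing in for the ids of each combination of categories; they are not Picky's actual result classes:

```ruby
# Hypothetical data: each inner array stands for the ids found in one
# combination of categories (an allocation), most important first.
allocations = [
  [1, 3],    # e.g. found via first_name, last_name
  [1, 2, 3]  # e.g. found via street, last_name
]

# Collecting the ids allocation by allocation keeps the duplicates.
all_ids = allocations.flatten # => [1, 3, 1, 2, 3]

# Keeping only the first occurrence of each id preserves the ranking.
unique_ids = all_ids.uniq # => [1, 3, 2]
```

The unique list keeps each id at the position of its most important allocation, which is the behavior we want for an uncategorized result list.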

Since cocoapods.org only wants to show an uncategorized list of result pods, we wish to remove duplicates to not confuse searchers.

We achieve this by using Picky’s JS success callback. This goes through all combinations of categories (aka allocations) and removes entries from the allocations if we’ve already seen them previously. It ensures we only see unique results.

// We filter duplicate ids here.
// (Not in the server as it might be
// used for APIs etc.)
//
success: function(data, query) {
  var seen = {};
  
  var allocations = data.allocations;
  allocations.each(function(i, allocation) {
    var ids     = allocation.ids;
    var entries = allocation.entries;
    var remove = [];
    
    ids.each(function(j, id) {
      if (seen[id]) {
        data.total -= 1;
        remove.push(j);
      } else {
        seen[id] = true;
      }
    });
    
    for(var l = remove.length-1; 0 <= l; l--) {
      entries.splice(remove[l], 1);
    }
    
    allocation.entries = entries;
  });
  
  return data;
}

We could well do this in the server, but I opted against it, because a possible future search API might want to expose the duplicate results. This is why we do it in the client.

Other fun things to try!

Feedback

We’re very glad for feedback – shoot us a line at http://twitter.com/CocoaPodsOrg, or at http://twitter.com/picky_rb. Thanks!

Thanks also to the CocoaPods team for a great project!
