Don't be deceived by the cute appearance! Picky is a tough little bastard.
Code-wise, Picky is tested by around 900 unit and integration specs. The code base is extremely small (~4000 LOC), making it easy to maintain.
Picky is written in Ruby, and in the case described here it ran under Unicorn. We bombarded three servers, each running Unicorn with 8 worker children, using ab.
After the Unicorn request buffers filled up, Picky predictably went down, but it recovered like fresh spring dew once the torrent was over.
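For reference, such a bombardment can be reproduced with ab roughly like this (host, port, query, and numbers are made up):

    ab -n 10000 -c 50 'http://picky1.internal:8080/search?query=test'

Here -n is the total number of requests and -c the concurrency.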
In the project mentioned, Picky has now been running without fail for more than a year, and has proven to be completely maintenance-free.
Picky scales very well with your data.
We recommend setting up a shared /index directory: one Picky instance does the indexing, and the others reload the indexes after a restart.
Then add the servers to your proxy of choice: nginx, Varnish, etc.
We've added two servers to the first one like this, without problems. We just needed to add the new servers to the Varnish front end's round robin list.
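With nginx, such a round robin setup can be sketched like this (hostnames and the port are assumptions, not from the project; nginx distributes requests across upstream servers round robin by default):

    # inside nginx's http block
    upstream picky_servers {
      server picky1.internal:8080;
      server picky2.internal:8080;
      server picky3.internal:8080;
    }

    server {
      listen 80;
      location /search {
        proxy_pass http://picky_servers;
      }
    }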
Picky is a combinatorial search engine. Meaning: All the data needs to be available for combining.
Admittedly, if your data grows exponentially, we do not recommend using Picky.
If your data grows slowly (linearly), we do recommend using Picky.
Most people think – since it is written in Ruby – it must be slow.
That attitude usually changes after seeing it in production or trying the example in Getting Started.
Lots of consideration went into writing performant Ruby code. For example, each request has a very small memory footprint.
Around 100 specs guard against performance regressions. Critical parts have been rewritten in C, giving Picky the edge it needs.
From the ground up, Picky is designed to be very flexible.
Search requests can be routed to any combination of indexes. If you need a customized tokenizer or the like, you can easily implement one.
It also supports quite a few data sources, and new ones can easily be added.
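To illustrate the routing idea, here is a minimal sketch based on the DSL in Picky's README (the Media struct and the data are made up; check the calls against the Picky version you use):

    require 'picky'

    Media = Struct.new :id, :title, :author

    # Two independent indexes. Indexed objects just need an #id plus a
    # method per category; batch indexing via a source + rake index works too.
    books = Picky::Index.new :books do
      category :title
      category :author
    end
    dvds = Picky::Index.new :dvds do
      category :title
      category :author
    end

    books.add Media.new(1, 'Gone with the Wind', 'Margaret Mitchell')
    dvds.add  Media.new(2, 'Casablanca', 'Michael Curtiz')

    # One search interface, routed over any combination of indexes.
    media = Picky::Search.new books, dvds
    results = media.search 'casablanca'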
Picky was used in a telephone search engine where one could search by address, phone number, name, organization, and more.
Around 150 million data points; a data point is a phone number, a name, or an address.
Imagine a table with 10 million records, with 15 varchar fields each.
Almost all data categories (exceptions: zip code and, partially, phone numbers) used a full partial search. Five categories were also indexed using a moderately configured phonetic similarity index.
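Declaring such categories might look roughly like this: a sketch using the partial and similarity options from Picky's documentation, with made-up index name, data, and parameter values:

    require 'picky'

    Entry = Struct.new :id, :name, :zipcode

    phonebook = Picky::Index.new :phonebook do
      source { [Entry.new(1, 'Meier', '8000')] }
      # Full partial search: index every substring, from position 1 on.
      category :name,
               partial:    Picky::Partial::Substring.new(from: 1),
               # A moderately configured phonetic similarity.
               similarity: Picky::Similarity::DoubleMetaphone.new(3)
      # Zip codes were the exception: no partial matching.
      category :zipcode,
               partial: Picky::Partial::None.new
    end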
Memory usage was a bit higher than expected, since phone numbers are unique and therefore need space. Also, we used a lot of partial indexes, which greatly increased the memory needs. The indexes on each server needed 10 GB.
With more complicated than average queries, Picky on 3 virtual servers, each running 8 processes on 2 cores (6 cores in total), peaked at 120 requests per second.
With relatively simple queries, Picky peaked at about 500-1000 requests per second (around 1.3 billion requests per month).
We noticed high variability in response times, probably due to the combinatorial nature of Picky: from 0.00001 seconds up to 2 seconds in extremely hard and rare cases. Because of this, using Unicorn paid off very much.
We had a rather complicated join to combine the data for use in Picky, and a lot of the data needed to be cleaned and prepared.
Using a single server with 2 processors, we needed a bit more than an hour to prepare the indexes for Picky.
We were not too happy about the speed, but since the indexing process ran every night, it at least did not disturb normal operations.
We also got it almost twice as fast by letting Picky use all the available processors.
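Since a Picky source is just something enumerable, the join and the cleanup can live in plain Ruby and SQL. A sketch of the idea with made-up model and column names, not the actual project code:

    require 'picky'

    # Assumes an ActiveRecord model Person; find_by_sql is standard
    # ActiveRecord. The join and cleanup happen inside the source block.
    people = Picky::Index.new :people do
      source do
        rows = Person.find_by_sql(
          'SELECT people.id, people.name, addresses.street ' \
          'FROM people JOIN addresses ON addresses.person_id = people.id'
        )
        # Cleaning/preparation step before indexing.
        rows.each { |p| p.name = p.name.to_s.strip }
      end
      category :name
      category :street
    end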
Picky is relatively memory hungry. This is in part due to its non-specific index (usable with any data).
Indexing performance could be better when using lots of partial indexes. It cannot (yet) be used if indexes need to be up to date instantly.
On the plus side, it is extremely fast and stable.
It offers a unique user interface that people were quick to learn and use.
Also, we noted that requests by our management were very easy to implement, even when they seemed very hard to build in at first sight. This is due to Picky's very modular nature.
(In the words of Roger Braun, of the Japanese-German dictionary project this use case is about.)
Picky is used to find dictionary entries in WaDokuJT, the largest Japanese-German dictionary.
There are around 250K entries in the WaDokuJT file, with around 5 main fields.
The file is only 60 MB in size, but the field for the German version of the entries has a lot of internal structure that would otherwise often be modeled with database relations. These relations carry a lot of semantic information and have to be remodeled in the indexing step.
All categories are indexed with full partial search. Also, several virtual fields exist that are created at indexing time, like the romaji field, which is generated directly from the Japanese characters.
In the future it is planned to add more virtual fields, like headwords, place names, etc. Picky makes this very easy, as such virtual fields are written in standard Ruby code.
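A virtual field is simply a Ruby method on the indexed object, which Picky calls like any other category at indexing time. A minimal sketch, with a placeholder transliteration table standing in for a real kana/kanji-to-romaji conversion:

    require 'picky'

    # Placeholder transliteration -- a real implementation would map
    # kana/kanji to romaji via a proper lookup table or gem.
    ROMAJI = { '猫' => 'neko' }

    Entry = Struct.new :id, :japanese, :german do
      # Virtual field: computed at indexing time, not stored in the file.
      def romaji
        ROMAJI.fetch japanese, ''
      end
    end

    wadoku = Picky::Index.new :wadoku do
      category :japanese
      category :german
      category :romaji
    end

    wadoku.add Entry.new(1, '猫', 'Katze')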
Queries are often just one-word searches and not very complex. Picky can usually serve such a request in under a millisecond. A LIKE '%…%'-based SQLite search on the same data took around 2-3 seconds.
Indexing is very fast, too. The server is an Xserve3,1 with a Quad-Core Xeon at 2.26 GHz.
    edv@rokuhara:~/picky_speed_test$ time bundle exec rake index
    Loaded picky with environment 'development' in /Users/[…]peed_test on Ruby 1.9.2.
    Application loaded.
    […]
    real    8m39.234s
    user    11m46.977s
    sys     1m11.744s
Using Picky is one of the things that makes wadoku.eu good and easy to use. Having just one search field instead of the usual "advanced search" is a pleasure, and we expect it to be a great advantage. This still has to be tested by our users, though.
Having your search completely separated from your database design is a huge relief and makes it easy to change and optimize both search and database functions separately.
Picky can also serve as a lightweight search API for third party services that want to use our data without any additional work on our part.