Searching with Picky: Data Sources

series / ruby / picky

This is a post in the Picky series on its configuration. If you haven’t tried it yet, do so in the Getting Started section. It’s quick and painless :)

What is a Data Source in Picky?

A data source is where the indexes get their data. Every index needs a data source.

The way to do this is pass the index(identifier, source) method’s source param a source instance, like so (in app/application.rb):

books_index = index :books, Sources::DB.new('SELECT id, title, author FROM books', file: 'app/db.yml')
Here we passed a database source that uses a simple select. Which database the source uses is defined in the file app/db.yml and follows the configuration structure of Active Record. You could, instead of passing in a file option, just pass in the Active Record config hash.

There are various data sources already defined beside the DB source (see below), but if the one you need is missing, writing your own is easy.

After that comes the most important part in Picky! :) No, really. Because what we are now going to do is categorize the data we got from the source.

Categorizing the data is so important, because it allows Picky to make guesses as to which category a query word is in and get better feedback from the user. Say, if you categorized both first name and last name in the category name, Picky would not be able to help your users find what you are looking for, since it can’t ask back specifically what you mean, like “Did you mean Florian as first name or last name?”.

It’s best if you just get started, and see for yourself. Picky is best experienced, and not told.

Back to the example: Now that we have defined a data source, it’s easy to define a category on it. If you define a title category

books_index.define_category :title
it will use whatever data came back from the database.

If your database doesn’t have nice column names, don’t worry, you have two options: Do a SELECT id, t_01 as title ... or use the from option when you define the category:

books_index.define_category :title, :from => :t_01

The from option is quite cool, as it allows you to have multiple categories on the same data! Say you wanted a similarity search in one category and none on the other:

books_index.define_category :title, :from => :t_01
books_index.define_category :similar_title, :from => :t_01, similarity: Similarity::Phonetic.new(3)
Lots of possibilities, I’m sure you’ll find more useful ones!

There’s more. You can have crazy indexes where every category has its own data source:

books_index.define_category :title, source: Sources::CSV.new(:title, :author, file: 'data/library.csv', col_sep: ',')
Now the title category takes its data from a library.csv. If you do this, be careful that all data sources use the same ids or Picky’s core mechanism won’t work.

Currently available data sources

Picky offers a few data sources, DB for databases, CSV for comma-separated files, Couch for couch DB, and Delicious, for delicious bookmarks. Mmh.

This is how you use them. We’ve already seen the database source:

Sources::DB.new('SELECT id, title, author FROM books', file: 'app/db.yml')
Don’t hesitate to use JOINs or other SQL expressions for some extreme databasing!
Sources::CSV.new(:title, :author, :isbn, :year, :publisher, :subjects, file: 'data/books.csv')
This source assumes that your first column is the id column. It takes its data from the file given in the file option.
Sources::Couch.new(:title, :author, :isbn, url: 'http://localhost:5984/picky', keys: Sources::Couch::UUIDKeys.new)
The CouchDB source takes a url where couch DB serves its data. By default it assumes that you are using Hex Keys. But you can pass in one of Sources::Couch::HexKeys.new, Sources::Couch::UUIDKeys.new, or Sources::Couch::IntegerKeys.new in the keys option to tell Picky what keys you have. I’m afraid that currently you have to recalculate your keys in the client to get back the original keys. I am working on non-integer keys, but it takes its time. Sorry about that.
Sources::Delicious.new(:username, :password)
Delicious is the easiest source, since it comes with fixed data categories title, tags, url that you can categorize.

How do I define my own Data Source?

Defining your own source is easy. The Couch DB source for example has actually been sent in by Stanley.

This piece of code is the superclass of all sources in Picky and is there simply for illustrative purposes, so you can see what methods should be implemented: http://github.com/floere/picky/blob/master/server/lib/picky/sources/base.rb.

I recommend to make your source also its subclass, since it implements empty methods that are called by the indexer. But it actually just needs one worker method. This one: harvest(index, category) It gets the index and the current category and should yield(id, text_data_for_id). It is called by the indexer when it needs the data.

The two other methods that are called by the indexer are connect_backend, which is called once per index/category, and take_snapshot, which is called once for each index, before harvest-ing the data. Use it to create temporary tables etc.

So if your duck subclasses Sources::Base, quacks #harvest and yields id, text_data_for_id your data source is set to go!

Simple and easy to understand, isn’t it?

Conclusion

So we’ve seen

  1. what a data source in Picky is.
  2. what data sources are currently available.
  3. how you write your own.

Hope you learnt something new :)

Contributing one to Picky

If you write your own data source, please let me know!

Next Searching with Picky: Live Parameters Part 1

Share


Previous

Comments?