Hi,
Is anyone working on a Rails plugin for Elasticsearch? I am starting to
work on one as part of migrating the search backend at my company
(vLex.com) to Elasticsearch, and would be glad to share the effort.
The goal of the project is to provide easy Elasticsearch support to
existing Rails applications, exposing as much of Elasticsearch as
possible and with a basic design that enables parallelization of
indexing tasks.
Let me share some of the design ideas.
Indexing:
Creating an index will support two modes. In a "local" mode, the
indexing rake task will iterate over all the records, create the full
JSON document for each one of them, and submit it to Elasticsearch.
In the "distributed" mode the indexing jobs will be managed by a job
queue backend. Initially support will be provided for Resque
(http://github.com/defunkt/resque, a Redis-backed job queue), but the
plan is to make it pluggable from day one and add support for Amazon
SQS soon.
In this mode, when a new index is created, record ids will be added to
the queue, and Resque agents will perform the job of receiving the
record ids, creating the full JSON document, and submitting it to
Elasticsearch for indexing. The initial task of retrieving record ids
should also be distributed. Since you can launch as many agents on as
many machines as you need, this lets you scale the Rails part of
indexing.
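The distributed mode described above could be sketched roughly like
this. It is only an illustration: RECORDS, INDEXED, QUEUE, enqueue and
IndexRecordJob are hypothetical names, and an in-memory array stands in
for Resque so the example runs without Rails, Redis or Elasticsearch.
The job class follows the usual Resque shape (a queue name plus a
class-level perform method).

```ruby
require "json"

# Hypothetical stand-ins for the database and for the Elasticsearch index.
RECORDS = {
  1 => { "id" => 1, "title" => "First post" },
  2 => { "id" => 2, "title" => "Second post" },
  3 => { "id" => 3, "title" => "Third post" }
}
INDEXED = {}

# Shaped like a Resque job: a queue name plus a class-level perform method.
class IndexRecordJob
  @queue = :batch

  def self.perform(index_name, record_id)
    document = JSON.generate(RECORDS.fetch(record_id))  # build the full JSON document
    (INDEXED[index_name] ||= {})[record_id] = document  # "submit" it for indexing
  end
end

# In-memory queue; with Resque this would be Resque.enqueue(IndexRecordJob, ...).
QUEUE = []
def enqueue(job_class, *args)
  QUEUE << [job_class, args]
end

# The rake task only pushes ids, which is cheap...
RECORDS.keys.each { |id| enqueue(IndexRecordJob, "posts", id) }

# ...and any number of agents drain the queue and do the expensive work.
until QUEUE.empty?
  job_class, args = QUEUE.shift
  job_class.perform(*args)
end
```

Because each job carries only an index name and a record id, agents on
different machines can work on the same queue without coordinating.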
In our initial tests the distributed mode indexed 20x faster than the
local mode, limited only by the source database's I/O (Elasticsearch
did great).
Indexes will be versioned, so if you create a new index named "posts",
what will actually happen is that an index called "posts_$TIMESTAMP"
will be created, and when the indexing job has finished an alias named
"posts" will be pointed at it. This lets you create new index versions
(maybe with new index or mapping options) without disrupting the
users.
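As a rough sketch of the versioning idea: the helper names
(versioned_index_name, alias_swap_body) and the timestamp format are my
own assumptions, not part of any existing plugin. The generated body
follows the "actions" format of Elasticsearch's index aliases endpoint
(POST /_aliases); sending it over HTTP is left out.

```ruby
require "json"

# Hypothetical helper: physical index name for one build of a logical index.
def versioned_index_name(base, time = Time.now.utc)
  "#{base}_#{time.strftime('%Y%m%d%H%M%S')}"
end

# Builds the request body that moves the alias from the old build to the new one.
# old_index is optional because the very first build has nothing to remove.
def alias_swap_body(alias_name, new_index, old_index = nil)
  actions = []
  actions << { "remove" => { "index" => old_index, "alias" => alias_name } } if old_index
  actions << { "add" => { "index" => new_index, "alias" => alias_name } }
  JSON.generate({ "actions" => actions })
end

old_index = versioned_index_name("posts", Time.utc(2010, 5, 1, 12, 0, 0))
new_index = versioned_index_name("posts", Time.utc(2010, 6, 1, 12, 0, 0))
puts alias_swap_body("posts", new_index, old_index)
```

The application always searches "posts", so it never needs to know
which timestamped index is currently behind the alias.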
Simplicity:
When working in local mode, adding basic Elasticsearch support to a
model (without any custom index or mapping options) should be a matter
of adding one line to your model.
When working in distributed mode, the only additional task will be to
set up Resque (or your chosen queue backend) and launch the agents.
Configuration:
The philosophy is to expose as much of Elasticsearch as possible. I do
not want to create a DSL syntax to define index options, mapping
options, etc. Instead of that, index and mapping options will be
stored in JSON or YAML in Elasticsearch's native format.
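To illustrate what "no DSL" could mean in practice, here is a minimal
sketch. The file name and the particular settings and mapping keys are
assumptions chosen for illustration (the exact keys depend on the
Elasticsearch version). The point is only that the YAML is converted to
JSON and passed through unchanged.

```ruby
require "yaml"
require "json"

# Hypothetical contents of a per-model file such as config/elasticsearch/posts.yml.
# The keys are meant to be Elasticsearch's own, not a vocabulary invented by the plugin.
config_yaml = <<~YAML
  settings:
    number_of_shards: 3
    number_of_replicas: 1
  mappings:
    post:
      properties:
        title:
          type: string
        published_at:
          type: date
YAML

config = YAML.safe_load(config_yaml)

# Passed through untouched as the body of the create-index request.
create_index_body = JSON.generate(config)
puts create_index_body
```

Anything Elasticsearch accepts can be written in the file directly,
without waiting for the plugin to support it.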
If you want to add associations or custom methods to the indexed
content, you will do so by defining an #indexed_json_document method
that returns whatever document you want indexed. Since ActiveRecord's
JSON serialization has excellent support for associations, this is
actually quite pleasant.
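class Post < ActiveRecord::Base
  # Returns the JSON document that will be sent to Elasticsearch.
  def indexed_json_document
    # :include embeds the associations; :methods adds the result of #slug.
    to_json(:include => [:categories, :author], :methods => :slug)
  end
end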
Near real-time support:
There will be a "changes" queue, distinct from the "batch" indexing
tasks queue and with higher priority.
When a record is saved or updated, ActiveRecord callbacks will
automatically submit the record id to this "changes" queue. The agent
will receive the task and will update both the "current" index for
this record, and any index that is in the process of being created,
and that already has the record indexed: this ensures that when you
deploy the new index it will be up to date.
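A minimal in-memory sketch of that behaviour follows. QUEUES, INDEXES,
CURRENT, next_job and apply_change are hypothetical names, and plain
hashes stand in for the queue backend and for Elasticsearch. (With
Resque, the priority could come from the order in which a worker is
told to check its queues.)

```ruby
# Hypothetical in-memory model of the two queues.
QUEUES = { changes: [], batch: [] }

# One live index plus one still being built; each maps record id => document.
INDEXES = {
  "posts_1" => { 1 => "v1", 2 => "v1" },   # current, aliased as "posts"
  "posts_2" => { 1 => "v1" }               # in progress: record 2 not reached yet
}
CURRENT = "posts_1"

# Agents always drain "changes" before "batch".
def next_job
  [:changes, :batch].each do |name|
    job = QUEUES[name].shift
    return [name, job] if job
  end
  nil
end

# A change updates the current index, plus any other index that already
# holds the record. Records an in-progress index has not reached yet are
# skipped, because the batch job will pick up the fresh version later.
def apply_change(record_id, document)
  INDEXES.each do |name, docs|
    docs[record_id] = document if name == CURRENT || docs.key?(record_id)
  end
end

# What a save callback would do: push only the record id.
QUEUES[:batch]   << 3
QUEUES[:changes] << 1

first_queue, record_id = next_job
apply_change(record_id, "v2")
```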
Index Partitioning:
I would really like to support index partitioning. By this I mean the
ability to split a model across several indexes partitioned by some
field, and to automatically restrict each search to the appropriate
indexes.
So for example, if you have a Users index partitioned by "country_id",
you would actually have the indexes "users_US", "users_FR", "users_HK",
etc. A search by Surname would search in all indexes, but a search by
Surname and Country_ID would search only in one index.
But to be honest I do not have a workable design for this feature.
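The index-selection half of the idea, at least, is easy to sketch.
Everything here is an assumption for illustration: PARTITION_FIELD,
PARTITIONS and indexes_for are hypothetical names, and the list of
partition values is hard-coded. The index names are copied from the
example above; current Elasticsearch releases only accept lowercase
index names, so a real implementation would probably downcase them.

```ruby
# Hypothetical: the partitioning field and the partition values known to exist.
PARTITION_FIELD = :country_id
PARTITIONS = %w[US FR HK]

# Returns the indexes a search has to hit for the given criteria:
# one index when the partition field is present, all of them otherwise.
def indexes_for(base, criteria)
  value = criteria[PARTITION_FIELD]
  return ["#{base}_#{value}"] if value
  PARTITIONS.map { |partition| "#{base}_#{partition}" }
end

# Elasticsearch accepts a comma-separated list of index names in a search URL.
puts indexes_for("users", surname: "Smith").join(",")
puts indexes_for("users", surname: "Smith", country_id: "FR").join(",")
```

The hard parts (routing documents when the partition field changes,
discovering partitions, keeping aliases and versions per partition) are
not addressed by this sketch.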
Name:
I do not have a name for the project.
If any wise woman or man sees a gross mistake in the design, please
comment! And as I said, if anyone is working on something like this I'd
be happy to join the effort.
Cheers and congratulations to all the Elasticsearch team: it is an
impressive achievement.