Working on a Rails plugin


(angelf) #1

Hi,

Is anyone working on a Rails plugin? I am starting to work on one as
part of migrating the search backend at my company (vLex.com) to
Elasticsearch, and would be glad to share the effort.

The goal of the project is to provide easy Elasticsearch support to
existing Rails applications, exposing as much of Elasticsearch as
possible, with a basic design that enables parallelization of
indexing tasks.

Let me share some of the design ideas.

Indexing:

Creating an index will support two modes. In "local" mode, the
indexing rake task will iterate over all the records, create the full
JSON document for each one, and submit it to Elasticsearch.

In "distributed" mode, indexing jobs will be managed by a job queue
backend. Initially, support will be provided for Resque (http://
github.com/defunkt/resque, a Redis-backed job queue), but the plan is
to make the backend pluggable from day one and to add support for
Amazon's SQS soon.

In this mode, when a new index is created, record ids will be added to
the queue, and Resque agents will do the work of receiving the
document ids, creating the full JSON document, and submitting it to
Elasticsearch for indexing. The initial task of retrieving object ids
should also be distributed. Since you can launch as many agents on as
many machines as you need, this lets you scale the Rails side of
indexing.
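As a sketch of the distributed flow (hypothetical names only:
IndexBatchJob, the queue name, and indexed_json_document are
illustrative, not a finished API), a Resque-style job might look like:

```ruby
require "json"

# Hypothetical sketch of a Resque-style batch indexing job.
class IndexBatchJob
  @queue = :elasticsearch_batch  # Resque reads @queue to pick the queue

  # Split the full list of record ids into fixed-size batches, so each
  # enqueued job handles a small, independent slice of the reindex.
  def self.batches(ids, batch_size)
    ids.each_slice(batch_size).to_a
  end

  # A Resque worker calls perform with the arguments that were enqueued:
  # here, a model name and one batch of record ids.
  def self.perform(model_name, ids)
    model = Object.const_get(model_name)
    ids.each do |id|
      record = model.find(id)
      submit_to_elasticsearch(record.id, record.indexed_json_document)
    end
  end

  # Placeholder: a real implementation would PUT each document to
  # Elasticsearch over HTTP (e.g. with Net::HTTP).
  def self.submit_to_elasticsearch(id, json_doc)
  end
end
```

Since each batch is independent, any number of workers on any number
of machines can drain the queue in parallel.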

In our initial tests we have achieved 20x indexing speed with the
distributed mode compared with the local mode, limited only by the
source database's IO (Elasticsearch did great).

Indexes will be versioned: if you create a new index named "posts",
what will actually happen is that an index called "posts_$TIMESTAMP"
will be created, and when the indexing job has finished, a "posts"
alias will be pointed at it. This lets you create new index versions
(maybe with new index or mapping options) without disrupting users.
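Assuming Elasticsearch's standard _aliases endpoint, the versioning
step could be sketched like this (helper names are illustrative):

```ruby
require "time"

# Build a timestamped physical index name for a logical name like "posts".
def versioned_index_name(base, now = Time.now.utc)
  "#{base}_#{now.strftime('%Y%m%d%H%M%S')}"
end

# Body for Elasticsearch's POST /_aliases API: atomically remove the
# alias from the old index and add it to the new one, so searches
# against "posts" switch over without downtime.
def alias_swap_actions(alias_name, old_index, new_index)
  {
    "actions" => [
      { "remove" => { "index" => old_index, "alias" => alias_name } },
      { "add"    => { "index" => new_index, "alias" => alias_name } }
    ]
  }
end
```

Because both alias actions go in one request, there is no window in
which "posts" points at no index at all.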

Simplicity

When working in local mode, adding basic Elasticsearch support to a
model (without any custom index or mapping options) should be a matter
of adding one line to your model.

When working in distributed mode, the only additional task will be to
set up Resque (or your chosen queue backend) and launch the agents.

Configuration:

The philosophy is to expose as much of Elasticsearch as possible. I do
not want to create a DSL syntax to define index options, mapping
options, etc. Instead, index and mapping options will be stored in
JSON or YAML in Elasticsearch's native format.
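For example, a per-model options file could simply mirror the body of
an Elasticsearch create-index call (the file name and fields below are
hypothetical, using the 0.x-era mapping syntax):

```yaml
# Hypothetical posts.yml: plain Elasticsearch settings and mappings,
# not a plugin-specific DSL. Field names are illustrative.
settings:
  number_of_shards: 5
  number_of_replicas: 1
mappings:
  post:
    properties:
      title: { type: string, boost: 2.0 }
      slug:  { type: string, index: not_analyzed }
```

Anything Elasticsearch accepts in its API is then usable without the
plugin having to know about it.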

If you want to add associations or custom methods to the indexed
content, you will do so by defining an #indexed_json_document method
that returns whatever document you want to be indexed. Since
ActiveRecord's JSON serialization has excellent support for
associations, this is actually quite pleasant.

class Post < ActiveRecord::Base

  def indexed_json_document
    to_json(:include => [:categories, :author], :methods => :slug)
  end

end

Near-real-time support

There will be a "changes" queue, distinct from the "batch" indexing
tasks queue and with higher priority.

When a record is created or updated, ActiveRecord callbacks will
automatically submit the record id to this "changes" queue. The agent
receiving the task will update both the "current" index for the
record and any index that is in the process of being created and
already contains the record: this ensures that when you deploy the
new index it will be up to date.
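A minimal sketch of the callback side, assuming a hypothetical
ChangesQueue (a real implementation would enqueue to a dedicated
high-priority Resque queue rather than an in-memory Array):

```ruby
# Hypothetical in-memory stand-in for the high-priority "changes"
# queue; only the record's identity is enqueued, never the document.
class ChangesQueue
  @entries = []
  class << self
    attr_reader :entries

    def push(model_name, id)
      @entries << [model_name, id]
    end
  end
end

# In a model, an ActiveRecord callback would enqueue just the record
# id; the agent rebuilds and submits the full document later:
#
#   class Post < ActiveRecord::Base
#     after_save { |record| ChangesQueue.push(record.class.name, record.id) }
#   end
```

Keeping the callback down to a single queue push means saving a record
stays fast even when document serialization is expensive.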

Index Partitioning:

I would really like to support index partitioning. By this I mean the
ability to break a model into several indexes partitioned by some
field, and to automatically restrict searches to the appropriate
indexes.

So, for example, if you have a Users index partitioned on
"country_id", you would actually have "users_US", "users_FR",
"users_HK", etc. A search by surname would search all indexes, but a
search by surname and country_id would search only one index.
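The easy half of this, resolving which physical indexes a query should
hit, could be sketched like so (a toy with hypothetical names; the
hard part, keeping partitions consistent, is untouched here):

```ruby
# Hypothetical: given a logical index name, the partition field, and
# the query's filters, return the physical indexes to search. Queries
# filtering on the partition field hit exactly one index; everything
# else falls back to a wildcard across all partitions.
def partition_indexes(base, partition_field, filters)
  if filters.key?(partition_field)
    ["#{base}_#{filters[partition_field]}"]
  else
    ["#{base}_*"]
  end
end
```

Elasticsearch accepts wildcard index names in search requests, so the
fallback case still works with a single call.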

But to be honest I do not have a workable design for this feature.

Name

I do not have a name for the project :slight_smile:

If any wise woman or man sees a gross mistake in the design, please
comment! And, as I said, if anyone is working on something like this,
I'd be happy to join the effort.

Cheers and congratulations to the whole Elasticsearch team! It is an
impressive achievement.


(Shay Banon) #2

Hey,

In one word: wow! If you do open source this, I for one, and I am sure
many elasticsearch users, would be grateful. The design sounds very
good from my perspective. I am not a proper Rails (or Ruby) guy, but
one suggestion I can make is to try to build this in layers: have an
elasticsearch Ruby client, and add all the other features on top of
it. I think users would love a Ruby client that is maintained and has
a community around it, as well as a Rails plugin and even
infrastructure such as the distributed indexing processes.

I know of several Ruby clients; some are more maintained than others.
If possible, I really hope that joining forces makes sense. The one
that seems to be active now is the EventMachine-based one. The others
are listed here: http://www.elasticsearch.com/products/. Does your
project include a new Ruby client?

There are now two very active clients being developed: pyes for
Python and ElasticSearch.pm for Perl. Others are being born as we
speak (the .NET one, for example). One thing I like about the other
clients is the abstraction on top of the communication layer with
elasticsearch, so switching between HTTP and Thrift, for example, is
very easy (and I do my best to make it easy on the server side). In
any case, I would love for this mailing list to be a place where
people developing clients can share their experience with one another.

As for a name, the Ruby community ranks first (at least in my book) at
coming up with cool names... Some ideas I have (playing on "es"):
escargot, hornets, esty, crest, less, runes.

-shay.banon

On Mon, Oct 18, 2010 at 5:14 PM, Angel Faus angel.faus@gmail.com wrote:



(ppearcy) #3

Hey Angel,
Cool ideas. Overall, a sweet design.

I had a couple of comments on the index partitioning feature. I
actually implemented a similar solution on top of a different search
system. Choosing the partition field is always tricky, as there are
normally a few different fields that make sense, and you must ensure
that these fields are not multi-valued and never will be. For example,
what if some future requirement dictated that a user could be part of
multiple country groups?

Also, this can quickly balloon your index count, which will result in
more resource consumption and probably slower searches for queries
that aren't targeted at a single partition.

If you did want to go this route, I highly recommend doing lots of
benchmarks to ensure this isn't a premature optimization and is worth
the extra dev time and complexity. The performance we have seen with
ES has made us shelve every optimization path we've thought of, well,
other than adding more RAM :slight_smile:

Best Regards,
Paul

On Oct 18, 5:40 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

