Separating Index and Search

Hi,

My question is about the best practice to divide elasticsearch indexing and
search systems.

Currently, without elasticsearch (and with lucene), our indexing works on
several machines at one location and the search machines are at another
location.

The indexed data is copied or updated periodically from the indexing
machines to the search machines.

We want to maintain a similar structure using elasticsearch.

Is it possible for the elasticsearch nodes on the indexing and search
machines to be on the same cluster, so that the indexing nodes put replicas
on the search nodes and keep them updated, while the search client
approaches only the search nodes (ignoring the indexing nodes, because they
are in a different location and approaching them would slow the search)?

Also – is it possible to set the indexing nodes on the cluster to update
the search nodes periodically (and not constantly so the search performance
won't decrease)?

Thanks,

Ophir

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I'm not entirely positive, so wait for someone with more experience to
confirm/deny...but I don't think this is quite possible in ES right now.
You can fake it, however, with shard allocation filtering
(http://www.elasticsearch.org/guide/reference/modules/cluster.html),
multiple indices and aliases.

First, let's talk about the solution that appears to work, but in fact
does not: forced awareness settings. Forced awareness basically prevents
duplication of data within the same zone, so a primary + replica cannot
live in the same zone.

Imagine you have two nodes in your "indexing" zone, and two nodes in your
"search" zone. Primary shards are allocated in "indexing", replicas on
"search". If you use forced awareness and a node in your "search" zone
goes down, ES will know to avoid initializing a corresponding replica in your
indexing zone, since the primary already lives there.

Even better, if you perform searches on the "search" zone, forced awareness
makes ES prefer querying nodes in the same zone. Great!
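As a rough illustration of the setup described above, here is a minimal sketch of the settings involved. The attribute name "zone" and the zone values "indexing" and "search" are assumptions chosen to match this thread; the `node.zone` form is the one used by ES releases of this era (newer releases spell it `node.attr.zone`). The script only builds and prints the settings; it does not talk to a cluster.

```python
import json

# Hypothetical attribute name and zone values, chosen to match this thread.
AWARENESS_ATTRIBUTE = "zone"
ZONES = ["indexing", "search"]


def node_settings(zone):
    """Per-node setting that tags a node with the zone it belongs to."""
    if zone not in ZONES:
        raise ValueError("unknown zone: %s" % zone)
    return {"node.%s" % AWARENESS_ATTRIBUTE: zone}


def cluster_settings():
    """Cluster-wide settings that turn on forced awareness for the attribute."""
    prefix = "cluster.routing.allocation.awareness"
    return {
        # Make allocation aware of the attribute...
        prefix + ".attributes": AWARENESS_ATTRIBUTE,
        # ...and list every zone, so copies are never doubled up in one zone
        # just because the other zone is temporarily empty.
        prefix + ".force.%s.values" % AWARENESS_ATTRIBUTE: ",".join(ZONES),
    }


if __name__ == "__main__":
    print(json.dumps(node_settings("indexing"), indent=2))
    print(json.dumps(cluster_settings(), indent=2))
```

The node setting would go in each node's elasticsearch.yml, and the cluster settings either in the same file or through the cluster update settings API.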

However, the problem arises if one of your indexing nodes goes down. Zones
enforce data duplication boundaries, but do not interfere with primary
promotion. If one of your indexing nodes goes down, your cluster is now
missing a primary shard. ES has no choice but to promote a replica to a
primary, even if it lives in another zone. Now one of your primary shards
is actually living in the "search" zone and everything is all messed up.
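The failure case above can be shown with a toy model. This is not how ES implements promotion internally; it is a simplified sketch that assumes four hypothetical nodes (idx1, idx2, srch1, srch2) and two shards, and it only demonstrates that promotion picks a surviving replica without looking at zones.

```python
# Hypothetical node names and the zone each one is tagged with.
NODE_ZONE = {
    "idx1": "indexing", "idx2": "indexing",
    "srch1": "search", "srch2": "search",
}


def promote_on_failure(shards, failed_node):
    """Drop copies held by the failed node, then promote a surviving replica
    for every shard that was left without a primary."""
    surviving = [dict(copy) for copy in shards if copy["node"] != failed_node]
    by_shard = {}
    for copy in surviving:
        by_shard.setdefault(copy["shard"], []).append(copy)
    for copies in by_shard.values():
        if not any(copy["primary"] for copy in copies):
            # Promotion takes whatever replica is left; zones play no part.
            copies[0]["primary"] = True
    return surviving


# Starting layout: primaries in "indexing", replicas in "search".
shards = [
    {"shard": 0, "primary": True, "node": "idx1"},
    {"shard": 0, "primary": False, "node": "srch1"},
    {"shard": 1, "primary": True, "node": "idx2"},
    {"shard": 1, "primary": False, "node": "srch2"},
]

after = promote_on_failure(shards, "idx1")
primary_zones = {c["shard"]: NODE_ZONE[c["node"]] for c in after if c["primary"]}
print(primary_zones)  # {0: 'search', 1: 'indexing'}
```

After idx1 fails, the primary for shard 0 sits in the "search" zone, which is exactly the situation that breaks the intended split.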

As an alternative, what you can do is use Shard Allocation Filtering to
separate an "Indexing" index and a "Search" index onto physically separate
nodes. E.g. the "search" index is forced to allocate to nodes with the
"search" tag. You then index into your "indexing" index (hah) and when it is
ready for search requests, change the tag on the index over to "search".

ES will automatically transfer the shards over to your Search nodes. When
the transfer is complete, change a top-level alias to switch between the
old and new index transparently, then delete the old index. This method
obviously has a lot of moving parts, and loads the search nodes with
periodic network transfer as you move shards around.
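The sequence above can be sketched as a list of REST calls. This is only an outline under stated assumptions: the index names (products_v1, products_v2), the alias name (products) and the node tag "search" are hypothetical, and the code only builds the requests rather than sending them. The setting `index.routing.allocation.include.tag` and the `_aliases` actions are the standard allocation filtering and alias APIs.

```python
import json


def retag_index(index, tag):
    """Settings update that moves `index` onto nodes carrying node.tag == tag."""
    return ("PUT", "/%s/_settings" % index,
            {"index.routing.allocation.include.tag": tag})


def swap_alias(alias, old_index, new_index):
    """One _aliases call with both actions, so searches never see a gap."""
    return ("POST", "/_aliases", {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]})


def delete_index(index):
    """Drop the index that the alias no longer points to."""
    return ("DELETE", "/%s" % index, None)


def rollover_plan(alias, old_index, new_index):
    """Ordered steps: retag, (wait for relocation), swap alias, delete old."""
    return [
        retag_index(new_index, "search"),
        # In practice, wait here until the shards have finished relocating
        # before continuing with the next two steps.
        swap_alias(alias, old_index, new_index),
        delete_index(old_index),
    ]


for method, path, body in rollover_plan("products", "products_v1", "products_v2"):
    print(method, path, json.dumps(body) if body is not None else "")
```

Search clients only ever query the alias, so they do not need to know which concrete index is currently live.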

-Zach


Thanks for the explanation.

Regarding the second suggestion, is it the same as setting an "Indexing"
cluster, and a "Search" cluster?

The index created by the Indexing cluster is periodically copied to the
Search cluster folder (index) to replace the old search folder (index).

Both this and your suggested solution have the disadvantage of not allowing
real time indexed data to be searchable, which is one of our reasons for
moving to elasticsearch.


The infrastructure that we had in place before moving to ElasticSearch had
the same workflow. Indexer nodes create the index which then gets deployed
to searcher nodes.

For probably the same performance reasons, we tried to replicate the same
workflow in ElasticSearch, using many of the techniques that Zachary
highlighted. In the end we found it to be too much of an administrative
hassle. We decided to embrace ElasticSearch to its fullest and let it deal
with the merging of new docs/segments in an efficient manner. If you want
real time indexing, you need to have the searcher nodes handle the indexing.

Aliases are a good way to handle the creation of new indices that are not
meant to be searched yet.

Cheers,

Ivan


It is similar to having two separate clusters, except the method I outlined
lets ElasticSearch deal with data migration so you don't have to rsync
folders around manually. Otherwise it is basically the same.

And basically everything Ivan said =)

-Zach


Zachary –

Suppose I want to use suggestion 1 –
I managed to set zones, but it's not clear how to check that it works, meaning:

  1. Primary shards are at indexing zone and replica shards are at search
    zone.
  2. Search is done only at search zone.

Thanks, Ophir


Hi Ophir,

Based on this comment by Shay (https://github.com/elasticsearch/elasticsearch/issues/1352#issuecomment-2166409),
it appears that you can check attribute status through the Cluster State
API, but cannot check the awareness settings. I think the only way you
will be able to confirm zone delineation is to manually check to see if the
nodes you expect in the "indexing" zone are holding all primary shards.
And of course, if (when) an indexing node goes down, you'll have to
manually intervene to reroute primaries back out of the "search" zone.
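The manual check described above could be scripted along these lines. This is a sketch, not a tested tool: it assumes the Cluster State API response exposes node attributes under `nodes.<id>.attributes` and shard copies under `routing_table.indices.<index>.shards.<n>`, and the sample state below is hand-written for illustration rather than real cluster output. The function name and the "zone"/"indexing" defaults are hypothetical.

```python
def misplaced_primaries(state, attribute="zone", expected="indexing"):
    """Return (index, shard, node_id) for every started primary that is
    held by a node outside the expected zone."""
    zone_of = {
        node_id: node.get("attributes", {}).get(attribute)
        for node_id, node in state["nodes"].items()
    }
    misplaced = []
    for index, index_table in state["routing_table"]["indices"].items():
        for shard_id, copies in index_table["shards"].items():
            for copy in copies:
                if copy["primary"] and copy["state"] == "STARTED":
                    if zone_of.get(copy["node"]) != expected:
                        misplaced.append((index, int(shard_id), copy["node"]))
    return sorted(misplaced)


# Hand-written stand-in for a cluster state response (illustrative only).
sample_state = {
    "nodes": {
        "n1": {"name": "idx1", "attributes": {"zone": "indexing"}},
        "n2": {"name": "srch1", "attributes": {"zone": "search"}},
    },
    "routing_table": {"indices": {"docs": {"shards": {
        "0": [{"primary": True, "state": "STARTED", "node": "n1"},
              {"primary": False, "state": "STARTED", "node": "n2"}],
        "1": [{"primary": True, "state": "STARTED", "node": "n2"},
              {"primary": False, "state": "STARTED", "node": "n1"}],
    }}}},
}

print(misplaced_primaries(sample_state))  # [('docs', 1, 'n2')]
```

An empty list would mean every started primary is where you expect it; anything else is a candidate for a manual reroute.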

Search requests going to a particular zone are "preferentially" served by
nodes in the same zone. I'm unsure of the degree of this "preference", and if
it will ever spill over to the other zone. Someone more experienced with
zones would have to answer that. You may be able to manually confirm that
search requests are being served by one zone through logging.

But, as mentioned before, this use-case is largely tangential to the
ElasticSearch "philosophy" and zones were not really designed for this kind
of work.
