Using the internal "transport module" for moving data between clusters


(José de Zárate) #1

The transport module is what Elasticsearch uses to move shards around
within a cluster.

Can it be used somehow to move index data between different clusters? The
point is to avoid the whole scan-at-the-source / index-at-the-destination
routine, which is essentially what every moving-data-between-clusters
implementation I've seen is based on.

Now that I have your attention: this is my case.

  • We have around 700 indexes, each with around 7k records, so relatively
    small.

  • The ES cluster does not work well with so many small indexes; it wastes
    too much time deciding which node is master and which is not.

  • We need to separate indexing from searching.

  • One solution is to index on one machine and then transfer the index to
    the search machine.

  • If we do it the standard way, that means indexing the dump from the
    index machine into the search machine, so no performance is gained.

  • Another solution would be to move data between source and destination
    the same way ES moves data inside a cluster, which I bet is much more
    efficient than the dump/reindex approach.

Is this possible?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/465c18c5-22b4-4150-a128-1f616efa0c47%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

Moving shards around is expensive, and fiddling with this is no fun at all
unless you have no updates.

You can separate index and search with index alias and routing

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html

so only dedicated nodes can index data, while other nodes have another
index in the alias for search.

Jörg



(José de Zárate) #3

Jörg

Sorry, but I don't quite understand what you're saying.
As I've read in the docs, "routing" is a per-document value Elasticsearch
uses to determine which shard that document goes into. That way, if
two documents share the same routing value, we can be certain they'll be
allocated to the same shard. AFAIK, that's the only thing that can be taken
for granted. (If two documents have different routing values, it does not
mean they'll end up in different shards.)
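
To make that concrete, here is a toy model of document routing, assuming the documented rule shard = hash(routing) mod number_of_primary_shards. The `shard_for` helper is hypothetical and `cksum` is only a stand-in hash picked for illustration; it is not the hash Elasticsearch uses, so the shard numbers will not match a real cluster.

```shell
#!/bin/sh
# Toy model of document routing:
#   shard = hash(routing) mod number_of_primary_shards
# cksum is a stand-in hash for illustration only, NOT the hash ES uses.

# shard_for ROUTING_VALUE NUM_PRIMARY_SHARDS (hypothetical helper)
shard_for() {
  hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( hash % $2 ))
}

# Four routing values over three shards: the same value always maps to the
# same shard, and at least two different values must share a shard.
for r in A B C D; do
  echo "routing=$r -> shard $(shard_for "$r" 3)"
done
```

Because the result depends only on the routing value and the number of primary shards, it is the same for a primary and for its replicas, which is the point made further down.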

On the other hand, I can associate an alias with an index name and a
routing value. That means, for instance, that if I use an alias with a
routing value, I can be certain that a search operation will go straight to
the same shards where all documents with that routing value were indexed.
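
As a rough sketch of such an alias, the body below is what the `_aliases` endpoint takes for an "add" action with a routing value. The names `customer_a` and `myindex` and the `alias_with_routing_body` helper are made up for illustration; the curl lines are commented out because they need a running cluster.

```shell
#!/bin/sh
ES="${ES:-localhost:9200}"   # assumed address

# alias_with_routing_body ALIAS INDEX ROUTING (hypothetical helper)
# Prints the _aliases request body binding ALIAS to INDEX with a routing value.
alias_with_routing_body() {
  printf '{"actions": [{"add": {"index": "%s", "alias": "%s", "routing": "%s"}}]}\n' \
    "$2" "$1" "$3"
}

# Show the body that would be sent.
alias_with_routing_body customer_a myindex A

# To apply it and then search through the alias:
# curl -XPOST "$ES/_aliases" -d "$(alias_with_routing_body customer_a myindex A)"
# curl "$ES/customer_a/_search?q=*:*"
```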

However, when I want to separate indexing from searching, the shards in the
"indexing zone" and in the "searching zone" should contain the same data. I
guess the primary shards would all be located in the indexing zone and the
replica shards in the searching zone. I honestly can't see what routing
values have to do with this. A routing value only determines which shard a
document is going to be located in, and that will be the same whether it's a
primary or a replica. I mean, if I search for something in a three-shard
index with 1 replica using the routing value "A", which corresponds to the
first shard, both the first of the primary shards and the first of the
replica shards would satisfy that condition. I see no way of
"directing" search requests to only the replica shards and index
requests to only the primary shards using routing and aliases.

Also, the doc here
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html
allows me to specify on which nodes I want a given index's shards to be
located, but, again, I can't see how I can use that to "separate" indexing
and searching.

That being said, I know there has to be a way (in fact, in
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
they talk about "indexing" and "searching" aliases), but I just can't see it.
If anyone could enlighten me I would really appreciate it.

txs!


--
uh, oh http://www.youtube.com/watch?v=GMD_T7ICL0o.

http://www.defectivebydesign.org/no-drm-in-html5



(Jörg Prante) #4

What I was referring to was shard routing, not document routing. I admit
that document routing is more prominent.

Shard routing is explained here
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html

Here is a little recipe for what you want to achieve

  • assume a simple cluster of node N1 and node N2, where N1 is for indexing
    and N2 is for search

  • tag the nodes in the config with node.tag: "index" on N1 (optionally set
    node.tag: "search" on N2)

  • now the first indexing round:

  • create index A1 with shard routing attribute "index" and replica level 0.
    Example
    curl -XPUT localhost:9200/a1 -d '{
    "settings" : {
    "index.routing.allocation.include.tag" : "index",
    "index.number_of_replicas" : 0
    }
    }'

  • feed documents into index A1, connect only to N1 with the client

  • allow replicas to spread to all nodes. Example
    curl -XPUT localhost:9200/a1/_settings -d '{
    "index.routing.allocation.include.tag" : "*"
    }'

  • set the replica level of index A1 to 1; ES then copies the shard
    automatically

  • create index alias A pointing to index A1

  • search on index A on the replica only, using the search preference
    "_only_node:<node id of N2>"
    http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-preference.html

  • preferably, search clients should only connect to N2

  • now, how to perform more indexing rounds when A1 is in use on both nodes?
    Start over again but with a second index:

  • create index A2 with shard routing attribute "index" and replica level 0

  • feed documents into index A2, connect only to N1 with the client

  • allow replicas to spread to all nodes

  • set the replica level of index A2 to 1; ES then copies the shard
    automatically

  • now, update index alias A to reflect the new index. For example, for a
    switchover, remove A1 and add A2. This is done atomically.

A1 and A2 can either be full updates or incremental. How this should be
reflected in search is managed by index aliasing with A. After a switchover
from A1 to A2, A1 can be dropped.
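
The recipe steps that have no example above (raising the replica level, creating the alias, searching on one node, the alias switchover, and dropping the old index) could be sketched roughly as follows. This is a dry run by default: it only prints each request. The helper names, the lowercase names `a`, `a1`, `a2`, and the placeholder `N2_NODE_ID` are assumptions for illustration; the real node ID of N2 has to be looked up in the cluster.

```shell
#!/bin/sh
# Dry-run sketch of the recipe steps that have no example above.
# Assumptions: cluster at localhost:9200, indexes a1/a2, alias a, and
# N2_NODE_ID as a placeholder for the real node ID of N2.
ES="${ES:-localhost:9200}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the request, 0 = send it with curl

# es METHOD PATH [BODY]: print the request, or send it when DRY_RUN=0.
es() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s %s%s\n' "$1" "$ES/$2" "${3:+ $3}"
  else
    curl -s -X"$1" "$ES/$2" ${3:+-d "$3"}
  fi
}

# Raise the replica level of index $1 to 1 so ES copies the shards.
raise_replicas() { es PUT "$1/_settings" '{"index.number_of_replicas": 1}'; }

# Point alias $1 at index $2.
add_alias() {
  es POST _aliases "{\"actions\": [{\"add\": {\"index\": \"$2\", \"alias\": \"$1\"}}]}"
}

# Search through alias $1 on the node with ID $2 only.
search_on_node() { es GET "$1/_search?preference=_only_node:$2"; }

# Move alias $1 from index $2 to index $3 in a single _aliases request.
switch_alias() {
  es POST _aliases "{\"actions\": [{\"remove\": {\"index\": \"$2\", \"alias\": \"$1\"}}, {\"add\": {\"index\": \"$3\", \"alias\": \"$1\"}}]}"
}

# Delete index $1 once nothing points at it any more.
drop_index() { es DELETE "$1"; }

# First round
raise_replicas a1
add_alias a a1
search_on_node a N2_NODE_ID

# Second round, after a2 has been filled on N1 and opened up to all nodes
raise_replicas a2
switch_alias a a1 a2
drop_index a1
```

Running it with DRY_RUN=0 sends the same requests with curl instead of printing them.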

N1 and N2 can also be extended to a group of nodes for indexing and
searching.

Jörg



(José de Zárate) #5

Jörg

that's brilliant indeed.

Now I'm going to check again with the dev crew what the reason was that we
couldn't run clusters with so many indexes (around 1k), and see if
that can be solved with smart shard routing management. (I think it was
related to ES wasting too many resources trying to determine
which node is responsible or master for which clients.)



(José de Zárate) #6

Jörg

I think that system works only for a two-node setup. Otherwise, how
can we be sure that, when letting Elasticsearch put a replica wherever it
wants, there's going to be a copy on the nodes tagged as "search"? Say you
have three "indexing" nodes and three "searching" nodes, and you go through
the process you described. You can be certain that your index is in the
"indexing" zone, but when you set replicas to 1, you can't be sure where
Elasticsearch is going to place the replicas. Unless, of course, the
fact that you're pointing the search at a particular node "forces"
Elasticsearch to put a replica there.
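
One way to check where the copies actually landed, rather than guess, is to read the `_cat/shards` listing. The sketch below is only an inspection aid, not a way to force placement. It assumes the usual column order (index, shard, prirep, state, docs, store, ip, node) and node names without spaces; the `nodes_holding` helper and the sample rows are made up for illustration.

```shell
#!/bin/sh
ES="${ES:-localhost:9200}"   # assumed address

# nodes_holding INDEX (hypothetical helper)
# Reads _cat/shards style rows on stdin and prints each node that holds a
# STARTED copy (primary or replica) of INDEX.
nodes_holding() {
  awk -v idx="$1" '$1 == idx && $4 == "STARTED" { print $NF }' | sort -u
}

# Made-up sample rows, in the assumed column order.
sample='a1 0 p STARTED 7000 1.2mb 10.0.0.1 N1
a1 0 r STARTED 7000 1.2mb 10.0.0.2 N2
a1 1 r UNASSIGNED'

printf '%s\n' "$sample" | nodes_holding a1

# Against a live cluster:
# curl -s "$ES/_cat/shards/a1" | nodes_holding a1
```

If none of the "search" nodes shows up in the output, the replica went somewhere else and searches pinned to those nodes would find no local copy.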


