I understand that if you do not have sufficient storage space, you cannot
keep a replica on every node. However, you are not limited to the size of
a "usual hdd": a file system can span many hdds. I am not suggesting this,
but if you really need every node to hold all of your data, you can. Also,
since we have little info on your use case and the most typical one seems
to be log ingestion: in that scenario you can treat the hot index, the
most recent one, differently from the others. You could set the number of
replicas on your most recent index high enough to spread its data across
the entire cluster, and then reduce the number of replicas when a new
index comes online. You could also reindex historical data into fewer
shards, improving performance and reducing additional maintenance.
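As a rough sketch of the "reduce the replicas when a new index comes
online" idea: the replica count is a per-index setting that can be changed
on a live index through the index settings endpoint. The Python below only
builds the request and does not send it. The host, the index name and the
helper name replica_update_request are made up for illustration.

```python
import json
import urllib.request


def replica_update_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a PUT that changes an index's replica count."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical daily log index: once a newer index exists, yesterday's
# index is dropped to a single replica.
req = replica_update_request("http://localhost:9200", "logs-2015.03.29", 1)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Sending it would be a call to urllib.request.urlopen(req) against a real
cluster; that part is left out on purpose.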
The reason I think you need to spend a bit more time reading is that the
algorithm is very easy to find:
It is a very simple algorithm and standard approach to the issue of
shard = hash(routing) % number_of_primary_shards
The routing value by default is the document id, though you can specify
your own routing value. The specifics of which hash are not as important
except in very odd cases.
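To make the formula concrete, here is a minimal Python sketch. It is not
the Elasticsearch implementation: toy_hash is a stand-in (CRC32 from the
standard library) and shard_for is a hypothetical helper name. It only
shows that the routing value and the number of primary shards decide the
shard, nothing else.

```python
import zlib
from collections import Counter


def toy_hash(routing: str) -> int:
    # Stand-in hash for illustration only; Elasticsearch uses its own function.
    return zlib.crc32(routing.encode("utf-8"))


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    # shard = hash(routing) % number_of_primary_shards
    return toy_hash(routing) % number_of_primary_shards


# Route 10,000 made-up document ids across 5 primary shards and count
# how many documents land on each shard.
counts = Counter(shard_for(f"doc-{i}", 5) for i in range(10_000))
print(sorted(counts.items()))
```

The same routing value always maps to the same shard, which is also why
the number of primary shards cannot be changed without reindexing.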
A bit more research shows this from the source:
Current implementations seem to use the DJB2 hash, which is good but does
have some cases, such as 33 shards, where it behaves poorly. In version 2.0
it appears they are moving to murmur3, which distributes more evenly
across a greater set of use cases. Note that with the default of 5 shards,
DJB2 performs ideally.
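A small sketch of why 33 is an awkward shard count for DJB2. The textbook
DJB2 step is hash = hash * 33 + character, so modulo 33 everything except
the last character cancels out. The code below uses the textbook function
with unbounded Python integers; it does not model the integer truncation a
real implementation performs, and shards_used is a hypothetical helper.

```python
def djb2(value: str) -> int:
    # Textbook DJB2: start at 5381, then hash = hash * 33 + character code.
    # Python ints do not overflow, so no 32-bit truncation happens here.
    h = 5381
    for ch in value:
        h = h * 33 + ord(ch)
    return h


def shards_used(ids, num_shards: int) -> set:
    # The set of shard numbers that at least one id is routed to.
    return {djb2(doc_id) % num_shards for doc_id in ids}


# Made-up ids that all end in a decimal digit.
ids = [f"doc-{i}" for i in range(10_000)]

print(len(shards_used(ids, 33)))  # 10: only the last character matters
print(len(shards_used(ids, 5)))   # 5: every shard receives documents
```

With 33 shards, ids ending in a digit can reach at most 10 of the 33
shards in this model, which is the kind of hot-spotting discussed in this
thread.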
On Monday, March 30, 2015 at 10:04:08 AM UTC-6, MrBu wrote:
Aaron, thanks for the reply.
You can't put all of the documents on every node if their total size is
more than a usual hdd. Also, that was just an example I gave. I am trying
to figure out the magic that ES adds on top of what Lucene does on its own.
On Monday, March 30, 2015 at 18:55:49 UTC+3, Aaron Mefford wrote:
"Automagic" routing already happens by hashing the document id. It
sounds like you may have a situation where your document id is creating a
hot spot. If so, what you want is not automagic routing but more control
over the routing, or a better document id. You can code your own routing
and create a more even distribution for your given keyset, but I think
you would be better served by a better document key. This isn't mongo or
hbase where the document key rules the
The other possible reason you are hot-spotting is index creation. In a
log ingestion scenario, the most recent index is almost always the hottest
index. That is where all indexing occurs, and that is where all queries
start. If you have changed the default of 5 shards and are only creating
1 shard, that shard will be hot in this scenario.
Your comment on routing a shard to another shard does not make any
sense. You need to read a bit more on what shards are and how they work.
That said, if you have multiple replicas of a shard, then those copies
will automatically be distributed across your nodes. In fact, if the
number of copies of each shard (the primary plus its replicas) is the same
as the number of nodes in the cluster, you should automatically have all
data on all nodes, any node will be able to query local data, and no node
will be hot because of query volume. However, indexing is still routed to
the primary shard first.
As was mentioned previously, the code is open; however, it sounds like
you are looking to go deep water diving before learning to swim.
On Monday, March 30, 2015 at 8:57:51 AM UTC-6, MrBu wrote:
Thanks for the input. I have read many tutorials and guides (the official
one too). I just want to re-route in a more automagic way, like routing
evenly to the shards and maybe duplicating the most used shard onto other
nodes.
On Monday, March 30, 2015 at 10:33:19 UTC+3, Jörg Prante wrote:
Elasticsearch is open source, so reading (and using and modifying) the
algorithms is possible. There is also a lot of introductory material
available online, and I recommend "Elasticsearch: The Definitive Guide" if
you want a book.
If you create an index, ES creates shards for this index (by default
5), and the shards are allocated to different nodes, so indexing and
search are automatically distributed over the participating nodes. ES
keeps a map of shards in the cluster state, so every node is able to route
a query or an index command. You don't need to manually route queries to
shards.
You can force ES to put all data on the 3rd node, and in that case you
already know what you want... there is no surprise. ES follows the
principle of least surprise.
On Mon, Mar 30, 2015 at 5:07 AM, MrBu metin....@gmail.com wrote:
Other than Lucene's own research papers, what research papers or special
algorithms are being used by Elastic? I couldn't find a list of them in
the documentation.
Are there special algorithms in use (and which ones are used where)? For
example, what is the algorithm used in load distribution, or is it just
round robin?
I really want to go deep with Elastic.
This way I could have more knowledge. For example, suppose there are 20
nodes, and surprisingly (somehow) only the data on the 3rd node is being
searched all the time (say these are popular documents that somehow ended
up only on this node). Does Elastic spread this load over the whole
cluster by dividing the data among other nodes? Or will it always use
only the 3rd node?
There are tons of questions in my mind waiting to be answered. The only
possible way is to read the algorithms. It would help me a lot.
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to email@example.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/fa934662-61b3-42db-a97f-671ade563297%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.