What are the research papers that ES relies on?

MrBu · March 30, 2015, 3:07am

Other than Lucene's own research papers, which research papers or special
algorithms does Elasticsearch rely on? I couldn't find a list of them in
the documentation.

Are special algorithms used, and if so, which ones and where? For example,
which algorithm is used for load distribution, or is it just round robin?

I really want to dig deep into Elasticsearch.

That way I could learn more. For example, suppose there are 20 nodes and,
for some reason, only the data on the 3rd node is being searched all the
time (say the popular documents somehow all ended up on this node). Does
Elasticsearch spread this load over the whole cluster by moving some of
that data to other nodes, or will it always use only the 3rd node? There
are tons of questions in my mind waiting to be answered, and the only way
to answer them seems to be reading the algorithms. It would help me a lot.

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/75907f69-38be-49fb-bf69-2f5dbf83cc45%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · March 30, 2015, 7:33am

Elasticsearch is open source, so reading (and using and modifying) the
algorithms is possible. There is also a lot of introductory material
available online, and I recommend "Elasticsearch: The Definitive Guide" if
you want a book.

When you create an index, ES creates shards for it (5 by default) and
spreads those shards over different nodes, so indexing and search are
automatically distributed over the participating nodes. ES keeps a map of
shards in the cluster state, so every node is able to route a query or an
index command. You don't need to manually route queries to shards.
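
To make the shard map idea concrete, here is a minimal Python sketch. It
assumes a plain round-robin assignment of shards to nodes, which is only
an illustration and not Elasticsearch's real allocator. The node names and
the helpers `build_shard_table` and `forward` are hypothetical.

```python
from itertools import cycle


def build_shard_table(nodes, num_shards):
    """Assign shard numbers to nodes round-robin (illustrative only)."""
    node_cycle = cycle(nodes)
    return {shard: next(node_cycle) for shard in range(num_shards)}


def forward(receiving_node, shard, shard_table):
    """Return (owner, hops) for a request that arrived at receiving_node.

    Every node holds the same table, so any node can look up the owner
    and either serve the request locally (0 hops) or forward it (1 hop).
    """
    owner = shard_table[shard]
    hops = 0 if owner == receiving_node else 1
    return owner, hops


shard_table = build_shard_table(["node-1", "node-2", "node-3"], num_shards=5)
print(shard_table)
print(forward("node-3", 0, shard_table))
```

Because the table is shared, the client can talk to any node and never has
to know where a shard lives.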

You can force ES to put all data on the 3rd node, and in that case you
already know what you want, so there is no surprise. ES follows the
principle of least surprise.

Jörg

MrBu · March 30, 2015, 2:57pm

Jörg,

Thanks for the input. I have read many tutorials and guides (the official
one too). I just want routing to happen in a more automagic way, like
routing evenly to the shards and maybe duplicating the most heavily used
shard to other shards.

aaronmefford · March 30, 2015, 3:55pm

"Automagic" routing already happens by hashing the document id. It sounds
like you may have a situation where your document ids are creating a hot
spot. If so, what you want is not automagic routing but more control over
the routing, or a better document id. You can supply your own routing and
create a more even distribution for your given key set, but I think you
would be better served by a better document key. This isn't Mongo or
HBase, where the document key rules the world.
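
As a rough illustration of how a routing value can create a hot spot, the
Python sketch below routes 1000 made-up documents twice: once by document
id and once by a hypothetical customer name where one customer owns 90% of
the documents. The hash is an MD5-based stand-in chosen only because it
mixes well; it is not the function Elasticsearch uses.

```python
import hashlib
from collections import Counter


def stand_in_hash(value):
    """Well-mixed stand-in hash; NOT the function Elasticsearch uses."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def shard_counts(routing_values, num_shards):
    """Count how many documents land on each shard."""
    counts = Counter(stand_in_hash(v) % num_shards for v in routing_values)
    return [counts.get(shard, 0) for shard in range(num_shards)]


NUM_SHARDS = 5
doc_ids = [f"doc-{i}" for i in range(1000)]

# Hypothetical skew: 9 out of 10 documents belong to one customer.
customers = ["big-customer" if i % 10 else f"small-{i}" for i in range(1000)]

by_id = shard_counts(doc_ids, NUM_SHARDS)
by_customer = shard_counts(customers, NUM_SHARDS)
print("routed by document id:", by_id)
print("routed by customer:   ", by_customer)
```

Routing by the skewed key sends at least 900 documents to a single shard,
while routing by id spreads them roughly evenly.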

The other possible reason you are hot-spotting is index creation. In a log
ingestion scenario, the most recent index is almost always the hottest
index: that is where all indexing occurs and where all queries start. If
you have changed the default of 5 shards and are creating only 1 shard,
that shard will be hot in this scenario.

Your comment on routing a shard to another shard does not make sense. You
need to read a bit more on what shards are and how they work. That said,
if you have multiple replicas of a shard, those copies will automatically
be distributed across your nodes. In fact, if the number of replicas is
one less than the number of nodes in the cluster (the primary counts as
one copy), you should automatically have all data on all nodes, any node
will be able to query local data, and no node will be hot because of query
volume. However, indexing is still routed to the primary shard first.
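
The replica arithmetic can be sketched in Python as follows. This assumes
that each shard has `num_replicas + 1` copies (the primary plus its
replicas) and that two copies of the same shard never share a node. The
round-robin placement and the helper `place_copies` are simplifications
for illustration, not Elasticsearch's allocator.

```python
def place_copies(num_nodes, num_shards, num_replicas):
    """Place each shard's primary and replicas on distinct nodes.

    Returns {node_index: set of shard numbers held on that node}.
    """
    copies = num_replicas + 1  # the primary counts as one copy
    if copies > num_nodes:
        raise ValueError("more copies than nodes; some could not be placed")
    held = {node: set() for node in range(num_nodes)}
    for shard in range(num_shards):
        for copy in range(copies):
            # Consecutive nodes, so copies of one shard never collide.
            held[(shard + copy) % num_nodes].add(shard)
    return held


full = place_copies(num_nodes=3, num_shards=5, num_replicas=2)
partial = place_copies(num_nodes=3, num_shards=5, num_replicas=1)
print(full)     # every node holds every shard
print(partial)  # each node holds only some of the shards
```

With 3 nodes, 2 replicas are enough for every node to hold a full copy of
the index, so searches can be served locally anywhere.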

As mentioned previously, the code is open; however, it sounds like you
are looking to go deep-water diving before learning to swim.

MrBu · March 30, 2015, 4:04pm

Aaron, thanks for the reply.

You can't put all of the documents on every node if their total size is
larger than a usual HDD. Also, that was just an example I gave. I am just
trying to figure out the techniques that ES itself uses, as opposed to the
ones that come from Lucene.

aaronmefford · March 30, 2015, 9:42pm

I understand that if you do not have sufficient storage space, you cannot
keep a replica on every node. However, you are not limited to the size of
a "usual HDD": a file system can span many HDDs. I am not suggesting this,
but if you have a situation where you need to distribute all of your data,
you can. Also, we have little info on your use case, and the most typical
one seems to be log ingestion. In that scenario you can treat the hot
index, the most recent one, differently from the others. You could set the
number of replicas on your most recent index high enough to spread its
data across the entire cluster, and then reduce the number of replicas as
a new index comes online. You could also reindex historical data into
fewer shards, improving performance and reducing additional maintenance
tasks.
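
A minimal Python sketch of the replica adjustment, assuming daily log
indices: it only builds the requests for the update index settings
endpoint (`PUT /<index>/_settings` with `index.number_of_replicas`) and
does not send them. The index names, the replica counts and the helper
`replica_requests` are hypothetical.

```python
import json


def replica_requests(indices_newest_first, hot_replicas, cold_replicas):
    """Build (method, path, body) tuples for updating replica counts.

    The newest index gets hot_replicas; all older ones get cold_replicas.
    """
    requests = []
    for position, index_name in enumerate(indices_newest_first):
        replicas = hot_replicas if position == 0 else cold_replicas
        body = json.dumps({"index": {"number_of_replicas": replicas}})
        requests.append(("PUT", f"/{index_name}/_settings", body))
    return requests


# Hypothetical daily log indices, newest first.
daily = ["logs-2015.03.31", "logs-2015.03.30", "logs-2015.03.29"]
for method, path, body in replica_requests(daily, hot_replicas=4, cold_replicas=1):
    print(method, path, body)
```

Running this whenever a new daily index is created would keep only the
newest index widely replicated.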

The reason I think you need to spend a bit more time reading is that the
algorithm is very easy to find:

Routing a Document to a Shard | Elasticsearch: The Definitive Guide [master] | Elastic

It is a very simple algorithm and a standard approach to sharding:

shard = hash(routing) % number_of_primary_shards

The routing value is the document id by default, though you can specify
your own routing value. Which hash is used matters little except in very
odd cases.
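
The formula can be sketched in a few lines of Python. The hash here is a
textbook djb2 (start at 5381, multiply by 33, add the character code) kept
to 32 bits; it is used only as a concrete example and is not claimed to
match Elasticsearch's implementation bit for bit. The function
`route_document` and the sample ids are hypothetical.

```python
def djb2(text):
    """Textbook djb2: start at 5381, then hash = hash * 33 + char code."""
    value = 5381
    for char in text:
        value = (value * 33 + ord(char)) & 0xFFFFFFFF  # keep to 32 bits
    return value


def route_document(doc_id, number_of_primary_shards, routing=None, hash_fn=djb2):
    """Apply shard = hash(routing) % number_of_primary_shards.

    The routing value defaults to the document id unless one is given.
    """
    routing_value = doc_id if routing is None else routing
    return hash_fn(routing_value) % number_of_primary_shards


print(route_document("a", 5))
print(route_document("order-17", 5, routing="a"))
```

Both calls land on the same shard because they share the routing value
"a". Note that the shard count is part of the formula, so changing the
number of primary shards changes where existing documents would be routed.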

A bit more research shows this from the source:

Switch to murmurhash3 to route documents to shards.
github.com/elastic/elasticsearch, commit 9ea25df
committed by jpountz, 03:32PM - 04 Nov 2014 UTC (+971 -554)

We currently use the djb2 hash function in order to compute the shard a
document should go to. Unfortunately this hash function is not very
sophisticated and you can sometimes hit adversarial cases, such as numeric
ids on 33 shards. Murmur3 generates hashes with a better distribution,
which should avoid the adversarial cases.

Here are some examples of how 100000 incremental ids are distributed to
shards using either djb2 or murmur3.

5 shards:
  Murmur3: [19933, 19964, 19940, 20030, 20133]
  DJB: [20000, 20000, 20000, 20000, 20000]

3 shards:
  Murmur3: [33185, 33347, 33468]
  DJB: [30100, 30000, 39900]

33 shards:
  Murmur3: [2999, 3096, 2930, 2986, 3070, 3093, 3023, 3052, 3112, 2940, 3036, 2985, 3031, 3048, 3127, 2961, 2901, 3105, 3041, 3130, 3013, 3035, 3031, 3019, 3008, 3022, 3111, 3086, 3016, 2996, 3075, 2945, 2977]
  DJB: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 900, 900, 900, 900, 1000, 1000, 10000, 10000, 10000, 10000, 9100, 9100, 9100, 9100, 9000, 9000, 0, 0, 0, 0, 0, 0]

Even if djb2 looks ideal in some cases (5 shards), the fact that the
distribution of its hashes has some patterns can raise issues with some
shard counts (eg. 3, or even worse 33).

Some tests have been modified because they relied on implementation
details of the routing hash function.

Close #7954

Current releases seem to use the DJB2 hash, which is good but has some
cases, such as 33 shards, where it behaves poorly. In version 2.0 it
appears they are moving to murmur3, which distributes documents more
evenly across a wider range of shard counts. Note that with the default of
5 shards, DJB2 performs ideally.
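
The 33-shard problem is easy to see in a small Python experiment. With a
multiplier of 33, a textbook djb2 hash taken modulo 33 depends only on the
last character of the id as long as the value has not wrapped around 32
bits, so short numeric ids can reach only 10 of the 33 shards. The second
hash is an MD5-based stand-in for a well-mixed function; it is not
murmur3. The experiment uses ids 0 to 999 and does not try to reproduce
the exact figures from the commit message.

```python
import hashlib
from collections import Counter


def djb2(text):
    """Textbook djb2 kept to 32 bits."""
    value = 5381
    for char in text:
        value = (value * 33 + ord(char)) & 0xFFFFFFFF
    return value


def stand_in_hash(text):
    """Well-mixed stand-in built from MD5; this is NOT murmur3."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def distribution(hash_fn, ids, num_shards):
    """Documents per shard when each id is routed with hash_fn."""
    counts = Counter(hash_fn(str(i)) % num_shards for i in ids)
    return [counts.get(shard, 0) for shard in range(num_shards)]


ids = range(1000)
djb2_counts = distribution(djb2, ids, 33)
mixed_counts = distribution(stand_in_hash, ids, 33)
print("djb2 shards in use:    ", sum(1 for c in djb2_counts if c))
print("stand-in shards in use:", sum(1 for c in mixed_counts if c))
```

In this run djb2 fills only shards 15 to 24 with 100 documents each (the
character codes of "0" to "9" modulo 33), leaving the other 23 shards
empty.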

MrBu · March 31, 2015, 5:53pm

That's what I was looking for (murmur3). I really wondered what they used,
and I was going to ask about murmur3 as well. As I see it, things are
going pretty awesome.

Thanks

aaronmefford · March 31, 2015, 6:03pm

Murmur3 appears to be coming in 2.0. Currently it looks like it is using
DJB2.

On Tue, Mar 31, 2015 at 11:53 AM, MrBu metin.akyali@gmail.com wrote:

That's what I was looking for (murmur3). I really wondered what they used,
and I was going to ask about murmur3 as well. But as I see it, things are
going pretty awesome.

Thanks

On Tuesday, March 31, 2015 at 00:42:45 UTC+3, Aaron Mefford wrote:

I understand that if you do not have sufficient storage space, then you
cannot keep a replica on every node. However, you are not limited to the
size of a "usual hdd". You can have a file system that spans many hdds. I
am not suggesting this, but if you have a situation where you need to
distribute all of your data, then you can. Also, we have little info on
your use case, and the most typical one seems to be log ingestion. In that
scenario you can treat the hot index, the most recent one, differently
from the others. You could set the number of replicas on your most recent
index high enough to spread its data across the entire cluster, and then
reduce the number of replicas as a new index comes online. You could also
reindex historical data into fewer shards, improving performance and
reducing additional maintenance tasks.
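
As a rough sketch of the "reduce the number of replicas on older indices" idea, the request below targets the index settings endpoint (`PUT /<index>/_settings`). The cluster address `http://localhost:9200` and the index name `logs-2015.03.29` are made up for illustration, and the helper name is hypothetical. The code only builds the request; nothing is sent unless you uncomment the last line.

```python
import json
import urllib.request


def replica_update_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a request that changes number_of_replicas on one index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical values: a local cluster and yesterday's log index.
    req = replica_update_request("http://localhost:9200", "logs-2015.03.29", 1)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To apply it against a real cluster:
    # urllib.request.urlopen(req)
```

Unlike the number of primary shards, the replica count can be changed on an existing index, which is what makes this per-index tuning possible.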

The reason I think you need to spend a bit more time reading is that the
algorithm is very easy to find:
Elasticsearch: The Definitive Guide [master] | Elastic (routing-value.html)

It is a very simple algorithm and standard approach to the issue of
sharding:

shard = hash(routing) % number_of_primary_shards

The routing value by default is the document id, though you can specify
your own routing value. The specifics of which hash are not as important
except in very odd cases.
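
To make the formula concrete, here is a minimal sketch in Python. It assumes a DJB2-style string hash kept to 32 bits; the function names are mine, and this is an illustration of the formula, not the actual Elasticsearch code, so the shard numbers it produces will not necessarily match a real cluster.

```python
def djb2(key: str) -> int:
    """DJB2-style string hash, truncated to 32 bits (an assumption of this sketch)."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards"""
    return djb2(routing) % number_of_primary_shards


def distribution(routing_values, number_of_primary_shards: int) -> list:
    """Count how many routing values land on each shard."""
    counts = [0] * number_of_primary_shards
    for value in routing_values:
        counts[shard_for(value, number_of_primary_shards)] += 1
    return counts


if __name__ == "__main__":
    # By default the routing value is the document id (hypothetical ids here).
    ids = [f"doc-{i}" for i in range(10000)]
    print(distribution(ids, 5))
```

Because the shard is computed from the routing value alone, any node can work out where a document lives without asking anyone. The same property is why a custom routing value shared by many documents sends all of them to one shard.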

A bit more research shows this from the source:

Extract repositories metrics into its own class (#103034) · elastic/elasticsearch@b9c2980 · GitHub
9ea25df64927172787f2ffa1049f9c7804a91053#diff-d1fcc8637b3800bf7da881b93e1de983

Current implementations seem to use the DJB2 hash which is good but does
have some cases such as 33 shards where it behaves poorly. In version 2.0
it appears they are moving to murmur3 which is a more consistent hash
across a greater set of use cases. Note that with the default of 5 shards,
DJB2 performs ideally.
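
The 33-shard case is easy to see once you notice that DJB2 multiplies the running hash by 33 at every step. Modulo 33, everything except the last character cancels out, so the shard depends only on how the id ends. The sketch below assumes a DJB2-style hash truncated to 32 bits and uses short numeric ids so that nothing overflows; once the value no longer fits the integer width, the identity stops holding exactly. The names are hypothetical and this is not the Elasticsearch source.

```python
def djb2(key: str) -> int:
    """DJB2-style string hash, truncated to 32 bits (an assumption of this sketch)."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def shards_used(keys, number_of_shards: int) -> set:
    """Return the set of shard numbers that receive at least one key."""
    return {djb2(k) % number_of_shards for k in keys}


if __name__ == "__main__":
    # Short numeric ids "0" .. "999": small enough that the hash never wraps.
    ids = [str(i) for i in range(1000)]
    print(len(shards_used(ids, 5)), "of 5 shards used")
    print(len(shards_used(ids, 33)), "of 33 shards used")
```

With these ids, all 5 shards are used in the first case, but only 10 of the 33 shards in the second: one per possible final digit.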

On Monday, March 30, 2015 at 10:04:08 AM UTC-6, MrBu wrote:

Aaron, thanks for the reply.

You can't distribute all of the documents if their total size is more than
a usual hdd. Also, that was just an example I gave. I am trying to figure out
the magical ways that ES itself uses, as opposed to what Lucene has on its own.

On Monday, March 30, 2015 at 18:55:49 UTC+3, Aaron Mefford wrote:

"Automagic" routing already happens by hashing the document id. It
sounds like you may have a situation where your document id is creating a
hot spot. If that is the case, what you want is not automagic routing but
more control over the routing, or a better document id. You can
code your own routing and create a more even distribution for
your given keyset, but I think you would be better served by a better
document key. This isn't Mongo or HBase, where the document key rules the
world.

The other possible reason you are hot-spotting is index creation. In a
log ingestion scenario, the most recent index is almost always the hottest
index. That is where all indexing is occurring, and that is where all queries
start. If you have tweaked the 5 shard norm and are only creating 1 shard,
that shard will be hot in this scenario.

Your comment on routing a shard to another shard does not make any
sense. You need to read a bit more on what shards are and how they
work. That said, if you have multiple replicas of a shard, those
copies will automatically be distributed across your nodes. In fact,
if the number of copies of each shard (the primary plus its replicas) is
the same as the number of nodes in the cluster, you should automatically
have all data on all nodes, any node will be able to query local data, and
no node will be hot because of query volume. However, indexing is still
routed to the primary shard.

As was mentioned previously, the code is open; however, it sounds like
you are looking to go deep water diving before learning to swim.

On Monday, March 30, 2015 at 8:57:51 AM UTC-6, MrBu wrote:

Jörg,

Thanks for the input. I have read many tutorials and guides (the official one
too). I just want to re-route in a more automagic way, like routing evenly to
the shards and maybe duplicating the most used shard onto other nodes.

On Monday, March 30, 2015 at 10:33:19 UTC+3, Jörg Prante wrote:

Elasticsearch is open source, so reading (and using and modifying)
the algorithms is possible. There is also a lot of introductory material
available online, and I recommend "Elasticsearch - The definitive guide" if
you want paperwork.

If you create an index, ES creates shards for this index (by default
5), and different nodes receive one of such shards, so indexing and search
is automatically distributed over the participating nodes. ES keeps a map
of shards in the cluster state, so every node is able to route a query or
an index command. You don't need to manually route queries to shards.

You can force ES to put all data on 3rd node, and in that case, you
already know what you want... there is no surprise. ES follows the
principle of least surprise.

Jörg


