Automatic index balancing plugin or other solution?

I have heard that ideally, you want to have a similar number of documents
per shard for optimal search times, is that correct?

I have data volumes that are just all over the place, from 100k to tens of
millions of documents in a week.

I'm thinking about a river plugin that could:
- Take a mapping object as a template
- Define a template for child index names (project_YYYY_MM_DD_NNN =
  project_2014_04_08_000, etc)
- Define index shard count (5)
- Define maximum index size (1,000,000)
- Define a listening endpoint of some sort

Documents would stream into the listening endpoint however you wanted:
rivers, bulk loads using an API, etc. They would be automatically routed to
the lowest-numbered index that is not yet full. So on a given day you could
end up with fifteen indexes, or eighty, or two, but each would hold at most
N records.
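
To make the routing rule concrete, here is a rough sketch in Python of
"pick the lowest-numbered index that is not yet full". The function names,
the doc_counts dictionary, and the constant are all hypothetical and only
illustrate the idea; none of this is an existing Elasticsearch or river API.

```python
from datetime import date

# ASSUMPTION: cap taken from the "maximum index size (1,000,000)" item above.
MAX_DOCS_PER_INDEX = 1_000_000


def index_name(project: str, day: date, seq: int) -> str:
    """Build a child index name such as project_2014_04_08_000."""
    return f"{project}_{day:%Y_%m_%d}_{seq:03d}"


def pick_index(project: str, day: date, doc_counts: dict,
               max_docs: int = MAX_DOCS_PER_INDEX) -> str:
    """Return the lowest-numbered index for `day` that is not yet full.

    `doc_counts` is a hypothetical map of index name -> current document
    count. An index missing from the map is treated as empty, i.e. it
    would be created on first write.
    """
    seq = 0
    while True:
        name = index_name(project, day, seq)
        if doc_counts.get(name, 0) < max_docs:
            return name
        seq += 1
```

In a real plugin the counts would have to come from the cluster itself and
be kept reasonably fresh under concurrent bulk loads, which is where most
of the complexity would sit.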

A plugin seems desirable in this case, as it frees you from needing to
write the load balancing into every ingestion stream you've got.

Is this a reasonable solution to this problem? Am I overcomplicating
things?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/176f4fb2-d924-4ec2-bcee-67ad8de24dfb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

The number of documents is not relevant to the search time.

Important factors for search time are the type of query, shard size, the
number of unique terms (the dictionary size), the number of segments,
network latency, disk drive latency, ...

Maybe you mean an equal distribution of docs of the same average size across
shards. In that case a search does not have to wait for nodes that must
search larger shards.

I do not think this needs a river plugin, since equal distribution of docs
over the shards is the default.
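
As a rough illustration of why the default spreads docs evenly: the target
shard is chosen as hash(routing) modulo the number of primary shards, where
the routing value defaults to the document id. The Python sketch below
mimics that with MD5 as a stand-in hash (an assumption for illustration,
not the hash Elasticsearch actually uses); shard_for and the doc-N ids are
made-up names.

```python
import hashlib
from collections import Counter


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the doc id) to a shard number.

    ASSUMPTION: MD5 is only a stand-in for the real hash function.
    """
    digest = hashlib.md5(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards


# Route 50,000 made-up document ids over 5 shards and count per shard.
counts = Counter(shard_for(f"doc-{i}", 5) for i in range(50_000))

for shard in sorted(counts):
    print(shard, counts[shard])
```

Each shard ends up with roughly a fifth of the documents, without any
balancing logic in the ingestion path.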

Jörg

On Tue, Apr 8, 2014 at 9:03 PM, Josh Harrison hijakk@gmail.com wrote:

I have heard that ideally, you want to have a similar number of documents
per shard for optimal search times, is that correct?


Interesting, ok. A colleague went to training with Elasticsearch and was
told that, given a default index with N shards, similar index sizes were
critical for maintaining consistent search performance. I guess maybe that
could play out by a two billion record index having a huge number of
unique terms, while a smaller, say, 100k record index would have a
substantially smaller set of terms, right?
Dealing with content from stuff like the Twitter public API, I would
anticipate fairly linear growth of unique terms and overall index size.
This ultimately leads back to the scenario I described initially, where a
larger index is comparatively slower to search, due to its necessarily
larger dictionary. It seems as though there'd still be room for the kind of
automatic scaling via a template system described above?

On Wednesday, April 9, 2014 7:38:35 AM UTC-7, Jörg Prante wrote:

The number of documents is not relevant to the search time.
