Advice needed on ES indexes/structure

Hi,

In a project we'll have data from various sources. A lot of these sources
can be divided as such:

All data belongs to a certain group (1:m) and there can be multiple groups.
The data in a group is always unrelated to the data in another group. It's
hard to say what the ratio will be, but my initial guesstimate is more than
1000:1 (1000 "data" items for every 1 "group").
Data can be seen as chunks of text (of about 200 characters) and meta-data.
Within a group there can be various types of data. It can be file
meta-data, plain-text or "structured" JSON.

My questions are:

  • Is it a good idea to create an index for each group, or would a better
    approach be to create an index for the various types?
  • How does this work with sharding or is that an unrelated issue?
  • Is there any (availability/performance) benefit to keeping data
    contained within one index because it's related (since there is no
    shared data between groups)?

The documentation refers to an index as a schema, which hints at making an
index per type. I'd just like to know whether that is a correct assumption,
as there is a valid case for both options.

Thanks in advance,

Mark

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hello Mark,

From a functionality point of view, each type has a mapping attached to it,
which acts as a schema. On the other hand, all your types end up in the
same Lucene indices (shards), which means all fields from all types end up
in the same "master" schema, and the type itself is just another field.

Having one index per group has the benefit that you only have that index to
search on, but you'll have a lot of indices, and implicitly a lot of
shards, which comes with an overhead in terms of memory and open files.
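
To put a rough number on that overhead: each index is split into primary shards, every shard is its own Lucene index, and each replica adds another copy. A minimal sketch, assuming 5 primary shards and 1 replica per index (the defaults at the time); the group count is a made-up figure for illustration only:

```python
def total_shard_copies(num_indices, primaries_per_index=5, replicas=1):
    """Shard copies (primaries plus replicas) the cluster has to hold."""
    return num_indices * primaries_per_index * (1 + replicas)

# Hypothetical figure: one index per group, with 10,000 groups.
per_group = total_shard_copies(10_000)
# One shared index that holds every group.
shared = total_shard_copies(1)

print(per_group, shared)
```

Each of those shard copies keeps its own files open and its own structures in memory, which is where the overhead comes from.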

On the other hand, having all data in the same index would normally
require searching in all shards to find data for a single group. You can
work around that by using routing:
http://www.elasticsearch.org/guide/reference/mapping/routing-field/
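
A minimal sketch of the idea behind routing, not Elasticsearch's actual implementation: the routing value is hashed and taken modulo the number of primary shards, so every document indexed with the same value lands in the same shard, and a search that passes that value only needs to visit that one shard. Here `zlib.crc32` is only a stand-in for Elasticsearch's internal hash function, and the index, type and group names are hypothetical; `routing` is the request parameter accepted when indexing and searching.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed shard count for the shared index

def shard_for(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    # crc32 is only a stand-in for Elasticsearch's own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

def index_path(index, doc_type, doc_id, group):
    # Path for a PUT request that indexes one document, routed by its group.
    return "/%s/%s/%s?routing=%s" % (index, doc_type, doc_id, group)

def search_path(index, group):
    # Path for a search that only visits the shard this group is routed to.
    return "/%s/_search?routing=%s" % (index, group)

# The same routing value always maps to the same shard.
print(shard_for("group-42") == shard_for("group-42"))
print(index_path("data", "plaintext", "1", "group-42"))
print(search_path("data", "group-42"))
```

Several groups will share a shard, so a routed search still needs a filter on the group to keep other groups' documents out of the results.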

Since you have a low number of documents per group, you'll probably be
better off having them all in the same index and using routing. The problem
comes if you have groups with A LOT of documents, because with routing all
documents from a group end up in the same shard, which might make your
shard sizes uneven. Then you'd probably have to move those big groups out
into their own indices, and use aliases
(http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/)
to trick apps into thinking it's a single index when they search :)
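
As a sketch of that alias setup, the body built below is the kind of request that could be sent to the `_aliases` endpoint: one alias points at both the shared index and a dedicated index for a big group, so the application keeps searching a single name. The index and alias names are hypothetical.

```python
import json

def alias_actions(alias, indices):
    """Build an _aliases request body that adds `alias` to each index."""
    return {"actions": [{"add": {"index": name, "alias": alias}}
                        for name in indices]}

# Hypothetical names: a shared index plus a dedicated index for one big group.
body = alias_actions("all-data", ["data-shared", "data-group-big"])
print(json.dumps(body, indent=2))
```

Searching `all-data` then covers both indices, and a group can be moved into its own index later without changing the application.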

I'd recommend two talks about what you're interested in:

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene

Hi Radu,

First of all, thanks for your feedback.

On Sunday, 16 June 2013 14:17:27 UTC+2, Radu Gheorghe wrote:

> Since you have a low number of documents per group, you'll probably be
> better off having them all in the same index and using routing. The problem
> comes if you have groups with A LOT of documents, because with routing all
> documents from a group will all end up in the same shard - which might make
> your shard sizes different. Then, you'd probably have to take those big
> groups out in their own indices, and use aliases
> (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/)
> to trick apps into thinking it's a single index when they search :)

To give a little context about the numbers I'm considering:

  • Some groups with around 100,000 documents per year.
  • Many groups with around 1,000,000 documents per year.
  • A reasonable number of groups with 150,000,000 documents per year.

I estimate that the number of groups will increase by about 500,000 each year.
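
To see how group sizes like these interact with routing, here is a rough simulation, assuming a single shared index with 5 primary shards and a hypothetical mix of groups at the three sizes listed above. `zlib.crc32` is only a stand-in for Elasticsearch's hash function, and the group names and counts are made up:

```python
import zlib
from collections import Counter

NUM_SHARDS = 5  # assumed primary shard count of the shared index

def docs_per_shard(groups, num_shards=NUM_SHARDS):
    """Sum document counts per shard when each group is routed to one shard."""
    shards = Counter({shard: 0 for shard in range(num_shards)})
    for name, doc_count in groups.items():
        # All documents of a group share one routing value, hence one shard.
        shards[zlib.crc32(name.encode("utf-8")) % num_shards] += doc_count
    return shards

# Hypothetical mix: 50 small, 20 medium and 1 very large group.
groups = {"small-%d" % i: 100_000 for i in range(50)}
groups.update({"medium-%d" % i: 1_000_000 for i in range(20)})
groups["large-0"] = 150_000_000

for shard, count in sorted(docs_per_shard(groups).items()):
    print(shard, count)
```

Whichever shard receives the large group holds at least 150,000,000 documents, while the other four share at most 25,000,000 between them. That imbalance is what moving the big groups into their own indices avoids.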

Great, I'll watch the two talks you recommended right away.
