Architecture question

fred · September 25, 2013, 3:42pm

Good morning (ugt),

I'm facing an important architecture question, and since there's still
time to change our software architecture, I'd like your opinion on how our
ES cluster will handle this and what the best way to do it is.

We have clients that each have many "crawl" objects associated with them, and
each crawl has hundreds of thousands of records (we expect to reach hundreds
of gigabytes quickly).

We're currently using ES this way:

each client has an index
in each index, we create one crawl_XXX mapping (where XXX is the number
associated with the crawl object). Each crawl_XXX mapping holds that crawl's records.

Pros: it's easy to find something.
Cons: hundreds of thousands of indexes with thousands of mappings, all identical.

I was wondering if it would not be better to use:

a single index
with client_id
with crawl_id

Any constructive opinion appreciated.
Thank you very much

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Graham_Lenton · September 26, 2013, 7:32am

Hundreds of thousands of indices is definitely NOT a good idea. Even if
every index had only a single shard and no replicas, your system
would need to maintain hundreds of thousands of open file handles. Is it
possible to partition your data by timestamp, e.g. have an index for each
day / week / month / year?
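
A minimal sketch of timestamp partitioning, assuming the index name is derived from each record's crawl time. The "crawls" prefix, the name format, and the function name are illustrative choices, not something prescribed by Elasticsearch.

```python
from datetime import datetime, timezone

# strftime patterns for the calendar-based granularities.
FORMATS = {"day": "%Y.%m.%d", "month": "%Y.%m", "year": "%Y"}


def index_for(ts, granularity="month", prefix="crawls"):
    """Return the name of the index that should hold a record from time ts."""
    if granularity == "week":
        # ISO week numbering; the ISO year can differ from the calendar
        # year in the first and last days of a year.
        year, week, _ = ts.isocalendar()
        suffix = f"{year}.w{week:02d}"
    else:
        suffix = ts.strftime(FORMATS[granularity])
    return f"{prefix}-{suffix}"


ts = datetime(2013, 9, 25, 15, 42, tzinfo=timezone.utc)
for g in ("day", "week", "month", "year"):
    print(g, index_for(ts, g))
```

With this scheme the number of indices grows with elapsed time rather than with the number of clients or crawls, so the granularity can be picked to keep each index at a manageable size.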


jprante · September 26, 2013, 7:39am

The crawled docs seem to share the same structure, and you have no
multi-tenancy requirements, so there is absolutely no need to use more
than one index.

So why don't you use a timestamp-rolling index with a globally unique source
ID (e.g. URL or client ID + timestamp) as doc ID?

Jörg
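
One way to read the "client ID + timestamp" suggestion is sketched below. It assumes that a client never produces two records in the same microsecond; if that does not hold, the URL would have to be part of the ID as well. The ID format, the function names, and the index name in the example are hypothetical. The action dict follows the shape of a bulk API "index" action in current Elasticsearch versions; older releases also required a _type entry.

```python
from datetime import datetime, timezone


def make_doc_id(client_id, crawled_at):
    """Combine client ID and crawl time into a deterministic doc ID.

    crawled_at must be a timezone-aware datetime; it is normalised to
    UTC so the same instant always yields the same ID.
    """
    stamp = crawled_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    return f"{client_id}-{stamp}"


def bulk_action(index_name, client_id, crawled_at):
    """Build the metadata line of a bulk 'index' action for one record."""
    return {
        "index": {
            "_index": index_name,
            "_id": make_doc_id(client_id, crawled_at),
        }
    }


when = datetime(2013, 9, 26, 7, 39, tzinfo=timezone.utc)
print(bulk_action("crawls-2013.09", 42, when))
```

Because the ID is derived from the record itself, re-indexing the same record overwrites the existing document instead of creating a duplicate.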

