Architecture question

Good morning (ugt),

I have an important architecture question, and since there is still
time to change our software architecture, I'd like your opinion on how our
ES cluster will handle this and what the best approach is.

We have clients, each with many associated "crawl" objects, and each crawl
has hundreds of thousands of records (we expect to reach hundreds of
gigabytes quickly).

We're currently using ES this way:

  • each client has an index
  • on each index, we create one crawl_XXX mapping (where XXX is the number
    associated to the crawl object). In each crawl_XXX, we have the records.

Pros: it's easy to find something.
Cons: hundreds of thousands of indexes with thousands of mappings, all the same.

I was wondering whether it would be better to use:

  • a single index
  • with client_id
  • with crawl_id
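
As a rough sketch of the single-index layout proposed above: every record carries client_id and crawl_id as ordinary fields, and each search is restricted with two term filters. The helper names (crawl_record, crawl_query) and the extra fields are hypothetical. The bool/filter form shown here is the query DSL of current Elasticsearch releases; the 2013-era releases expressed the same restriction with a filtered query.

```python
import json


def crawl_record(client_id, crawl_id, url, fields):
    """Build one record for the shared index; the IDs are plain fields."""
    doc = {"client_id": client_id, "crawl_id": crawl_id, "url": url}
    doc.update(fields)
    return doc


def crawl_query(client_id, crawl_id, query=None):
    """Build a search body limited to one crawl of one client."""
    return {
        "query": {
            "bool": {
                # Scoring part of the search; match everything by default.
                "must": query if query is not None else {"match_all": {}},
                # Non-scoring restrictions that replace the per-client
                # index and the per-crawl mapping.
                "filter": [
                    {"term": {"client_id": client_id}},
                    {"term": {"crawl_id": crawl_id}},
                ],
            }
        }
    }


body = crawl_query(client_id=42, crawl_id=1337)
print(json.dumps(body, indent=2))
```

With this layout, "finding something" means adding two filters to every search instead of choosing an index and a mapping name.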

Any constructive opinion appreciated.
Thank you very much

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hundreds of thousands of indices is definitely NOT a good idea. Even if
every index had only a single shard and no replicas, your system
would need to maintain hundreds of thousands of open file handles. Is it
possible to partition your data by timestamp, e.g. with one index per
day / week / month / year?
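
A minimal sketch of what partitioning by timestamp could look like: the index name is derived from the record's timestamp at the chosen granularity. The "crawls" prefix, the name format and the index_for helper are assumptions made for illustration, not something specified in this thread.

```python
from datetime import datetime, timezone

# strftime patterns for the granularities that map directly to a format.
_FORMATS = {"day": "%Y.%m.%d", "month": "%Y.%m", "year": "%Y"}


def index_for(ts, granularity="month", prefix="crawls"):
    """Return the name of the time-based index a record belongs to."""
    if granularity == "week":
        # ISO week numbering; the ISO year can differ from the calendar
        # year in the first and last days of a year.
        iso_year, iso_week, _ = ts.isocalendar()
        return f"{prefix}-{iso_year}.w{iso_week:02d}"
    return f"{prefix}-{ts.strftime(_FORMATS[granularity])}"


ts = datetime(2013, 9, 25, 15, 42, 2, tzinfo=timezone.utc)
for granularity in ("day", "week", "month", "year"):
    print(granularity, index_for(ts, granularity))
```

The coarser the granularity, the fewer indices (and open file handles) the cluster has to maintain, at the cost of larger individual indices.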

In reply to fr...@botify.com's message of Wednesday, 25 September 2013 15:42:02 UTC.

The crawled docs seem to share the same structure, and you have no
multi-tenancy requirements, so there is absolutely no need to use more
than one index.

So why don't you use a timestamp-rolling index with a globally unique source
ID (e.g. URL or client ID + timestamp) as doc ID?
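
One possible reading of this suggestion, as a sketch only: the doc ID is built from the client ID and the crawl timestamp, and the record is written to a monthly index through a bulk "index" action. The ID format, the one-second resolution, the "crawls" prefix and the helper names are all assumptions; an ID built this way is only unique if a client never produces two records in the same second, so the URL may be the safer choice.

```python
import json
from datetime import datetime, timezone


def doc_id(client_id, crawled_at):
    """Compose a doc ID from client ID + timestamp (second resolution)."""
    return f"{client_id}-{crawled_at:%Y%m%dT%H%M%S}"


def bulk_lines(client_id, crawled_at, source):
    """Return the two NDJSON lines of one bulk 'index' action."""
    action = {
        "index": {
            # Monthly rolling index chosen from the record's timestamp.
            "_index": f"crawls-{crawled_at:%Y.%m}",
            "_id": doc_id(client_id, crawled_at),
        }
    }
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"


when = datetime(2013, 9, 25, 15, 42, 2, tzinfo=timezone.utc)
print(bulk_lines(42, when, {"client_id": 42, "url": "http://example.com/"}), end="")
```

Because the ID is derived from the data, indexing the same record again overwrites the existing document instead of creating a duplicate.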

Jörg
