I'm asking myself an important architecture question, and since it's still
time to change our software architecture, I'd like your opinion on how our
ES cluster will handle this and what's the best way to do it.
We have clients that each have many "crawl" objects associated with them, and each crawl
has hundreds of thousands of records (we expect hundreds of gigabytes of data quickly).
We're currently using ES this way:
each client has an index
on each index, we create one crawl_XXX mapping (where XXX is the number
associated with the crawl object). Each crawl_XXX holds that crawl's records.
Pros: it's easy to find something.
Cons: hundreds of thousands of indexes with thousands of mappings, all the same.
I was wondering whether it would be better to use:
a single index
with a client_id field
and a crawl_id field on every record (see the sketch below)
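Concretely, here is a rough sketch of what I have in mind (Python, using the official elasticsearch client; the index name, field names and values are just placeholders, not our real schema):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Every record carries its owner and its crawl as plain fields.
record = {
    "client_id": 42,                     # which client owns this record
    "crawl_id": 1337,                    # which crawl produced it
    "url": "http://example.com/page",
    "http_status": 200,
}
# doc_type reflects the mapping-type API of this ES era (0.90/1.x).
es.index(index="crawls", doc_type="record", body=record)

# Fetching one crawl's records becomes a filtered query instead of a
# dedicated mapping per crawl.
results = es.search(index="crawls", body={
    "query": {
        "bool": {
            "must": [
                {"term": {"client_id": 42}},
                {"term": {"crawl_id": 1337}},
            ]
        }
    }
})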
Any constructive opinion appreciated.
Thank you very much
Hundreds of thousands of indices is definitely NOT a good idea. Even if
every index had only a single shard and no replicas, your system would
still need to maintain hundreds of thousands of open file handles. Is it
possible to partition your data by timestamp? e.g. have an index for each
day / week / month / year?
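A minimal sketch of that kind of time-based partitioning, assuming a monthly rollover (index and field names are only illustrative):

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

def index_record(record, crawl_date):
    # One index per month, e.g. "crawl_records-2013.09"; old periods can be
    # closed or deleted wholesale instead of deleting documents one by one.
    index_name = "crawl_records-%s" % crawl_date.strftime("%Y.%m")
    es.index(index=index_name, doc_type="record", body=record)

index_record({"client_id": 42, "crawl_id": 1337, "url": "http://example.com/"},
             datetime(2013, 9, 25))

# A wildcard (or an alias) lets queries span all periods at once.
es.search(index="crawl_records-*",
          body={"query": {"term": {"crawl_id": 1337}}})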
The crawled docs seem to share the same structure, and you have no
multi-tenancy requirements, so there is absolutely no need to use more
than one index.
So why don't you use a timestamp-rolling index with a globally unique source
ID (e.g. URL, or client ID + timestamp) as the doc ID?
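For example, something along these lines (a sketch only; the hash-based ID and the index/field names are one possible choice, not a prescription):

import hashlib
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

def make_doc_id(client_id, url):
    # Deterministic ID: re-crawling the same URL for the same client updates
    # the existing document instead of creating a duplicate.
    return hashlib.sha1(("%s:%s" % (client_id, url)).encode("utf-8")).hexdigest()

def index_crawled_page(client_id, crawl_id, url, payload, when=None):
    when = when or datetime.utcnow()
    index_name = "crawl_records-%s" % when.strftime("%Y.%m")  # rolls over monthly
    doc = dict(payload, client_id=client_id, crawl_id=crawl_id, url=url)
    es.index(index=index_name, doc_type="record",
             id=make_doc_id(client_id, url), body=doc)

index_crawled_page(42, 1337, "http://example.com/page", {"http_status": 200})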