Hello. I am working on an application to index documents using
Elasticsearch. I estimate that a month's worth of indexes will run to about
1.5 GB. The vast majority (90% at least) of queries are made against items
for the previous 24 hours, which is about 60 MB in indexes. It thus seems
logical to create per-day indexes. Are there other considerations though in
creating hundreds of indexes (for years of data)? Is there significant
overhead such as index load time etc. in making a query that would cover
30+ indexes? The servers I am considering for Elasticsearch have 7.5 GB of
RAM (m1.large EC2). Might per-month indexes (1.5 GB) be OK for this sort of
situation or perhaps 3/month? I don't need anyone to tell me what to do but
I would value any thoughts on advantages and tradeoffs.
Hi Ian,
Querying 30 indices of one shard each or one index made of 30 shards is
exactly the same in terms of the number of shards that need to execute the
query.
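As a rough illustration of the point above, here is a minimal Python sketch
that builds the names of 30 daily indices and compares the shard count with a
single 30-shard index. The "docs" prefix, the "docs-YYYY.MM.DD" naming scheme
and the helper names are assumptions made up for the example, not something
from your setup; the only Elasticsearch feature relied on is that a search can
target several indices by listing them comma-separated in the URL path.

```python
from datetime import date, timedelta


def daily_index_names(prefix, last_day, days):
    """Names of the daily indices covering `days` days ending at `last_day`.

    The prefix-YYYY.MM.DD naming scheme is an assumption for this example.
    """
    return [
        f"{prefix}-{(last_day - timedelta(days=offset)):%Y.%m.%d}"
        for offset in range(days - 1, -1, -1)
    ]


def shards_queried(index_count, primary_shards_per_index):
    """Number of primary shards that have to execute one query."""
    return index_count * primary_shards_per_index


# 30 daily indices ending on the date of this thread (hypothetical prefix).
names = daily_index_names("docs", date(2013, 10, 17), 30)

# Several indices can be searched at once by comma-separating their names.
search_path = "/" + ",".join(names) + "/_search"

print(names[0], "...", names[-1])
print("30 indices x 1 shard :", shards_queried(len(names), 1))
print("1 index x 30 shards  :", shards_queried(1, 30))
```

Both layouts come out at 30 shards per query, so the choice between them is
more about how you manage and expire data than about query fan-out.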
Time-based indexing seems to be a good fit in your case, but I'd suggest
running some performance tests to understand the capacity of a single
shard with your data, queries and hardware. That way you should be able to
tell whether an index per day is a waste (each index needs at least one
shard, and you might not index enough documents in a single day to justify
one).
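To put some numbers on that, here is a back-of-envelope sketch in Python using
the sizes from the original post (about 60 MB per day, about 1.5 GB per
month). The two-year retention period, one primary shard per index, and the
plan() helper are assumptions for illustration only; it says nothing about how
much a single shard can actually hold, which only testing will show.

```python
MONTHLY_INDEX_MB = 1.5 * 1024  # ~1.5 GB per month, from the original post
DAILY_INDEX_MB = 60            # ~60 MB per day, from the original post
YEARS = 2                      # assumed retention period


def plan(indices_per_year, index_size_mb, years=YEARS, shards_per_index=1):
    """Total shard count and approximate shard size for one indexing scheme."""
    return {
        "total_shards": indices_per_year * years * shards_per_index,
        "approx_shard_mb": index_size_mb / shards_per_index,
    }


schemes = {
    "daily": plan(365, DAILY_INDEX_MB),
    "three_per_month": plan(36, MONTHLY_INDEX_MB / 3),
    "monthly": plan(12, MONTHLY_INDEX_MB),
}

for name, result in schemes.items():
    print(f"{name:16} shards={result['total_shards']:4d} "
          f"size~{result['approx_shard_mb']:.0f} MB")
```

Under these assumptions the daily scheme ends up with hundreds of very small
shards, while the monthly one has a couple of dozen shards of about 1.5 GB
each.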
Have you had the chance to watch this talk
already: http://vimeo.com/44716955 ? It elaborates on some advanced data
design patterns that you can apply depending on how your data flows into
your system. That doesn't mean that you need to use custom routing, but
it's something that you might want to consider or at least be aware of.
Cheers
Luca
(In reply to Ian Marsman's message of Thursday, October 17, 2013 2:10:52 AM UTC+2.)