I will be starting some testing using Elasticsearch for a really heavy
amount of data (at least heavy for me, for now).
I will need a cluster handling about 5000 GB of text for searching.
Would it be better to have several smaller clusters, organized according
to my search needs?
Should I handle scheduled indexing in some way?
I need searches to finish in a time acceptable to an end user
performing queries, but I don't expect a heavy query load.
Still, I'm worried about the size of the data: it will be millions of
small entries; I guess each entry will be less than 1 KB, with an average
around 200 bytes.
I would greatly appreciate comments from anyone with experience of
something like this.
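For a sense of scale, a quick back-of-envelope calculation from the figures above (5000 GB total, ~200 bytes per entry on average) suggests the corpus is well beyond "millions" of entries:

```python
# Back-of-envelope document count from the figures in the question:
# ~5000 GB of text at ~200 bytes per entry on average.
total_bytes = 5000 * 10**9   # ~5 TB of raw text
avg_entry_bytes = 200        # average entry size from the question

doc_count = total_bytes // avg_entry_bytes
print(f"~{doc_count:,} documents")  # ~25,000,000,000 documents
```

That is on the order of tens of billions of documents, which matters for shard counts and per-shard overhead.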
Plain indexing is not a problem as such. The challenge is making
queries performant over your index, and advice here will differ
depending on your queries: will you be using facets? Will your queries
only hit a subset of all the data (such as period-based queries)? How
often will you be indexing new documents? In batches?
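To illustrate the period-based point: one common pattern (an assumption here, not something the poster has confirmed fits their data) is to use one index per time period, so a date-range query only has to touch the indices covering that range. A minimal sketch, with a hypothetical `logs-` naming scheme:

```python
from datetime import date

def monthly_indices(start: date, end: date, prefix: str = "logs-") -> list[str]:
    """Names of the hypothetical monthly indices a [start, end] range
    query would need to touch, e.g. 'logs-2012-09'. With one index per
    month, period-based queries can skip most of the data entirely."""
    names = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        names.append(f"{prefix}{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return names

print(monthly_indices(date(2012, 8, 15), date(2012, 10, 4)))
# ['logs-2012-08', 'logs-2012-09', 'logs-2012-10']
```

A two-month query then searches 2 indices instead of the whole 5 TB, and old indices can be dropped or archived wholesale.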
5 TB is not small, but it is doable, depending on a number of factors:
hardware, ES configuration, query complexity, query concurrency, query
latency requirements, etc.
Unfortunately, nobody can give you precise advice without knowing a lot
more details about the above. You'll want to look at sharding,
oversharding, replication, cache sizes, compression, routing, filtering,
and so on. Again, nobody can give exact guidance or answers without
knowing a lot more.
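As a rough illustration of the sharding and routing ideas mentioned above (the target shard size and the crc32 hash below are stand-ins for illustration, not Elasticsearch's actual defaults):

```python
import zlib

TOTAL_GB = 5000
TARGET_SHARD_GB = 40   # assumed comfortable per-shard size; tune for your hardware

# How many primary shards the 5 TB corpus would need at that shard size
# (ceiling division).
primary_shards = -(-TOTAL_GB // TARGET_SHARD_GB)
print(primary_shards)  # 125

def shard_for(routing_key: str, n_shards: int) -> int:
    """Sketch of routing: a document's routing key hashes to exactly one
    primary shard, so queries that supply the same key can be answered
    from that shard alone. crc32 stands in for Elasticsearch's own hash."""
    return zlib.crc32(routing_key.encode()) % n_shards
```

The point is only that shard count is fixed at index creation, so it has to be sized for the full 5 TB up front, and that custom routing can confine a query to a single shard when the data has a natural partition key.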
On Thursday, October 4, 2012 12:05:40 AM UTC-4, Hernán Leoni wrote: