First off, ES seems to be pretty great from my testing. If all holds, I would love to choose this sort of product over an Apache Zookeeper + SOLR (ugh the admin)
After testing a few things, and trying out some configs, I have the following questions
1 - We currently run a web crawler that gets around 100K documents per day. As we get the docs, we tag it with meta data, such as the topics associated with it, named entities, is it a video, etc.
We are up to around 30 million docs.
What sort of tricks/performance would I expect to get using this distributed ES when Sorting by a publication date?
So.. if I do a query where I ask to show me all documents that match a "Topic" then sort by date, is this something that will query against all shards/replicas?
For the purpose of this question, lets say I do 3 shards, 2 replicas.
From just reviewing, it seems like all shards or replicas would be queried for the query, then who ever initiated the query would get the results back, and that box would then have to SORT. Is that correct?
Any better way to do that?
a - in the realms of better ways, would forcing certain Topic or Meta Tagged content into the same shards make sense?
b - Is this something where you would want to have an index per Topic? We could have millions of topics though
Or, am I worried about nothing here, and just having some small amount of shards + replicas would give me sub second responses, no real trickery needed?
- just some more on this. Most of the time we do a query where we get documents that have "Topic(s)" and return by date. Sometimes we do "Topic(s)" + keywords of the text, or Topic(s) + what source they came from.
2 - Shards and replica placements
I have seen talk in this group about multi-data center awareness, and wanted to see if there was any good response to this.
We are hosted at EC2 and would initially put these ES2 clusters in us-east. I would want a hot backup in us-west though, just in case east failed. I would not want any of the shards/replicas to take any traffic though.
Almost like... I have 3 shards/2 replicas in us-east, across say 5 nodes. I would want to run say 2 nodes in us-west as a hot backup, just in case there is a major failure.
Anyway to do that? Or is it just, you fail in us-east, and then you try and start from your s3 gateway in us-west?
Or, is there a way to have other machines in us-west be a gateway, like s3 is, and recover from those?
3 - Speaking of #2 and recovery
I know that there is the s3 gateway, what is a good way to possibly attach an EBS, snapshot that EBS, and then recover from that snapshot? Would all "shard" nodes need to have this EBS?
Or, would every node in the cluster need EBS, and you have to snapshot all of them, just to have recovery?
4 - index freshness
so, I got a bit confused at the documentation on index freshness. There is a index refresh versus the transaction log timings. Just trying to figure out what happens after I insert a doc.
So, at time 0, I index document A
by default, after 1 second, I should be able to query for document A?
That is what the index refresh says.
Or, is it 60 minutes, which is the default on the transaction log?
if it is the default 1 second for the index, what is the transaction log doing that it has some kind of 60 minute flush?
5 - API queries for health errors
so, according to the docs, I was supposed to be able to call this api url
which gave me an error that said you can't have a cluster with "_" in the name
so, then I did
and got this error:
"error" : "ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=0): ]",
"status" : 500
Getting and Putting documents were fine though, so, just trying to figure out why this did not work.
Thanks for any help in the evaluation. So far, this seems like the easiest and best way to manage a large, distributed index. Way better than the Zookeeper/SOLR path.