Hello All,
First off, ES seems pretty great from my testing. If it all
holds up, I would love to choose this sort of product over an Apache
Zookeeper + SOLR setup (ugh, the admin).
After testing a few things and trying out some configs, I have the
following questions:
1 - We currently run a web crawler that pulls in around 100K documents per
day. As we get the docs, we tag them with metadata, such as the topics
associated with them, named entities, whether it is a video, etc.
We are up to around 30 million docs.
What sort of performance should I expect from distributed ES when
sorting by a publication date, and are there any tricks to getting it?
So, if I run a query asking for all documents that match a "Topic"
and then sort by date, is that something that will query against
all shards/replicas?
For the purpose of this question, let's say I use 3 shards and 2
replicas.
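In config terms, I am assuming that just means creating the index with
something like this (the index name "docs" is made up):

    curl -XPUT 'http://localhost:9200/docs/' -d '{
      "settings" : {
        "index" : {
          "number_of_shards" : 3,
          "number_of_replicas" : 2
        }
      }
    }'

Correct me if I am misreading what a replica means here.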
From just reviewing the docs, it seems like every shard (or one of its
replicas) would be queried, whichever node initiated the query would get
the results back, and that box would then have to do the sort. Is that
correct? Is there any better way to do that?
a - Along the lines of better ways, would forcing content with certain
Topic or meta tags into the same shards make sense? (Rough sketch of
what I mean below.)
b - Is this something where you would want to have an index per
Topic? We could have millions of topics, though.
Or am I worrying about nothing here, and would a small number of
shards + replicas give me sub-second responses, with no real trickery
needed?
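For (a), I am assuming the routing parameter is the mechanism for steering
documents onto a particular shard. If so, I picture it roughly like this
(index/type/field names are just ours, and I may be misreading the docs):

    # index a doc so everything for the same topic lands on the same shard
    curl -XPUT 'http://localhost:9200/docs/doc/1?routing=some_topic' -d '{
      "topic" : "some_topic",
      "published_date" : "2011-01-15"
    }'

    # search with the same routing value so only that shard gets hit
    curl -XGET 'http://localhost:9200/docs/_search?routing=some_topic' -d '{
      "query" : { "term" : { "topic" : "some_topic" } }
    }'

Is that the intended use, or is there a better approach?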
- Just some more on this: most of the time we run a query that gets
documents that have certain Topic(s) and returns them by date. Sometimes
we do Topic(s) + keywords from the text, or Topic(s) + the source they
came from.
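To make that concrete, the typical query would look roughly like this
(field names are our own, and I am going off the query DSL docs, so
correct me if the syntax is off):

    curl -XGET 'http://localhost:9200/docs/_search' -d '{
      "query" : { "term" : { "topic" : "some_topic" } },
      "sort" : [ { "published_date" : { "order" : "desc" } } ],
      "size" : 20
    }'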
2 - Shard and replica placement
I have seen talk in this group about multi-data-center awareness and
wanted to see if there was any good answer to this.
We are hosted on EC2 and would initially put these ES clusters in
us-east. I would want a hot backup in us-west though, just in case
east failed. I would not want any of those us-west shards/replicas to
take any traffic, though.
Almost like... I have 3 shards / 2 replicas in us-east across, say, 5
nodes, and I would want to run, say, 2 nodes in us-west as a hot backup,
just in case there is a major failure.
Any way to do that? Or is it just: us-east fails, and then you try to
start up from your s3 gateway in us-west?
Or is there a way to have other machines in us-west act as a gateway,
the way s3 does, and recover from those?
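For what it is worth, I am assuming the s3 gateway side of this would be
configured something like the following in elasticsearch.yml (key names
are from memory / the cloud plugin docs, so they may be slightly off):

    gateway.type: s3
    gateway.s3.bucket: our-es-gateway-bucket
    cloud.aws.access_key: <access key>
    cloud.aws.secret_key: <secret key>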
3 - Speaking of #2 and recovery
I know there is the s3 gateway, but what would be a good way to
attach an EBS volume, snapshot that EBS, and then recover from the
snapshot? Would all "shard" nodes need to have this EBS?
Or would every node in the cluster need an EBS volume, and you have to
snapshot all of them just to have recovery?
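What I am picturing is each data node keeping its data directory on its
own EBS volume, i.e. something like this in elasticsearch.yml (the path
is made up):

    path.data: /mnt/ebs/elasticsearch/data

and then snapshotting those volumes on a schedule. I am just not sure
whether that is a sane recovery strategy or whether the gateway makes it
pointless.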
4 - Index freshness
I got a bit confused by the documentation on index freshness.
There is the index refresh versus the transaction log timings, and I am
just trying to figure out what happens after I insert a doc.
So, at time 0, I index document A.
By default, after 1 second, I should be able to query for document A?
That is what the index refresh says.
Or is it 60 minutes, which is the default on the transaction log?
If it really is the default 1 second for the index refresh, what is the
transaction log doing that it has some kind of 60-minute flush?
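Put another way, I am assuming index.refresh_interval is the knob that
controls the 1 second, e.g. something like this (not sure whether it can
be changed on a live index or only at creation time):

    curl -XPUT 'http://localhost:9200/docs/_settings' -d '{
      "index" : { "refresh_interval" : "1s" }
    }'

and that the transaction log flush is about durability rather than about
when a doc becomes searchable. Is that the right way to read it?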
5 - API queries for health errors
According to the docs, I was supposed to be able to call this API
URL:
/_cluster/health?pretty=true
but it gave me an error saying you can't have a cluster with "_" in
the name.
So then I tried
//health?pretty=true
and got this error:
{
  "error" : "ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=0): []]",
  "status" : 500
}
Getting and putting documents work fine, though, so I am just trying to
figure out why this did not work.
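For reference, this is exactly what I am running (host/port are just our
test node):

    curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'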
Thanks for any help with the evaluation. So far, this seems like the
easiest and best way to manage a large, distributed index, and way better
than the Zookeeper/SOLR path.
Scott