Hi there,
I am new to Elasticsearch and Lucene and have been prototyping my business
case with a single node, both locally and in AWS. I want to move to the next
step and start building out an ES cluster in AWS. One thing I am trying to
figure out is the recommended number of nodes, the hardware specs for those
nodes, and the sharding/index strategy I should use for my business case.
Any advice, best practices, or tips would be much appreciated. Here is some
info about the documents I am indexing.
The documents I am indexing represent an object, let's say a user. For each
user I have facts that I need to retain on a weekly basis for possibly 2
years. I also need to be able to query and sort on deltas of those facts. To
support sorting on fact deltas, each object contains all the facts for every
week, and I use the custom score query with a script to do something like:
"doc.facts.week1.fact2 - doc.facts.week2.fact2;"
For this case the document JSON looks something like this:
{
  "name": "User1",
  "facts": {
    "week1": {
      "fact1": 10,
      "fact2": 100
    },
    "week2": {
      "fact1": 30,
      "fact2": 500
    }
  }
}
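For concreteness, the delta-sort query I have been prototyping is shaped
roughly like this (a sketch, not my exact query; it assumes the custom_score
query available in the 0.90-era API, and assumes the nested fact fields are
addressed by their flattened paths in doc values):

```json
{
  "query": {
    "custom_score": {
      "query": { "match_all": {} },
      "script": "doc['facts.week1.fact2'].value - doc['facts.week2.fact2'].value"
    }
  }
}
```

The script's value becomes the document score, so results come back sorted
by the week-over-week delta.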
My actual documents have a bit more in them, so here are some sizing details:
Average document size with 10 weeks of data: 178.5 KB
Extrapolated to 104 weeks (2 years) of data: 1.74 MB per doc
Current estimates put about 4MM objects in the system, which can grow to
10MM objects in the next year.
So to start out, the cluster would hold about 680 GB and could grow to
16 TB+ over the next two years.
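For anyone checking my math, here is the back-of-the-envelope calculation
behind those totals (a sketch; it assumes binary units, i.e. 1 GB = 1024^2 KB,
and counts primary data only, before replicas):

```python
# Back-of-the-envelope sizing from the figures above
# (binary units assumed; primaries only, no replicas).
avg_doc_kb_now = 178.5       # avg doc size with 10 weeks of data
avg_doc_mb_2yr = 1.74        # extrapolated to 104 weeks of data
docs_now = 4_000_000         # current object count estimate
docs_next_year = 10_000_000  # projected object count

gb_now = avg_doc_kb_now * docs_now / 1024 ** 2
tb_2yr = avg_doc_mb_2yr * docs_next_year / 1024 ** 2

print(f"initial: ~{gb_now:.0f} GB of primary data")  # roughly the 680 GB above
print(f"2 years: ~{tb_2yr:.1f} TB of primary data")  # roughly the 16 TB+ above
```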
Finally, I am also trying to understand best practice for a sharding/index
strategy. My understanding is that querying against 1 index with 2 shards is
equivalent to querying against 2 indices with 1 shard each. However, the
number of shards is fixed when an index is created and cannot be changed
afterwards. My concern is what the best strategy would be so that an index,
or a single shard within an index, does not get too big for one node to
handle, and what can be done if it is approaching that size. Here is some
more info about my data to inform a possible sharding/index strategy.
The data is partitioned by accounts. Accounts have regions, and regions
contain users. Most accounts have around 3-5 regions and 10,000-50,000 users
in total. Some larger accounts have about 100 regions with around 100,000
users.
My initial thought was to create an index per account, or an index per
region, with just 1 shard and 1 or 2 replicas for redundancy. But I wasn't
sure what I would do if an index became too large for one node to handle.
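To make the per-account idea concrete, index creation would look roughly
like this (a sketch; the index name "account-acme" is made up for
illustration, and it assumes the REST API is reachable on localhost:9200):

```shell
# Create one index per account: a single primary shard plus one replica.
# ("account-acme" is a hypothetical account name.)
curl -XPUT 'http://localhost:9200/account-acme' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```

With one shard per account index, each account's data stays together, but a
single very large account would still be limited to what one node can hold,
which is exactly the concern above.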
Please let me know if more info would be useful.
Thanks in advance for any help.
-dave
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.