Huge Index


(senthil prabhu) #1

hi team,
If I have 500 GB of data for an index, how can I configure
shards and replicas for it, and also the gateway?

How many servers do I need to achieve this?

Is it possible to split the gateway?

Pls help..


(harryf) #2

For understanding the role of shards and replicas, this thread from
the mailing list is a good read - http://goo.gl/hFOVr

On Dec 30, 10:16 am, senthil prabhu senthils...@gmail.com wrote:


(dbenson) #3

When we started our deployment, we thought the shared gateway would be
ideal. Having a central place with all our index data seemed
conceptually nice, and it would provide the warm fuzzy of a backup.

But when we got further along, we came to recognize the limitations of
the shared file system gateway and the benefits of the local gateway.

  • Single point of failure - if the gateway goes down for an extended
    period, snapshotting will fail. We used confirmation of the
    snapshot to acknowledge a new doc from our document repository.
  • NFS can easily saturate 1G NICs and cause clusters to become split,
    which requires manual intervention.
  • The gateway needs to be large enough to store the full index.

The local gateway ends up acting like software RAID 10 (1+0)

  • Replica count acts as mirroring (RAID 1). In our environment we have
    replica=1, but we have multiple data centers, each with a full index.
    In the event of an extensive failure, we can route client traffic around
    the problem. If you don't have this luxury, you may want a higher
    replica count.
  • Shards act like striping (RAID 0). Set the shard count so that
    each shard stays a manageable size.
  • Adding more servers means each server holds a smaller portion of
    the overall index, which improves performance and reliability.
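A minimal sketch of how these knobs map to configuration in that era of
Elasticsearch (values are illustrative, not a recommendation - 10
primaries would give roughly 50 GB per shard for a 500 GB index):

```yaml
# elasticsearch.yml fragment (shard/replica counts illustrative)
index.number_of_shards: 10     # primaries; fixed at index creation
index.number_of_replicas: 1    # one extra copy of each shard (RAID 1 role)
gateway.type: local            # local gateway instead of a shared filesystem
```

Note that the primary shard count cannot be changed after the index is
created, so it pays to pick it with eventual data volume in mind.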

The size of the index is one consideration for the number of servers,
but query volume and complexity are another driving factor. I'd
consider two servers an absolute bare minimum, with three providing a
margin for failure. Obviously, your budget may constrain or expand
your choices.
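The sizing reasoning above can be sketched as back-of-the-envelope
arithmetic (all numbers are illustrative assumptions, not measurements):

```python
# Rough capacity math for a 500 GB index (illustrative values).
index_gb = 500
target_shard_gb = 50            # keep each shard a manageable size
replicas = 1                    # one extra copy of every shard

shards = index_gb // target_shard_gb       # primary shard count
total_gb = index_gb * (1 + replicas)       # storage across the whole cluster
servers = 3                                # bare minimum with a failure margin
per_server_gb = total_gb / servers         # index data each node must hold

print(shards, total_gb, round(per_server_gb))
```

The point is that replicas multiply total storage, while adding servers
divides the per-node share, so both knobs feed the hardware budget.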

We have 3 servers in each data center, with 28M docs consuming 170G
disk (soon to shrink with ES 0.14), handling about 6k req/min for
client queries and 195k document matches/minute for alerting purposes.
With our hardware, we're hardly taxing them and still averaging
30-35ms response times.

David

