Elastic cluster recommendation for large data set

I'm planning to use Elasticsearch for reporting purposes. I'm very new to Elasticsearch, so I'm looking for some advice in the areas below.

Scenario
For reporting purposes I'm using ES. My data set is roughly estimated at 5 billion documents. It contains only one type of data, hence one index. I need to keep data for a 3-month period. The data will be searched to generate reports, so I know most of the queries I'll be running. The main requirement is low response time. Also, I cannot lose a single document.

  1. To begin with the index design, I'm using shards = 1000, replica = 1. Is 1000 shards too much? Is replica = 1 good enough, given that I don't want to lose a document?

  2. I have fields which should not be searchable but should appear in results returned by other searches. Is there any benefit in marking those as index: false? I'm looking for high-performance queries and reduced disk usage.

  3. For the cluster, I'm starting with 3 dedicated master nodes, 2 data nodes (which I expect to grow later), and 1 client node. Is this a good cluster to start with? (HA and avoiding split brain are the focus here.)

  4. For index deletion, I'm thinking of creating monthly indices and then deleting indices that are 3 months old using Curator. Would this delete affect any queries running at that time? Is there a better approach here?

  5. Do I need to take manual backups, or would replica = 1 be enough?

Is there any other advice, or areas I should look into more?

Sounds wildly excessive. If you are going to use monthly, weekly or maybe even daily indices, how large do you expect them to become? This will depend a lot on the size of the documents and how many fields you index. I would recommend starting with just one or a few primary shards per index. If they turn out too large you can use the split index API to adjust this.
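As a rough sketch of what that could look like with the Python client (the index name "reports-2018.09", the URL and the shard counts are placeholders, not from this thread):

```python
# Rough sketch, assuming the official elasticsearch-py client; the index
# name, URL and shard counts below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Start each time-based index with only a few primary shards.
es.indices.create(
    index="reports-2018.09",
    body={"settings": {"number_of_shards": 2, "number_of_replicas": 1}},
)

# If an index turns out too large, the split index API can raise the primary
# shard count. The index must be made read-only first, and depending on the
# ES version the source may need index.number_of_routing_shards set at
# creation time.
es.indices.put_settings(index="reports-2018.09", body={"index.blocks.write": True})
es.indices.split(
    index="reports-2018.09",
    target="reports-2018.09-split",
    body={"settings": {"index.number_of_shards": 4}},
)
```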

This will probably mostly have an impact on how much disk space the index takes up.
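As an illustration (the field names here are made up, not from your data set), a mapping fragment could look like this:

```python
# Illustrative mapping fragment; field names are invented. Fields that are
# only ever returned in the _source of search results, and never queried or
# aggregated on, can be mapped with "index": false.
mappings = {
    "properties": {
        "customer_id":   {"type": "keyword"},                  # queried / aggregated
        "report_text":   {"type": "text"},                     # full-text searched
        "raw_payload":   {"type": "keyword", "index": False},  # returned only
        "internal_note": {"type": "text", "index": False},     # returned only
    }
}
```

Note that keyword fields keep doc_values by default, so disabling `index` mainly saves the inverted index rather than all disk usage.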

Sounds reasonable, especially if you envision expanding the cluster later.

This is a common approach that is tried and tested. If you are keeping 3 months' worth of data, it might however make sense to have each index cover a shorter time period to make deletes more granular.
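As a rough illustration of the deletion step (the index naming scheme and the 90-day cutoff are assumptions; Curator would normally drive this from its YAML action file):

```python
# Hedged sketch of time-based index deletion with plain elasticsearch-py.
# Index names are assumed to embed their month, e.g. "reports-2018.06".
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
cutoff = datetime.utcnow() - timedelta(days=90)

for name in es.indices.get(index="reports-*"):
    # Parse the month out of the index name.
    month = datetime.strptime(name.split("-", 1)[1], "%Y.%m")
    if month < cutoff:
        # Dropping a whole index is far cheaper than deleting documents;
        # queries still running against it at that moment will fail.
        es.indices.delete(index=name)
```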

Having a replica gives you some protection against data loss, but I would also recommend taking snapshots periodically.
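If you go down the snapshot route, it could look roughly like this (the repository name, filesystem path and snapshot naming scheme are assumptions):

```python
# Minimal snapshot sketch with elasticsearch-py; repository name, shared
# filesystem path and snapshot names are placeholders.
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Register a snapshot repository once (the path must be listed under
# path.repo in elasticsearch.yml on every node).
es.snapshot.create_repository(
    repository="reports_backup",
    body={"type": "fs", "settings": {"location": "/mnt/es_backups"}},
)

# Then take periodic snapshots, e.g. from cron; snapshots are incremental,
# so repeated runs only copy new segments.
es.snapshot.create(
    repository="reports_backup",
    snapshot="snapshot-" + datetime.utcnow().strftime("%Y%m%d"),
    body={"indices": "reports-*"},
    wait_for_completion=False,
)
```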


Thanks for your response.

If you are going to use monthly, weekly or maybe even daily indices, how large do you expect them to become?

Daily I would expect 60 million documents, while weekly it would be a 4 billion data set. Also, I have 32 fields in total, of which 18 are indexed fields that need to be analyzed (queries run aggregations on these fields). The others do not need to be analyzed.

Also, could you please explain the impact of the weekly (60 million) vs monthly (4 billion) index approaches? Which gives better performance? Which consumes more disk? Or are both approaches the same?

Pardon me for another question. Would it be possible to run client nodes on hosts shared with other services (microservices written in Java)? I think master and data nodes should be dedicated to running the Elasticsearch service. Am I correct? My concern is that since my cluster has a lot of other services (mainly Java) running, could I share hosts running relatively low-processing services with Elasticsearch nodes to reduce cost?
