Need help optimizing an index with 450 million entries

Hi everybody,

I am currently having massive problems with my Elasticsearch setup. I store aggregated Google Analytics data in Elasticsearch for further processing, and there are currently around 450 million documents in this index.

The problem is that the aggregation queries I need to run against the index sometimes take longer than 60 seconds, which results in a "Gateway Timeout".

The much bigger problem is that the Spark jobs I use to add data once a day as a batch have started failing, because the index apparently can't keep up with inserting the documents. This leaves the data inconsistent and therefore of little value.

When I started, I didn't configure much, as shown below. After the first problems occurred I started researching indices, shards and so on, and read that it is a good idea to split the index into monthly indices.

That was my next and current attempt, but it performs even worse for lookups. I haven't tested it for writing, though.

Since the amount of data is quite large and copying it from one index to another takes quite some time, there are not many attempts I can afford. That's why some help would really be appreciated.

I started with a very basic index without much configuration:

    {
      "kpi_dashboard" : {
        "aliases" : { },
        "mappings" : {
          "properties" : {
            "date" : {
              "type" : "date"
            },
            "fullVisitorId" : {
              "type" : "keyword",
              "eager_global_ordinals" : true
            },
            "page_hostname" : {
              "type" : "keyword"
            },
            "portfolio" : {
              "type" : "keyword",
              "eager_global_ordinals" : true
            },
            "session_id" : {
              "type" : "keyword"
            },
            "time_on_host" : {
              "type" : "long"
            },
            "visitId" : {
              "type" : "keyword"
            }
          }
        },
        "settings" : {
          "index" : {
            "creation_date" : "1583354059301",
            "number_of_shards" : "5",
            "number_of_replicas" : "1",
            "uuid" : "XSHICZ1gQICzHfbSmBKajA",
            "version" : {
              "created" : "7010199"
            },
            "provided_name" : "kpi_dashboard"
          }
        }
      }
    }

Then I switched to monthly indices like this:

    {
      "kpi-dashboard-traffic-2019-01-00001" : {
        "aliases" : {
          "kpi-dashboard-traffic-read" : { }
        },
        "mappings" : {
          "properties" : {
            "date" : {
              "type" : "date"
            },
            "fullVisitorId" : {
              "type" : "keyword",
              "eager_global_ordinals" : true
            },
            "page_hostname" : {
              "type" : "keyword"
            },
            "portfolio" : {
              "type" : "keyword",
              "eager_global_ordinals" : true
            },
            "session_id" : {
              "type" : "keyword",
              "eager_global_ordinals" : true
            },
            "time_on_host" : {
              "type" : "long"
            },
            "visitId" : {
              "type" : "keyword"
            }
          }
        },
        "settings" : {
          "index" : {
            "number_of_shards" : "1",
            "provided_name" : "kpi-dashboard-traffic-2019-01-00001",
            "creation_date" : "1583998849339",
            "sort" : {
              "field" : [
                "date",
                "portfolio",
                "session_id"
              ],
              "order" : [
                "desc",
                "desc",
                "desc"
              ]
            },
            "number_of_replicas" : "1",
            "uuid" : "T1ZMx5DHSEie8y7qC_OlgA",
            "version" : {
              "created" : "7010199"
            }
          }
        }
      }
    }
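
The monthly indices could be created consistently from a legacy index template along these lines (just a sketch, the template name is made up and eager_global_ordinals is left at its default here):

    PUT _template/kpi-dashboard-traffic
    {
      "index_patterns": ["kpi-dashboard-traffic-*"],
      "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
        "index.sort.field": ["date", "portfolio", "session_id"],
        "index.sort.order": ["desc", "desc", "desc"]
      },
      "mappings": {
        "properties": {
          "date":          { "type": "date" },
          "fullVisitorId": { "type": "keyword" },
          "page_hostname": { "type": "keyword" },
          "portfolio":     { "type": "keyword" },
          "session_id":    { "type": "keyword" },
          "time_on_host":  { "type": "long" },
          "visitId":       { "type": "keyword" }
        }
      },
      "aliases": {
        "kpi-dashboard-traffic-read": {}
      }
    }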

The Elasticsearch cluster is currently running on AWS with one master node and one data node in each of three availability zones (so six servers in total).

Master nodes: r5.large
Data nodes: r5.2xlarge

Thank you very much.

Please have a look at Index Lifecycle Management (ILM). You didn't mention how big the indices/shards have become, but based on your description I guess that the problems start once the shards get too big. Good shard sizes are up to about 25-50 GB per primary shard. The number of primary shards should be chosen so that they can cope with the maximum rate of incoming documents; if a single primary shard struggles to keep up, increase the number of primary shards.

Please also have a look at this blog post, which explains how to size shards correctly.
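
As a starting point, a rollover policy in line with those shard sizes could look roughly like this (the policy name is just an example):

    PUT _ilm/policy/kpi-dashboard-traffic-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_size": "50gb",
                "max_age": "30d"
              }
            }
          }
        }
      }
    }

The indices (or their template) would additionally need index.lifecycle.name and index.lifecycle.rollover_alias set, and the Spark job would have to write through that rollover alias.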

Hi Magnus, thank you very much for your answer. I just checked: the shard sizes seem to be OK, with some room for them to grow.

Do you have any other idea where to look?
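
The shard sizes below are from the cat shards API, i.e. something like:

    GET _cat/shards/kpi_dashboard?v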

    index         shard prirep state   docs     store ip      node
    kpi_dashboard 1     p      STARTED 44531382 8.4gb x.x.x.x dd401a1420dd82bf7fd772de629451ba
    kpi_dashboard 1     r      STARTED 44531382 8.4gb x.x.x.x 9b69f600fc8fba1a6be6515ece16ee7c
    kpi_dashboard 2     r      STARTED 46404871 9.6gb x.x.x.x 98adf36189b641dd13590949b39a4af5
    kpi_dashboard 2     p      STARTED 46404871 9.6gb x.x.x.x 9b69f600fc8fba1a6be6515ece16ee7c
    kpi_dashboard 4     p      STARTED 46431690 9.8gb x.x.x.x 98adf36189b641dd13590949b39a4af5
    kpi_dashboard 4     r      STARTED 46431690 9.8gb x.x.x.x dd401a1420dd82bf7fd772de629451ba
    kpi_dashboard 3     r      STARTED 44206414 8.9gb x.x.x.x dd401a1420dd82bf7fd772de629451ba
    kpi_dashboard 3     p      STARTED 44206414 8.9gb x.x.x.x 9b69f600fc8fba1a6be6515ece16ee7c
    kpi_dashboard 0     p      STARTED 44841125   8gb x.x.x.x 98adf36189b641dd13590949b39a4af5
    kpi_dashboard 0     r      STARTED 44841125   8gb x.x.x.x 9b69f600fc8fba1a6be6515ece16ee7c

Elasticsearch is often limited by the performance of the storage used. Based on your instance types I expect you are using EBS. Are you using PIOPS? What is the size of your data volumes? Have you run iostat on them while under load to see what disk utilisation and iowait look like?

Hi Christian,

you are correct, I use EBS:

EBS volume type - General Purpose (SSD)
EBS volume size - 150 GiB

About the other stuff: as you can see, I currently don't use PIOPS SSDs. I will try to change that and see whether it helps.
About iostat: no, I haven't run that. To be honest, I don't even know how, but I will check it out for sure. Thank you very much for your suggestions.

Standard gp2 EBS volumes get baseline IOPS proportional to their size (3 IOPS per GiB, so about 450 IOPS for a 150 GiB volume). Such a disk can burst to higher levels for short periods, but its sustained performance is quite slow.

Hi Christian,

I changed the EBS volumes to PIOPS as a test and, just for the test, provisioned the maximum possible number of IOPS for my volume size.

But it didn't seem to help. After about a minute the write operations still fail with:

    ERROR NetworkClient: Node [xxxx.es.amazonaws.com:443] failed (java.net.SocketTimeoutException: Read timed out); no other nodes left - aborting...
    org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried

Some Spark stats for this failed task:

    date                       duration   GC time   shuffle read       shuffle remote reads
    2020-04-20 23:18 (UTC+2)   1.0 min    50 ms     16.3 KiB / 1,932   120.4 KiB

So far I couldn't find out how to access iostat on an AWS Elasticsearch Service cluster. :-/

bump... Does nobody have any ideas?

If you are using the AWS Elasticsearch service, it limits what you can see, as you do not have access to the hosts. I would recommend checking with AWS support which monitoring metrics are available. It might also be possible to use the profile API to see what the query is spending its time on.
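
For example, wrapping one of the slow searches in the profile API shows where time is spent per shard; the query and aggregations below are only placeholders, since the real queries haven't been shared:

    GET kpi_dashboard/_search
    {
      "profile": true,
      "size": 0,
      "query": {
        "range": { "date": { "gte": "now-30d/d" } }
      },
      "aggs": {
        "per_portfolio": {
          "terms": { "field": "portfolio" },
          "aggs": {
            "visitors": { "cardinality": { "field": "fullVisitorId" } }
          }
        }
      }
    }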

OK, I have now found out why the writes time out. The problem was the setting "eager_global_ordinals": true, which had been put in place to speed up queries.

But because Elasticsearch was now rebuilding the global ordinals on every refresh while the bulk writes were happening, it basically got DDoS'd by Spark, since it couldn't keep up with the requests Spark was sending.
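
For reference, the setting can be switched off again on the existing fields without a reindex, with a mapping update roughly like this:

    PUT kpi_dashboard/_mapping
    {
      "properties": {
        "fullVisitorId": { "type": "keyword", "eager_global_ordinals": false },
        "portfolio":     { "type": "keyword", "eager_global_ordinals": false }
      }
    }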

But I still have no idea how to make querying the index faster.
