Investigate indexing bottleneck

weibin.wu · May 24, 2017, 8:54am

We would like to index into a cluster as fast as possible. Right now the cluster caps at 11000 documents per second.
All the documents we index look like this:

{
  "_index": "myteksi-changeling_changeling_models_loglings",
  "_type": "changeling/models/logling",
  "_id": "91078043c6f9ab8816e5377cd817594b",
  "_score": 1,
  "_source": {
    "id": "91078043c6f9ab8816e5377cd817594b",
    "klass": "candidate",
    "oid": "2681140321",
    "modified_by": null,
    "modifications": "{\"driver_distance\":[1.46391,3.842],\"lock_version\":[0,1]}",
    "modified_at": "2015-04-25T00:11:33+08:00",
    "modified_fields": [
      "driver_distance",
      "lock_version"
    ]
  }
}

The server that sends the documents is using about 50% CPU.
Network bandwidth is not a concern because everything is set up in AWS.

CPU usage on the ES cluster averages 30%.
We are using SSDs. Write throughput peaks at about 80MB per second.

Are there any metrics I can look at to investigate the bottleneck? Please advise, thanks.

Christian_Dahlqvist · May 24, 2017, 9:04am

It would be useful if you could provide some additional details:

Which version of Elasticsearch are you using?
Which EC2 instance types are you using? How large is the cluster?
How many indices/shards are you actively indexing into?
What is the size of indices and shards being indexed into?
Are indexed documents immutable or updated? If updated, how large a portion of operations are updates?
Are your mappings static or dynamic?
Are you indexing in bulk? If so, what is your bulk size?
How many parallel indexing threads do you have against the cluster?
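
Bulk size and parallelism are usually the easiest knobs to vary in a test. As a minimal sketch (not anyone's actual indexing code), the helper below splits a stream of documents into `_bulk` request bodies of a configurable size, so the same data can be replayed at different batch sizes. The function names `chunked` and `bulk_bodies`, and the index and type names in the usage lines, are made up for illustration; sending the bodies to the cluster is left out.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_bodies(docs, index, doc_type, batch_size):
    """Yield one newline-delimited _bulk body per batch of documents.

    No _id is set in the action line, so Elasticsearch would assign one.
    """
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for batch in chunked(docs, batch_size):
        yield "".join(action + "\n" + json.dumps(doc) + "\n" for doc in batch)


# Hypothetical usage: 5 tiny documents split into batches of 2.
docs = ({"oid": str(i)} for i in range(5))
bodies = list(bulk_bodies(docs, "example-index", "example-type", 2))
print(len(bodies))
```

Each body ends with a newline, which the `_bulk` endpoint requires. Running the same input with several values of `batch_size` and comparing throughput is one way to see whether 20000 documents per request is helping or hurting.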

weibin.wu · May 24, 2017, 10:59am

Which version of Elasticsearch are you using?
5.1

Which EC2 instance types are you using? How large is the cluster?
3 x m4.xlarge

How many indices/shards are you actively indexing into?
Indices are separated by month.
Each monthly index is about 140GB at most, with 5 shards.

What is the size of indices and shards being indexed into?
140GB max with 5 shards

Are indexed documents immutable or updated? If updated, how large a portion of operations are updates?
The indexed documents are all new to the cluster. We insert them with _bulk, 20000 documents per batch.

Are your mappings static or dynamic?
The mapping is static. Here it is:

{
  "aliases": {},
  "mappings": {
    "changeling/models/logling": {
      "properties": {
        "id": {
          "type": "string"
        },
        "klass": {
          "type": "string"
        },
        "modifications": {
          "type": "string"
        },
        "modified_at": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "modified_by": {
          "type": "string"
        },
        "modified_fields": {
          "type": "string",
          "analyzer": "keyword"
        },
        "oid": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "5",
      "refresh_interval": "60s"
    }
  },
  "warmers": {}
}

Are you indexing in bulk? If so, what is your bulk size?
Yes, 20000 documents per batch.

How many parallel indexing threads do you have against the cluster?
I use 3 Linux processes, each allocated to a different CPU.

Christian_Dahlqvist · May 24, 2017, 1:43pm

Do you have monitoring installed? What does GC look like? How many queries per second are you seeing?
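
If no monitoring is installed, the same numbers can be derived by sampling the stats APIs twice and looking at the deltas. The sketch below assumes you have copied three cumulative counters into a flat dict per sample: `index_total` and `index_time_in_millis` from the `indexing` section of the index stats, and the summed `collection_time_in_millis` of the GC collectors from the node stats `jvm.gc.collectors` section. The flat key names, the `summarize` function, and the sample values are hypothetical and only illustrate the arithmetic.

```python
def summarize(sample_a, sample_b, seconds):
    """Turn two cumulative stats samples taken `seconds` apart into rates."""
    docs = sample_b["index_total"] - sample_a["index_total"]
    index_ms = sample_b["index_time_in_millis"] - sample_a["index_time_in_millis"]
    gc_ms = sample_b["gc_time_in_millis"] - sample_a["gc_time_in_millis"]
    return {
        "docs_per_second": docs / seconds,
        # Indexing time is summed across shards and threads, so this is an
        # average cost per document, not wall-clock latency.
        "avg_index_ms_per_doc": index_ms / docs if docs else 0.0,
        # Fraction of the sampling window spent in garbage collection.
        "gc_fraction": gc_ms / (seconds * 1000.0),
    }


# Made-up sample values, 60 seconds apart.
before = {"index_total": 1000000, "index_time_in_millis": 400000, "gc_time_in_millis": 10000}
after = {"index_total": 1660000, "index_time_in_millis": 730000, "gc_time_in_millis": 13000}
report = summarize(before, after, 60)
print(report)
```

A GC fraction that climbs while the indexing rate drops would point at heap pressure. A flat GC fraction with low CPU suggests looking elsewhere, for example at the client side or at how document IDs are assigned.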

weibin.wu · May 25, 2017, 3:23am

As I mentioned, the rate was 10000 indexing operations per second. There were no queries at that time.

Please see monitoring here

Christian_Dahlqvist · May 25, 2017, 6:53am

Based on the sample record, it looks like you are specifying the document ID at the application layer instead of letting Elasticsearch assign one. Is this correct? The way you assign a document ID can have an impact on indexing performance, as Elasticsearch needs to determine whether each operation is an update or a new document. Are you by any chance seeing indexing throughput drop as the monthly index gets larger and then recover once you start a new monthly index? If you are not updating documents, can you let Elasticsearch assign IDs and see if that makes a difference?

weibin.wu · May 25, 2017, 6:58am

I am using _bulk requests to index, like this:
{"index":{"_index":"myteksi-changeling_changeling_models_loglings","_type":"changeling/models/logling","_id":"4bdfa1cdab20cb352cf745db1fbc7cfd"}}
{"id":"4bdfa1cdab20cb352cf745db1fbc7cfd","klass":"rating","oid":"27647823","modified_by":null,"modifications":"{\"current_rating\":[\"0.0\",\"4.58765\"]}","modified_at":"2016-02-14T03:12:17+08:00","modified_fields":["current_rating"]}

If I want to let ES decide the _id, should I remove the _id field like this?
{"index":{"_index":"myteksi-changeling_changeling_models_loglings","_type":"changeling/models/logling"}}
{"klass":"rating","oid":"27647823","modified_by":null,"modifications":"{\"current_rating\":[\"0.0\",\"4.58765\"]}","modified_at":"2016-02-14T03:12:17+08:00","modified_fields":["current_rating"]}

Christian_Dahlqvist · May 25, 2017, 7:04am

If you do not specify an id Elasticsearch will assign one, so that looks correct.
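
For reference, here is a minimal sketch of generating those two lines programmatically, with the `_id` left out unless one is passed in. The helper name `bulk_lines` is made up for illustration; the field values are taken from the example above. Using `json.dumps` also takes care of escaping the quotes inside the `modifications` string.

```python
import json


def bulk_lines(index, doc_type, source, doc_id=None):
    """Return the action line and source line for one bulk "index" operation."""
    meta = {"_index": index, "_type": doc_type}
    if doc_id is not None:
        meta["_id"] = doc_id  # leave out to let Elasticsearch generate the ID
    return json.dumps({"index": meta}) + "\n" + json.dumps(source) + "\n"


doc = {
    "klass": "rating",
    "oid": "27647823",
    "modified_by": None,
    # Nested JSON stored as a string; the outer json.dumps escapes its quotes.
    "modifications": json.dumps({"current_rating": ["0.0", "4.58765"]}),
    "modified_at": "2016-02-14T03:12:17+08:00",
    "modified_fields": ["current_rating"],
}
payload = bulk_lines(
    "myteksi-changeling_changeling_models_loglings",
    "changeling/models/logling",
    doc,
)
print(payload, end="")
```

Passing `doc_id="4bdfa1cdab20cb352cf745db1fbc7cfd"` would reproduce the original form of the request, which makes it easy to compare both variants in the same test.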

weibin.wu · May 25, 2017, 7:06am

Okay, thanks. I will run a test to see whether it's faster.

system · June 22, 2017, 7:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
