Investigate indexing bottleneck

We would like to index into a cluster as fast as possible. Right now the cluster caps out at about 11000 documents per second.
All the documents we index look like this:

{
  "_index":"myteksi-changeling_changeling_models_loglings",
  "_type":"changeling/models/logling",
  "_id":"91078043c6f9ab8816e5377cd817594b",
  "_score":1,
  "_source":{
    "id":"91078043c6f9ab8816e5377cd817594b",
    "klass":"candidate",
    "oid":"2681140321",
    "modified_by":null,
    "modifications":"{\"driver_distance\":[1.46391,3.842],\"lock_version\":[0,1]}",
    "modified_at":"2015-04-25T00:11:33+08:00",
    "modified_fields":[
      "driver_distance",
      "lock_version"
    ]
  }
}

The server that sends the documents is using about 50% CPU.
Network bandwidth is not a concern because everything is set up in AWS.

Average CPU usage of the ES cluster is 30%.
We are using SSDs. Write throughput peaks at about 80 MB per second.

Are there any metrics I can look at to investigate the bottleneck? Please advise, thanks.

It would be useful if you could provide some additional details:

  • Which version of Elasticsearch are you using?
  • Which EC2 instance types are you using? How large is the cluster?
  • How many indices/shards are you actively indexing into?
  • What is the size of indices and shards being indexed into?
  • Are indexed documents immutable or updated? If updated, how large a portion of operations are updates?
  • Are your mappings static or dynamic?
  • Are you indexing in bulk? If so, what is your bulk size?
  • How many parallel indexing threads do you have against the cluster?

Which version of Elasticsearch are you using?
5.1

Which EC2 instance types are you using? How large is the cluster?
3 x m4.xlarge

How many indices/shards are you actively indexing into?
Indices are separated by month.
Each monthly index is at most about 140GB, with 5 shards.

What is the size of indices and shards being indexed into?
140GB max with 5 shards

Are indexed documents immutable or updated? If updated, how large a portion of operations are updates?
The indexed documents are all new to the cluster. We insert via _bulk with 20000 documents per batch.

Are your mappings static or dynamic?
The mapping is static. Here is the mapping:

{
"aliases": {

},
"mappings": {
  "changeling/models/logling": {
    "properties": {
      "id": {
        "type": "string"
      },
      "klass": {
        "type": "string"
      },
      "modifications": {
        "type": "string"
      },
      "modified_at": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "modified_by": {
        "type": "string"
      },
      "modified_fields": {
        "type": "string",
        "analyzer": "keyword"
      },
      "oid": {
        "type": "string"
      }
    }
  }
},
"settings": {
  "index": {
    "number_of_replicas": "1",
    "number_of_shards": "5",
    "refresh_interval": "60s"
  }
},
"warmers": {
  
}

}

Are you indexing in bulk? If so, what is your bulk size?
Yes, 20000 documents per batch.

How many parallel indexing threads do you have against the cluster?
I use 3 processes on Linux, each allocated to a different CPU.
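
Bulk size and the number of parallel senders are the two knobs discussed above, so it can help to make the batch size easy to vary when testing. The following is a minimal Python sketch, not the poster's actual client: the helper name `batches` and the sizes used are hypothetical, chosen only for illustration.

```python
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents from any iterable.

    `size` is an experiment parameter (hypothetical); the thread uses 20000.
    """
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in documents; a real run would read them from the application.
    docs = ({"oid": str(n)} for n in range(45000))
    for i, chunk in enumerate(batches(docs, 20000)):
        # Each chunk would become one _bulk request.
        print(i, len(chunk))
```

Each worker process could consume its own stream of batches, which makes it straightforward to compare throughput at different batch sizes and worker counts.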

Do you have monitoring installed? What does GC look like? How many queries per second are you seeing?

As I mentioned, it was indexing about 10000 documents per second. There were no queries at that time.

Please see monitoring here

Based on the sample record, it looks like you are specifying the document ID at the application layer instead of letting Elasticsearch assign one. Is this correct? The way you assign a document ID can have an impact on indexing performance, as Elasticsearch needs to determine whether each operation is an update or a new document. Are you by any chance seeing indexing throughput drop as the monthly index gets larger and then recover once you start a new monthly index? If you are not updating documents, can you let Elasticsearch assign IDs and see if that makes a difference?

I am using _bulk requests to index, like this:
{"index":{"_index":"myteksi-changeling_changeling_models_loglings","_type":"changeling/models/logling","_id":"4bdfa1cdab20cb352cf745db1fbc7cfd"}}
{"id":"4bdfa1cdab20cb352cf745db1fbc7cfd","klass":"rating","oid":"27647823","modified_by":null,"modifications":"{\"current_rating\":[\"0.0\",\"4.58765\"]}","modified_at":"2016-02-14T03:12:17+08:00","modified_fields":["current_rating"]}

If I want to let ES decide the _id, should I remove the _id field like this?
{"index":{"_index":"myteksi-changeling_changeling_models_loglings","_type":"changeling/models/logling"}}
{"klass":"rating","oid":"27647823","modified_by":null,"modifications":"{\"current_rating\":[\"0.0\",\"4.58765\"]}","modified_at":"2016-02-14T03:12:17+08:00","modified_fields":["current_rating"]}

If you do not specify an id Elasticsearch will assign one, so that looks correct.
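
For comparing the two variants, here is a rough Python sketch of how such a _bulk body could be generated with and without an explicit _id. The function name `bulk_body` and its `id_field` parameter are hypothetical and not part of any Elasticsearch client; the sketch only produces the newline-delimited text and leaves sending it to whatever HTTP client is already in use.

```python
import json


def bulk_body(docs, index, doc_type, id_field=None):
    """Build a newline-delimited _bulk body of "index" actions.

    If id_field is None, no _id is put in the action line, so
    Elasticsearch assigns one. Otherwise _id is taken from doc[id_field].
    """
    lines = []
    for doc in docs:
        meta = {"_index": index, "_type": doc_type}
        if id_field is not None:
            meta["_id"] = doc[id_field]
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(doc))              # source line
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [{"klass": "rating", "oid": "27647823", "modified_by": None}]
    print(bulk_body(docs,
                    "myteksi-changeling_changeling_models_loglings",
                    "changeling/models/logling"))
```

Passing `id_field="id"` reproduces the original behaviour, so the same data can be indexed both ways and the throughput compared.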

Okay, thanks. I will run a test to see whether it's faster.
