Which EC2 instance types are you using? How large is the cluster?
3 x m4.xlarge
How many indices/shards are you actively indexing into?
Indices are separated by month.
Each monthly index is at most about 140GB, with 5 shards.
What is the size of indices and shards being indexed into?
140GB max with 5 shards
Are indexed documents immutable or updated? If updated, how large a portion of operations are updates?
Every indexed document is new to the cluster; we do not update existing documents. We insert with _bulk, 20000 documents per batch.
Are your mappings static or dynamic?
The mapping is static; here is the mapping.
Are you indexing in bulk? If so, what is your bulk size?
Yes, 20000 documents per batch.
How many parallel indexing threads do you have against the cluster?
I run 3 Linux processes, each allocated to a different CPU.
Based on the sample record, it looks like you are specifying the document ID at the application layer instead of letting Elasticsearch assign one. Is this correct? The way you assign document IDs can affect indexing performance, because Elasticsearch needs to determine whether each request is an update or a new document. Are you by any chance seeing indexing throughput drop as the monthly index gets larger and then recover once you start a new monthly index? If you are not updating documents, can you let Elasticsearch assign IDs and see if that makes a difference?
Yes, I set the ID myself. I am using _bulk requests to index, like this:
{"index":{"_index":"myteksi-changeling_changeling_models_loglings","_type":"changeling/models/logling","_id":"4bdfa1cdab20cb352cf745db1fbc7cfd"}}
{"id":"4bdfa1cdab20cb352cf745db1fbc7cfd","klass":"rating","oid":"27647823","modified_by":null,"modifications":"{\"current_rating\":[\"0.0\",\"4.58765\"]}","modified_at":"2016-02-14T03:12:17+08:00","modified_fields":["current_rating"]}
If I want to let ES decide the _id, should I remove the _id field like this?
{"index":{"_index":"myteksi-changeling_changeling_models_loglings","_type":"changeling/models/logling"}}
{"klass":"rating","oid":"27647823","modified_by":null,"modifications":"{\"current_rating\":[\"0.0\",\"4.58765\"]}","modified_at":"2016-02-14T03:12:17+08:00","modified_fields":["current_rating"]}
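For reference, here is a rough sketch of how the two bulk bodies above could be generated in Python. When the action line carries no _id, Elasticsearch generates one itself. The helper name bulk_body and its keep_id flag are made up for illustration and are not part of any Elasticsearch client. The sketch assumes each record is a dict with an "id" key, as in the sample, and it only builds the newline-delimited request body; it does not send anything to the cluster.

```python
import json


def bulk_body(docs, index, doc_type, keep_id=False):
    """Build the newline-delimited body of a _bulk request.

    Hypothetical helper: with keep_id=False the action line has no "_id",
    so Elasticsearch assigns one; with keep_id=True the record's own "id"
    is sent as "_id", as in the first example above.
    """
    lines = []
    for doc in docs:
        action = {"_index": index, "_type": doc_type}
        source = dict(doc)  # copy so the caller's record is not modified
        if keep_id:
            action["_id"] = source["id"]
        else:
            # Mirror the second example: drop the application-level id.
            source.pop("id", None)
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


record = {
    "id": "4bdfa1cdab20cb352cf745db1fbc7cfd",
    "klass": "rating",
    "oid": "27647823",
    "modified_by": None,
    "modifications": json.dumps({"current_rating": ["0.0", "4.58765"]}),
    "modified_at": "2016-02-14T03:12:17+08:00",
    "modified_fields": ["current_rating"],
}

body = bulk_body(
    [record],
    index="myteksi-changeling_changeling_models_loglings",
    doc_type="changeling/models/logling",
)
print(body)
```

Building the lines with json.dumps also takes care of escaping the quotes inside the "modifications" string, which is easy to get wrong when writing the body by hand.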