ES 2.1.0 bulk API throws ArrayIndexOutOfBoundsException

I use BulkProcessor for indexing with a bulk size of 5M. It works very well with ES 1.7.3, but after upgrading to ES 2.1.0 it throws exceptions like the one below:

[2016-01-08 17:45:19,733][DEBUG][action.bulk ] [mybox] [mgindex0][3] failed to execute bulk item (index) index {[mgindex0][95090eb1-9948-4b21-868a-fc3389b34b6a][95090eb1-1f78-467e-a600-1ce540f27ec0], source[{"ea312e4e-48b8-4c5a-87c6-59d7fe0d9970":"买家","808ff715-f215-4300-bc47-6d29c53a1945|":70589.0,"e8238304-703f-4d89-b543-af994886393f|":"时尚雪地靴","381cbca0-71ba-434e-b624-93d2a732f427":25062702,"ea312e4e-48b8-4c5a-87c6-59d7fe0d9970|":"买家","4c2bc039-0ace-4713-a988-cafd05eff5b9|":"女鞋","label":"paipai","createdon":1452243612892,"4c2bc039-0ace-4713-a988-cafd05eff5b9":"女鞋","createdby":1,"datasource":"bf0bef91-88ea-4c26-acbf-ca1f38976aef","808ff715-f215-4300-bc47-6d29c53a1945":70589.0,"381cbca0-71ba-434e-b624-93d2a732f427|":25062702,"e8238304-703f-4d89-b543-af994886393f":"时尚雪地靴"}]}
java.lang.ArrayIndexOutOfBoundsException: -2097153
at org.apache.lucene.util.BytesRefHash.rehash(BytesRefHash.java:419)
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:323)
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:150)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:661)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1254)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:539)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:468)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:571)
at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1073)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:338)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:131)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:579)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:452)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-01-08 17:45:19,789][DEBUG][index.translog ] [mybox] [mgindex0][3] translog closed
[2016-01-08 17:45:19,789][DEBUG][index.engine ] [mybox] [mgindex0][3] engine closed [engine failed on: [already closed by tragic event]]
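
For reference, this is roughly how the BulkProcessor is set up (a minimal sketch, not our exact code: the class and method names are made up, the listener bodies are placeholders, and "5M" is read here as 5 megabytes):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

public class BulkSetup {

    // Flush a bulk request whenever about 5 MB of index requests have been buffered.
    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // nothing special before a flush
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // per-item failures like the one logged above show up via response.hasFailures()
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // the whole bulk request failed
            }
        })
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
        .setConcurrentRequests(1)
        .build();
    }
}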

Is there something wrong?
Thanks for your advice.

With the default settings, I am not able to reproduce this issue. Can you please share any relevant settings and the mapping for the index mgindex0?

Here is my attempt to reproduce:

$ curl -XGET localhost:9200/

{
  "name" : "Corsair",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.1.0",
    "build_hash" : "72cd1f1a3eee09505e036106146dc1949dc5dc87",
    "build_timestamp" : "2015-11-18T22:40:03Z",
    "build_snapshot" : false,
    "lucene_version" : "5.3.1"
  },
  "tagline" : "You Know, for Search"
}

$ curl -XDELETE localhost:9200/mgindex0?pretty=1
{
  "acknowledged" : true
}

$ curl -XPOST localhost:9200/_bulk?pretty=1 -d '
{ "index": { "_index" : "mgindex0", "_type" : "95090eb1-9948-4b21-868a-fc3389b34b6a", "_id" : "95090eb1-1f78-467e-a600-1ce540f27ec0" } }
{ "ea312e4e-48b8-4c5a-87c6-59d7fe0d9970":"买家", "808ff715-f215-4300-bc47-6d29c53a1945|":70589.0, "e8238304-703f-4d89-b543-af994886393f|":"时尚雪地靴", "381cbca0-71ba-434e-b624-93d2a732f427":25062702, "ea312e4e-48b8-4c5a-87c6-59d7fe0d9970|":"买家", "4c2bc039-0ace-4713-a988-cafd05eff5b9|":"女鞋", "label":"paipai", "createdon":1452243612892, "4c2bc039-0ace-4713-a988-cafd05eff5b9":"女鞋", "createdby":1, "datasource":"bf0bef91-88ea-4c26-acbf-ca1f38976aef", "808ff715-f215-4300-bc47-6d29c53a1945":70589.0, "381cbca0-71ba-434e-b624-93d2a732f427|":25062702, "e8238304-703f-4d89-b543-af994886393f":"时尚雪地靴"}
'
{
  "took" : 54,
  "errors" : false,
  "items" : [ {
    "index" : {
      "_index" : "mgindex0",
      "_type" : "95090eb1-9948-4b21-868a-fc3389b34b6a",
      "_id" : "95090eb1-1f78-467e-a600-1ce540f27ec0",
      "_version" : 1,
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "status" : 201
    }
  } ]
}

@jasontedor I think that to reproduce the issue, bulk data with certain characteristics must be pushed under heavy load. It looks like a bug in Lucene's rehash() method.

@James_Kelvin my suggestion is to increase the number of shards or the heap size; that may work around the issue.
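For example, a workaround along those lines could be to recreate the index with more primary shards and give the node a bigger heap (the values below are only illustrations, not recommendations):

$ curl -XPUT localhost:9200/mgindex0 -d '
{
  "settings" : {
    "index" : {
      "number_of_shards" : 10
    }
  }
}'

$ ES_HEAP_SIZE=8g bin/elasticsearch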
Another suggestion is to check whether fewer field names are possible. It looks like there are a lot of different field names that are UUID keys or some other unique coding. Can you show the mapping? Unique codings are fine as field values, but with a large number of distinct field names this can get challenging.
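To illustrate "unique codings as field values": instead of one field per UUID, documents could carry generic key/value entries so the UUID becomes a value rather than a field name (a rough sketch only; the field names "fields", "key", "string_value" and "double_value" are invented, and "fields" would usually be mapped as nested):

{
  "fields" : [
    { "key" : "e8238304-703f-4d89-b543-af994886393f", "string_value" : "时尚雪地靴" },
    { "key" : "808ff715-f215-4300-bc47-6d29c53a1945", "double_value" : 70589.0 }
  ],
  "label" : "paipai",
  "createdon" : 1452243612892,
  "createdby" : 1
}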
It would be nice if you could somehow prepare a test data set for bulk indexing or describe the data set you index, so the issue can be reproduced more easily.

@jprante
@jasontedor
Thanks for your advice. Here is the mapping:
{
  "mgindex0" : {
    "mappings" : {
      "95090eb1-9948-4b21-868a-fc3389b34b6a" : {
        "properties" : {
          "381cbca0-71ba-434e-b624-93d2a732f427" : {
            "type" : "integer"
          },
          "381cbca0-71ba-434e-b624-93d2a732f427|" : {
            "type" : "integer",
            "copy_to" : [ "info" ]
          },
          "4c2bc039-0ace-4713-a988-cafd05eff5b9" : {
            "type" : "string"
          },
          "4c2bc039-0ace-4713-a988-cafd05eff5b9|" : {
            "type" : "string",
            "index" : "not_analyzed",
            "copy_to" : [ "info" ]
          },
          "808ff715-f215-4300-bc47-6d29c53a1945" : {
            "type" : "double"
          },
          "808ff715-f215-4300-bc47-6d29c53a1945|" : {
            "type" : "double",
            "copy_to" : [ "info" ]
          },
          "confidence" : {
            "type" : "integer"
          },
          "createdby" : {
            "type" : "integer"
          },
          "createdon" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "datasource" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "e8238304-703f-4d89-b543-af994886393f" : {
            "type" : "string"
          },
          "e8238304-703f-4d89-b543-af994886393f|" : {
            "type" : "string",
            "index" : "not_analyzed",
            "copy_to" : [ "info" ]
          },
          "ea312e4e-48b8-4c5a-87c6-59d7fe0d9970" : {
            "type" : "string"
          },
          "ea312e4e-48b8-4c5a-87c6-59d7fe0d9970|" : {
            "type" : "string",
            "index" : "not_analyzed",
            "copy_to" : [ "info" ]
          },
          "etime" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "info" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "label" : {
            "type" : "string",
            "copy_to" : [ "info" ]
          },
          "location" : {
            "type" : "geo_point",
            "fielddata" : {
              "format" : "compressed",
              "precision" : "1cm"
            }
          },
          "mugshot" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "stime" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          }
        }
      }
    }
  }
}

1: Field names are UUIDs.
(1) The raw UUID is the analyzed field.
(2) The UUID + "|" is the not_analyzed field.
We did not use multi-fields, but I think that does not matter (a multi-field equivalent is sketched below).
2: All analyzed fields are copied to the "info" field.
Our app needs the "info" field for a special query.
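
For comparison, the multi-field version of one of these fields would look roughly like this in the mapping (a sketch only, not what we actually use; the sub-field name "raw" is invented, and copy_to is left out for brevity):

"e8238304-703f-4d89-b543-af994886393f" : {
  "type" : "string",
  "fields" : {
    "raw" : {
      "type" : "string",
      "index" : "not_analyzed"
    }
  }
}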

@jprante
To add a description of our data set: we need to index a lot of database data (MySQL, Oracle, DB2, ...) into our data platform, so there are many fields and the field names must be unique. That is why we use UUIDs as field names.
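
If it helps with reproducing the problem, a data set with the same shape can be generated with something like the script below (a rough sketch; the document count and field values are arbitrary, only the UUID-plus-"|" field-name pattern matters):

#!/bin/bash
# Generate bulk data shaped like ours: a fixed set of UUID field names,
# each present both analyzed (raw UUID) and not_analyzed (UUID + "|").
TYPE=$(uuidgen)
F1=$(uuidgen)   # string field
F2=$(uuidgen)   # numeric field

for i in $(seq 1 10000); do
  echo "{ \"index\": { \"_index\": \"mgindex0\", \"_type\": \"$TYPE\", \"_id\": \"$(uuidgen)\" } }"
  echo "{ \"$F1\": \"text value $i\", \"$F1|\": \"text value $i\", \"$F2\": $i, \"$F2|\": $i, \"label\": \"paipai\", \"createdby\": 1, \"createdon\": $(date +%s)000 }"
done > bulk.json

# One big request here for simplicity; our application pushes the same data
# through BulkProcessor in roughly 5M chunks under heavy load.
curl -s -XPOST localhost:9200/_bulk --data-binary @bulk.json > /dev/null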