ES (0.17.8) BulkIndex Errors: PDF / HTML Docs

Hello All -

I'm new to Elasticsearch (and Lucene-based search) and am trying to
use the bulk indexer from Java.

I went about it in the following steps:

  1. Creating a TransportClient (to localhost).
  2. Creating indexes dynamically from a set of directory names
     (where the PDF docs live).
  3. Building a Map<String, Map<String, Object>>, keyed by file name,
     where each inner map carries the file's InputStream as a value.


Then I call the bulk indexer (where esClient is a TransportClient):

brb = esClient.prepareBulk();
FileComposer fileComposer = new FileComposer();
Map<String, Map<String, Object>> documentMap =
        fileComposer.getDocumentsToIndex(fileList_);
if (documentMap.size() > 0) {
    for (String docUUID : documentMap.keySet()) {
        docUUID = UUID.randomUUID().toString();
        brb.add(esClient.prepareIndex(indexName, docUUID)
                .setSource(XContentFactory.jsonBuilder()
                        .map(documentMap.get(docUUID))));
    }
    brb.execute().actionGet();
}


Everything appears to work - no exceptions on the client side. I log
progress as the code walks through each directory, and I catch
exceptions. However, when I look at the server log I see the errors
below.
Please note: I also log whether each index already exists, based on
the error generated when creating an index (named after a directory)
that was already created.


Errors and code snippets in Google Docs:

Errors: http://preview.tinyurl.com/Errors-ES-MDR

Code snippets: http://preview.tinyurl.com/CodeSnippets-ES-MDR
(mirror: http://tinyurl.com/CodeSnippets-ES-MDR)


My index names (taken from the directory names) look like
'lang.java', 'lang.scala', etc.
Could the index name be an issue by any chance?
Index creation works fine (as long as the names are lowercase) and the
status updates come through, but these are the errors (linked above in
detail) I get:


[2011-10-17 12:57:05,164][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: initializing ...
[2011-10-17 12:57:05,186][INFO ][plugins ] [Wong]
loaded [], sites []
[2011-10-17 12:57:08,009][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: initialized
[2011-10-17 12:57:08,009][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: starting ...
[2011-10-17 12:57:08,137][INFO ][transport ] [Wong]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
192.168.1.96:9300]}
[2011-10-17 12:57:11,212][INFO ][cluster.service ] [Wong]
new_master [Wong][2oDJkNUqTRGieeEPJLQnTQ][inet[/192.168.1.96:9300]],
reason: zen-disco-join (elected_as_master)
[2011-10-17 12:57:11,282][INFO ][discovery ] [Wong]
cluster.ES.MDR/2oDJkNUqTRGieeEPJLQnTQ
[2011-10-17 12:57:11,295][INFO ][http ] [Wong]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
192.168.1.96:9200]}
[2011-10-17 12:57:11,330][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: started
[2011-10-17 12:57:11,332][INFO ][gateway ] [Wong]
recovered [0] indices into cluster_state
[2011-10-17 12:57:22,767][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: stopping ...
[2011-10-17 12:57:22,791][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: stopped
[2011-10-17 12:57:22,792][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: closing ...
[2011-10-17 12:57:22,806][INFO ][node ] [Wong]
{elasticsearch/0.17.8}[24575]: closed
[2011-10-17 12:57:27,376][INFO ][node ] [Gavel]
{elasticsearch/0.17.8}[24646]: initializing ...
[2011-10-17 12:57:27,382][INFO ][plugins ] [Gavel]
loaded [], sites []
[2011-10-17 12:57:29,047][INFO ][node ] [Gavel]
{elasticsearch/0.17.8}[24646]: initialized
[2011-10-17 12:57:29,047][INFO ][node ] [Gavel]
{elasticsearch/0.17.8}[24646]: starting ...
[2011-10-17 12:57:29,103][INFO ][transport ] [Gavel]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
192.168.1.96:9300]}
[2011-10-17 12:57:32,145][INFO ][cluster.service ] [Gavel]
new_master [Gavel][gMOx2v-5QcGk6mq6Fzeqgw][inet[/192.168.1.96:9300]],
reason: zen-disco-join (elected_as_master)
[2011-10-17 12:57:32,198][INFO ][discovery ] [Gavel]
cluster.ES.MDR/gMOx2v-5QcGk6mq6Fzeqgw
[2011-10-17 12:57:32,212][INFO ][http ] [Gavel]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
192.168.1.96:9200]}
[2011-10-17 12:57:32,213][INFO ][node ] [Gavel]
{elasticsearch/0.17.8}[24646]: started
[2011-10-17 12:57:32,218][INFO ][gateway ] [Gavel]
recovered [0] indices into cluster_state
[2011-10-17 12:59:54,850][INFO ][cluster.metadata ] [Gavel]
[framework.soa] creating index, cause [api], shards [5]/[1], mappings
[]
[2011-10-17 13:00:07,619][DEBUG][action.bulk ] [Gavel]
[framework.soa][0] failed to bulk item (index) index {[framework.soa]
[a839eabb-7005-4c94-9659-b399d05dc951][uTGTFr7fSk6cXFFk60Y1Lg],
source[null]}
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=4): [110, 117, 108, 108]
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:181)
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:172)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:512)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:269)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:136)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:464)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:377)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
[2011-10-17 13:00:07,619][DEBUG][action.bulk ] [Gavel]
[framework.soa][4] failed to bulk item (index) index {[framework.soa]
[e4dcd21c-2bde-437b-ba00-b6d4b184fef8][qeH_IgjmRX-qv5ucx2_0hA],
source[null]}
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=4): [110, 117, 108, 108]
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:181)
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:172)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:512)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:269)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:136)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:464)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:377)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)


I have tried to detail it as much as possible. Please let me know if
you need anything else.

Thanks for your help in resolving this.

Regards.

Monosij

I think you are trying to push an empty doc (offset = 0, length = 4) -
or am I wrong?
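For what it's worth, the four bytes in the exception message spell it out: [110, 117, 108, 108] decoded as ASCII is the string "null", which matches the source[null] in the log - the client serialized a null source. A quick check:

```java
import java.nio.charset.StandardCharsets;

public class DecodeBytes {
    public static void main(String[] args) {
        // The byte values from the ElasticSearchParseException message.
        byte[] data = {110, 117, 108, 108};
        String decoded = new String(data, StandardCharsets.US_ASCII);
        System.out.println(decoded); // prints: null
    }
}
```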

The xcontent type is missing - this is the check that throws:

if (type == null) {
    throw new ElasticSearchParseException("Failed to derive xcontent from (offset="
            + offset + ", length=" + length + "): " + Arrays.toString(data));
}
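The source[null] also matches a bug that is visible in the posted loop: docUUID starts out as a key of documentMap, but it is immediately overwritten with a fresh random UUID, so documentMap.get(docUUID) never finds anything. A minimal sketch of that effect with plain collections (no Elasticsearch involved; the class and map contents are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class OverwrittenKeyDemo {
    // Mirrors the loop from the post: the loop variable is clobbered with a
    // random UUID before the lookup, so get() can never hit a real key.
    static Object lookupAfterOverwrite(Map<String, Map<String, Object>> documentMap) {
        for (String docUUID : documentMap.keySet()) {
            docUUID = UUID.randomUUID().toString(); // overwrites the real key
            return documentMap.get(docUUID);        // no such key -> null
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> docs = new HashMap<String, Map<String, Object>>();
        Map<String, Object> fields = new HashMap<String, Object>();
        fields.put("file", "report.pdf");
        docs.put("report.pdf", fields);
        System.out.println(lookupAfterOverwrite(docs)); // prints: null
    }
}
```

Generating the UUID into a separate variable and keeping the original key for the map lookup would fix the null source; passing the UUID as the document id with a real type name, via the three-argument prepareIndex(index, type, id), would fix the missing type as well.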

On 17 Oct., 19:30, "m.dr" monosij.for...@gmail.com wrote:
