ElasticSearch (0.17.6) BulkIndexer: PDF / HTML Docs

Hello All -

I am new to ElasticSearch (and Lucene-based search) and am trying to use the BulkIndexer from Java.

I went about it in the following steps:

  1. Creating a TransportClient (to localhost).
  2. Creating indexes dynamically based on a set of directory names (where pdf docs are).
  3. Creating a Map<String, Map<String, Object>> of documents; in each inner map, the String is the name of the file and the Object is the InputStream to the file.


Then I call the BulkIndexer (where esClient is a TransportClient):
brb = esClient.prepareBulk();
FileComposer fileComposer = new FileComposer();
Map<String, Map<String, Object>> documentMap = fileComposer.getDocumentsToIndex(fileList_);
if (documentMap.size() > 0) {
    for (String docUUID: documentMap.keySet()) {
        docUUID = UUID.randomUUID().toString();
        brb.add(esClient.prepareIndex(indexName, docUUID)
                .setSource(XContentFactory.jsonBuilder().map(documentMap.get(docUUID))));
    }
    brb.execute().actionGet();
}
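
One thing that may be worth checking in the loop above: docUUID is reassigned to a fresh random UUID before documentMap.get(docUUID) is called, so the lookup uses the new value rather than the original key. The following is only a minimal sketch using plain java.util collections (the class name, method name, and sample keys are hypothetical, and no ElasticSearch calls are involved) of what such a lookup returns:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class KeyReassignmentSketch {

    // Mirrors the shape of the loop: the key variable is overwritten
    // with a random UUID before it is used for the map lookup.
    static int countNullLookups(Map<String, Map<String, Object>> documentMap) {
        int nullLookups = 0;
        for (String docUUID : documentMap.keySet()) {
            docUUID = UUID.randomUUID().toString();   // original key is lost here
            if (documentMap.get(docUUID) == null) {   // lookup by the random value
                nullLookups++;
            }
        }
        return nullLookups;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the composed documents.
        Map<String, Map<String, Object>> documentMap = new HashMap<String, Map<String, Object>>();
        documentMap.put("a.pdf", new HashMap<String, Object>());
        documentMap.put("b.pdf", new HashMap<String, Object>());

        System.out.println(countNullLookups(documentMap) + " of "
                + documentMap.size() + " lookups returned null");
    }
}
```

If the same thing happens in the real loop, each bulk item would be built from a null map, which might be related to the source[null] entries in the log.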


The client code runs without throwing any errors. I log a message as it goes through each directory, and I catch exceptions. However, when I look at the ElasticSearch log I see the errors below.
Please note: my log messages record whether an index already exists, based on the error generated when I attempt to create a new index named after the directory.

My index names (taken from directory names) are in the following format: 'lang.java', 'lang.scala', etc.
Is the index name an issue, by any chance?
The indexes are created fine (as long as the names are lowercase) and go through status updates, but here are the errors I get:


[2011-10-16 21:26:41,018][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: initializing ...
[2011-10-16 21:26:41,152][INFO ][plugins ] [Stane, Ezekiel] loaded [mapper-attachments], sites []
[2011-10-16 21:26:43,965][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: initialized
[2011-10-16 21:26:43,965][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: starting ...
[2011-10-16 21:26:44,159][INFO ][transport ] [Stane, Ezekiel] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.96:9300]}
[2011-10-16 21:26:47,227][INFO ][cluster.service ] [Stane, Ezekiel] new_master [Stane, Ezekiel][vG7-i-ZTRhCNu5jBj-bkzQ][inet[/192.168.1.96:9300]], reason: zen-disco-join (elected_as_master)
[2011-10-16 21:26:47,302][INFO ][discovery ] [Stane, Ezekiel] mdrCluster/vG7-i-ZTRhCNu5jBj-bkzQ
[2011-10-16 21:26:47,579][INFO ][http ] [Stane, Ezekiel] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.96:9200]}
[2011-10-16 21:26:47,580][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: started
[2011-10-16 21:26:48,266][INFO ][gateway ] [Stane, Ezekiel] recovered [21] indices into cluster_state
[2011-10-16 21:27:16,302][DEBUG][action.bulk ] [Stane, Ezekiel] [framework.soa][3] failed to bulk item (index) index {[framework.soa][c464405c-c080-4b51-ac46-e7bd62b698ef][vGEfhWTWSp-ZlDqbg0zbmQ], source[null]}
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=4): [110, 117, 108, 108]
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:181)
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:172)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:512)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:269)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:135)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:428)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:341)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
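
For reference, the four bytes reported in the exception message ([110, 117, 108, 108]) can be read as ASCII text. A minimal sketch of decoding them, using only the JDK (the class and method names here are hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class DecodeSourceBytes {

    // Interprets the raw bytes from the log message as ASCII characters.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Byte values copied from the ElasticSearchParseException message.
        byte[] raw = {110, 117, 108, 108};
        System.out.println(decode(raw)); // prints "null"
    }
}
```

So the document source that reached the server appears to be the literal text "null" rather than a JSON object, which is consistent with the source[null] shown in the DEBUG line.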


I have tried to describe the problem in as much detail as possible. Please let me know if you need anything else.

Thanks for your help in resolving this.

Regards.

Monosij