ElasticSearch (0.17.6) BulkIndexer: PDF / HTML Docs

m_dr · October 17, 2011, 4:19pm

Hello All -

I am new to ElasticSearch (and Lucene-based searching) and am trying to use the BulkIndexer from Java.

I went about it in the following steps:

1. Creating a TransportClient (to localhost).
2. Creating indexes dynamically based on a set of directory names (where the PDF docs are).
3. Creating a Map of Map<String, Object>, where the String is the name of the file and the Object is the InputStream to the file.

Then I call the BulkIndexer (where esClient is a TransportClient):
    brb = esClient.prepareBulk();
    FileComposer fileComposer = new FileComposer();
    Map<String, Map<String, Object>> documentMap = fileComposer.getDocumentsToIndex(fileList_);
    if (documentMap.size() > 0) {
        for (String docUUID : documentMap.keySet()) {
            docUUID = UUID.randomUUID().toString();
            brb.add(esClient.prepareIndex(indexName, docUUID)
                    .setSource(XContentFactory.jsonBuilder().map(documentMap.get(docUUID))));
        }
        brb.execute().actionGet();
    }
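
One thing worth checking in the loop above: docUUID is overwritten with a fresh random UUID before documentMap.get(docUUID) runs, so the lookup uses a key that was never put into the map. The sketch below reproduces only that lookup pattern with plain java.util classes. The class name, the method names, and the sample file names are hypothetical, and no ElasticSearch classes are involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class KeyOverwriteDemo {

    // Hypothetical stand-in for the documentMap in the post: file name -> document fields.
    static Map<String, Map<String, Object>> sampleDocuments() {
        Map<String, Map<String, Object>> documentMap = new LinkedHashMap<>();
        for (String name : new String[] {"intro.pdf", "guide.html"}) {
            Map<String, Object> fields = new LinkedHashMap<>();
            fields.put("file", name);
            documentMap.put(name, fields);
        }
        return documentMap;
    }

    // Same pattern as the posted loop: the key is replaced before the lookup.
    static int nullLookupsWhenKeyIsOverwritten(Map<String, Map<String, Object>> documentMap) {
        int nulls = 0;
        for (String docUUID : documentMap.keySet()) {
            docUUID = UUID.randomUUID().toString();
            if (documentMap.get(docUUID) == null) {
                nulls++;
            }
        }
        return nulls;
    }

    // Alternative: keep the map key for the lookup and hold the generated id separately.
    static int nullLookupsWhenKeyIsKept(Map<String, Map<String, Object>> documentMap) {
        int nulls = 0;
        for (Map.Entry<String, Map<String, Object>> entry : documentMap.entrySet()) {
            String generatedId = UUID.randomUUID().toString();
            Map<String, Object> source = entry.getValue();
            if (source == null || generatedId.isEmpty()) {
                nulls++;
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> docs = sampleDocuments();
        System.out.println("null lookups, key overwritten: " + nullLookupsWhenKeyIsOverwritten(docs));
        System.out.println("null lookups, key kept: " + nullLookupsWhenKeyIsKept(docs));
    }
}
```

If the real loop behaves the same way, every document would reach setSource with a null map, independent of the index name.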

Everything appears to work fine on the client side, with no errors: I log a message as it goes through each directory and I catch exceptions. However, when I look at the ElasticSearch log I see the errors below.
Please note: my log messages record whether an index already exists, based on the error raised when I attempt to create a new index named after the directory.

My index names (from directory names) are in the following format: 'lang.java', 'lang.scala', etc.
Is the index name an issue by any chance?
It creates the indexes fine (as long as the names are lowercase) and goes through the status updates, but here are the errors I get:

[2011-10-16 21:26:41,018][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: initializing ...
[2011-10-16 21:26:41,152][INFO ][plugins ] [Stane, Ezekiel] loaded [mapper-attachments], sites []
[2011-10-16 21:26:43,965][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: initialized
[2011-10-16 21:26:43,965][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: starting ...
[2011-10-16 21:26:44,159][INFO ][transport ] [Stane, Ezekiel] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.96:9300]}
[2011-10-16 21:26:47,227][INFO ][cluster.service ] [Stane, Ezekiel] new_master [Stane, Ezekiel][vG7-i-ZTRhCNu5jBj-bkzQ][inet[/192.168.1.96:9300]], reason: zen-disco-join (elected_as_master)
[2011-10-16 21:26:47,302][INFO ][discovery ] [Stane, Ezekiel] mdrCluster/vG7-i-ZTRhCNu5jBj-bkzQ
[2011-10-16 21:26:47,579][INFO ][http ] [Stane, Ezekiel] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.96:9200]}
[2011-10-16 21:26:47,580][INFO ][node ] [Stane, Ezekiel] {elasticsearch/0.17.6}[18244]: started
[2011-10-16 21:26:48,266][INFO ][gateway ] [Stane, Ezekiel] recovered [21] indices into cluster_state
[2011-10-16 21:27:16,302][DEBUG][action.bulk ] [Stane, Ezekiel] [framework.soa][3] failed to bulk item (index) index {[framework.soa][c464405c-c080-4b51-ac46-e7bd62b698ef][vGEfhWTWSp-ZlDqbg0zbmQ], source[null]}
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from (offset=0, length=4): [110, 117, 108, 108]
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:181)
at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:172)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:512)
at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:269)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:135)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:428)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:341)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
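
The byte values in the exception message can be decoded to see what arrived as the document source. A minimal stdlib-only sketch follows; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class DecodeSourceBytes {

    // Turns the decimal byte values printed in the exception into text.
    static String decode(int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Values copied from "Failed to derive xcontent from (offset=0, length=4)".
        System.out.println(decode(110, 117, 108, 108));
    }
}
```

The four bytes spell the literal text null, which matches source[null] in the DEBUG line. That suggests the source passed to setSource was null, rather than pointing at the index name.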

I have tried to detail it as much as possible. Please let me know if you need anything else.

Thanks for your help in resolving this.

Regards.

Monosij
