I've prepared gigabytes of data as JSON files, and I would like to bulk
load these complex JSON documents into Elasticsearch using
elasticsearch-hadoop and the MapReduce model.
Is this possible?
The data is organized with an "_id" field representing the document id, plus
additional fields, some of them nested (arrays, hashes, etc.). A simple example:
{"_id":"IX111", "name":"john"}\n
{"_id":"IX112", "name":"jane"}\n
{"_id":"IX113", "name":"jerry"}\n
{"_id":"IX114", "name":"jim"}\n
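To make the intended mapping concrete, here is a minimal Python sketch (standard library only) of how newline-delimited records like these could be turned into an Elasticsearch bulk request body, with "_id" moved out of the source and into the action line. The function name `to_bulk_body` is hypothetical and the index and type names are placeholders; it illustrates the bulk format only, not what elasticsearch-hadoop does internally.

```python
import json

def to_bulk_body(lines, index, doc_type):
    """Build a bulk API body from newline-delimited JSON records."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        doc_id = doc.pop("_id")  # the document id goes into the action metadata
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        out.append(json.dumps(action))  # action line
        out.append(json.dumps(doc))     # source line, a JSON object
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(out) + "\n" if out else ""

sample = [
    '{"_id":"IX111", "name":"john"}',
    '{"_id":"IX112", "name":"jane"}',
]
body = to_bulk_body(sample, "tvperf2v1", "v")
print(body, end="")
```

Each record becomes two lines: an action line carrying the id, then the remaining fields as the source object.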
However, in testing it appears that the JSON data is being escaped:
2013-10-05 23:10:37,829 INFO org.elasticsearch.hadoop.rest.BufferedRestClient: Indexing object ["{"_id":"IN188035438","names":[{"instances":1,"captured_at":"2013-10-03T01:50:05Z","value":"John"}\t"]
which is probably why I see error logs on the Elasticsearch cluster like:
2013-10-05_21:11:13.78420 [2013-10-05 21:11:13,751][DEBUG][action.bulk ] [tv_es-search-0] [tvperf2v1][20] failed to execute bulk item (index) index {[tvperf2v1][v][yNFpGzvmR2mMruyucT_e9w], source["{"_id" ...
2013-10-05_21:11:13.78410 org.elasticsearch.index.mapper.MapperParsingException: Malformed content, must start with an object
2013-10-05_21:11:13.78411 at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:479)
2013-10-05_21:11:13.78412 at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:452)
2013-10-05_21:11:13.78413 at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:320)
2013-10-05_21:11:13.78415 at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:401)
2013-10-05_21:11:13.78415 at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:155)
2013-10-05_21:11:13.78416 at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:533)
2013-10-05_21:11:13.78417 at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:418)
2013-10-05_21:11:13.78418 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
2013-10-05_21:11:13.78418 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
2013-10-05_21:11:13.78419 at java.lang.Thread.run(Thread.java:662)
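The symptom in both logs is consistent with the document being serialized twice: a JSON text handed over as a plain string gets encoded again as a JSON string, so the payload starts with a quote character rather than with `{`. Below is a minimal Python sketch of that effect, using only the standard `json` module. It illustrates the suspected failure mode under that assumption and is not a trace of what elasticsearch-hadoop actually does; the helper `starts_with_object` is hypothetical.

```python
import json

raw = '{"_id":"IX111", "name":"john"}'  # one line of the input file

# Parsed first and serialized once: the payload is a JSON object.
once = json.dumps(json.loads(raw))

# Treated as an opaque string and serialized: the inner quotes are escaped
# and the whole document becomes a single JSON string value.
twice = json.dumps(raw)

def starts_with_object(payload):
    """True if the payload's first non-whitespace character opens an object."""
    return payload.lstrip().startswith("{")

print(once, starts_with_object(once))
print(twice, starts_with_object(twice))
```

If this is the cause, the job would need to emit a structured value rather than a string that merely contains JSON text.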