Elasticsearch-hadoop: bulk indexing JSON


(Douglas Moore) #1

I've prepared gigabytes of data as JSON files and I would like to bulk
load these complex JSON documents into Elasticsearch using
elasticsearch-hadoop and the MapReduce model.
Is this possible?

The data is organized with an "_id" field representing the document ID, plus
additional fields, some of them nested (arrays, hashes, etc.). A simple example:
{"_id":"IX111", "name":"john"}\n
{"_id":"IX112", "name":"jane"}\n
{"_id":"IX113", "name":"jerry"}\n
{"_id":"IX114", "name":"jim"}\n

However, in testing it appears that the JSON data is being escaped:

2013-10-05 23:10:37,829 INFO org.elasticsearch.hadoop.rest.BufferedRestClient: Indexing object ["{"_id":"IN188035438","names":[{"instances":1,"captured_at":"2013-10-03T01:50:05Z","value":"John"}\t"]

which is probably why I see error logs on the Elasticsearch cluster like:

2013-10-05_21:11:13.78420 [2013-10-05 21:11:13,751][DEBUG][action.bulk ] [tv_es-search-0] [tvperf2v1][20] failed to execute bulk item (index) index {[tvperf2v1][v][yNFpGzvmR2mMruyucT_e9w], source["{"_id" ...

2013-10-05_21:11:13.78410 org.elasticsearch.index.mapper.MapperParsingException: Malformed content, must start with an object
2013-10-05_21:11:13.78411 at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:479)
2013-10-05_21:11:13.78412 at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:452)
2013-10-05_21:11:13.78413 at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:320)
2013-10-05_21:11:13.78415 at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:401)
2013-10-05_21:11:13.78415 at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:155)
2013-10-05_21:11:13.78416 at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:533)
2013-10-05_21:11:13.78417 at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:418)
2013-10-05_21:11:13.78418 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
2013-10-05_21:11:13.78418 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
2013-10-05_21:11:13.78419 at java.lang.Thread.run(Thread.java:662)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Costin Leau) #2

I've replied in the initial issue that you raised. Streaming of JSON documents is not yet supported. As explained in the
docs [1], the current M/R model expects the data to be broken down into MapWritable objects, which are then converted to
JSON and indexed.
In this case, we would have to take a different route, potentially using the Text itself as the REST payload.
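
To illustrate the distinction described above, here is a minimal sketch in plain Java. It is not the connector's code: a `java.util.Map` stands in for Hadoop's `MapWritable`, and the `quote` and `toJson` helpers are hypothetical stand-ins for whatever JSON conversion the connector performs. It shows why a map of fields serializes to a JSON object, while an already-serialized JSON line that is handed over as a plain text value comes out as an escaped JSON string. That would be consistent with the "must start with an object" error in the original report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain Map stands in for MapWritable, and this tiny
// serializer stands in for the connector's own JSON conversion.
class SerializationSketch {

    // Escape a Java string as a JSON string literal.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    // Remaining control characters use the four-digit hex form.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Serialize a flat map of string fields as a JSON object.
    static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(',');
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Fields supplied as a map: the result starts with '{', i.e. an object.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "john");
        System.out.println(toJson(doc));

        // A pre-built JSON line treated as a text value: the result starts
        // with a double quote, i.e. a JSON string rather than an object.
        String rawLine = "{\"_id\":\"IX111\", \"name\":\"john\"}";
        System.out.println(quote(rawLine));
    }
}
```

The second line printed begins with a double quote, which matches the shape of the `source["{"_id" ...` fragment in the cluster log.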

Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapreduce.html

On 06/10/2013 4:19 AM, Douglas Moore wrote:


--
Costin



(James Richardson) #3

Why bother with Hadoop if you have less than one hard drive's worth of data? Just load the files into ES by iterating over them with a few threads; you will be done in 15 minutes.

James
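
As a rough sketch of the direct-loading approach suggested above, the following plain Java turns newline-delimited documents into request bodies in the Elasticsearch bulk format (an action line followed by a source line per document). Several things here are assumptions for illustration: the class and method names are hypothetical, every line is assumed to begin with a simple string "_id" field as in the sample data from the first post, and the target index (and, on older versions, the type) is assumed to come from the request URL rather than from the action line. A real loader would use a proper JSON parser instead of the regular expression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: builds bulk request bodies from JSON lines.
class BulkPayloadSketch {

    // Assumption: each line starts with a plain string "_id" field
    // (no escaped characters in the id), as in the sample data.
    private static final Pattern LEADING_ID =
            Pattern.compile("^\\{\\s*\"_id\"\\s*:\\s*\"([^\"\\\\]*)\"\\s*,?\\s*");

    // One source line becomes two lines: action metadata, then the document
    // with the "_id" field moved out of the body.
    static String toBulkEntry(String jsonLine) {
        String line = jsonLine.trim();
        Matcher m = LEADING_ID.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("line does not start with a string _id: " + line);
        }
        String id = m.group(1);
        String source = "{" + line.substring(m.end());
        return "{\"index\":{\"_id\":\"" + id + "\"}}\n" + source + "\n";
    }

    // Group lines into request bodies of at most batchSize documents each.
    static List<String> toBulkBodies(List<String> jsonLines, int batchSize) {
        List<String> bodies = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        int count = 0;
        for (String line : jsonLines) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            body.append(toBulkEntry(line));
            if (++count == batchSize) {
                bodies.add(body.toString());
                body.setLength(0);
                count = 0;
            }
        }
        if (count > 0) bodies.add(body.toString());
        return bodies;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "{\"_id\":\"IX111\", \"name\":\"john\"}",
                "{\"_id\":\"IX112\", \"name\":\"jane\"}",
                "{\"_id\":\"IX113\", \"name\":\"jerry\"}");
        for (String body : toBulkBodies(lines, 2)) {
            System.out.print(body);
            System.out.println("-- end of one request body --");
        }
    }
}
```

Each body would then be sent with a POST to the cluster's `_bulk` endpoint, and a few worker threads could each take a share of the input files. The batch size is a tuning knob; the value 2 in `main` is only there to keep the demonstration small.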



(M_20) #4

Hi Guys,

Could you please give me sample Java code for a mapper and reducer in
Elasticsearch-hadoop?
I'd appreciate it.

Thanks



(Costin Leau) #5

Have you looked at the docs?
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapreduce.html

On Fri, Jul 25, 2014 at 11:04 PM, M_20 rastegar.83@gmail.com wrote:



