ES Indexing from Hadoop Issues


(Sudhir Rao) #1

Hi all,

I have a 4-node ES cluster running:

Elasticsearch: 1.5.2
OS: RHEL 6.x
Java: 1.7
CPU: 16 cores
2 machines: 60 GB RAM, 10 TB disk
2 machines: 120 GB RAM, 5 TB disk

I also have a 500-node Hadoop cluster, and I am trying to index data from
Hadoop that is stored in Avro format.

Daily size: 1.2 TB
Hourly size: 40-60 GB

elasticsearch.yml config:

cluster.name: zebra
index.mapping.ignore_malformed: true
index.merge.scheduler.max_thread_count: 1
index.store.throttle.type: none
index.refresh_interval: -1
index.translog.flush_threshold_size: 1024000000
discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
bootstrap.mlockall: true
indices.memory.index_buffer_size: 30%
index.translog.flush_threshold_ops: 50000
index.store.type: mmapfs

Cluster health:

$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "zebra",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 21,
  "active_shards" : 22,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "number_of_pending_tasks" : 0
}

Pig Script:

avro_data = LOAD '$INPUT_PATH' USING AvroStorage();

temp_projection = FOREACH avro_data GENERATE
    our.own.udf.ToJsonString(headers, data) AS data;

STORE temp_projection INTO 'fpti/raw_data' USING
    org.elasticsearch.hadoop.pig.EsStorage(
        'es.resource=fpti/raw_data',
        'es.input.json=true',
        'es.nodes=node1,node2,node3,node4',
        'mapreduce.map.speculative=false',
        'mapreduce.reduce.speculative=false',
        'es.batch.size.bytes=512mb',
        'es.batch.size.entries=1');
When I run the above, around 300 mappers start but none of them complete,
and the job fails every time with the error below. Some documents do get
indexed, though.

Error:

2015-05-20 15:40:20,618 [main] ERROR
org.apache.pig.tools.grunt.GruntParser - ERROR 2999: Unexpected internal
error. Could not write all entries [1/8448] (maybe ES was overloaded?).
Bailing out...

The job does finish, however, when the data size is only a few thousand records.

Please let me know what else I can do to increase my indexing throughput.

regards

#sudhir



(Allan Mitchell) #2

Hi

The error is a Grunt error, which suggests Pig is throwing it, not ES. What
do the Pig logs say? What makes you think ES is the issue?
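
If you want to pull the full task logs to see the real failure, something
like this should work (assuming YARN log aggregation is enabled; the
application id placeholder comes from your job tracker UI):

yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX | less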

I know it works with smaller data, but that also means Pig works with
smaller data, not just ES.
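
On the ES side, a quick sanity check is whether the bulk thread pool is
rejecting requests while the job runs; something like this (column names may
differ slightly across ES 1.x releases):

curl 'node1:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'

A non-zero bulk.rejected count would point at ES pushing back rather than a
Pig-side problem.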

Allan



(Sudhir Rao) #3

I see the following in the Elasticsearch logs:

stop throttling indexing: numMergesInFlight=4, maxNumMerges=5

Indexing does proceed for a few million records before all the mappers
fail; please see the attached error screenshot.
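
For reference, merge activity on the index can be watched while the job runs
(index name taken from the script above):

curl 'http://localhost:9200/fpti/_stats/merge?pretty'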



(Sudhir Rao) #4

Here is the indexing performance I see: it takes 10 minutes 29 seconds to
index 626K records using MapReduce (Pig), which works out to roughly 1,000
documents per second. Is this the expected performance for a 4-node
Elasticsearch cluster?

Output(s):
Successfully stored 626283 records in: "index1/raw_data"

Counters:
Total records written : 626283
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0



(Allan Mitchell) #5

Your original error was around a YARN container being destroyed. My guess
right now is that this is due to memory pressure in Hadoop.
I would look at increasing the heap size and/or the number of reducers in Hadoop.
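
For example, a sketch using the Hadoop 2 / YARN property names, set from
within the Pig script (sizes are illustrative; java.opts is usually around
80% of the container size):

SET mapreduce.map.memory.mb 4096;
SET mapreduce.map.java.opts '-Xmx3276m';
SET mapreduce.reduce.memory.mb 4096;
SET mapreduce.reduce.java.opts '-Xmx3276m';
SET default_parallel 20; -- more reducers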

MapReduce is not known for being the fastest thing on the planet, to be
honest, given there is a lot of overhead. It works nicely in batch mode
over a large dataset where you want distributed compute, but over a smallish
dataset it can feel laggy.

Allan


