No effect from refresh_interval


(Marek Dabrowski) #1

Hello,

My configuration is:
6-node Elasticsearch cluster
OS: CentOS 6.5
JVM: 1.7.0_25

The cluster is working fine: I can index data, run queries, etc. Now I am
testing with a data set of ~50 million documents (~13 GB), and I would like
better indexing performance. To achieve this I changed the refresh_interval
parameter and ran tests with 1s, -1 and 600s. The indexing time is the same
in each case. I checked the index configuration (_settings), and
refresh_interval has the expected value, e.g.:

{
  "smt_20140501_100000_20g_norefresh" : {
    "settings" : {
      "index" : {
        "uuid" : "q3imiZGQTDasQUuMWS8oiw",
        "number_of_replicas" : "1",
        "number_of_shards" : "6",
        "refresh_interval" : "600s",
        "version" : {
          "created" : "1020199"
        }
      }
    }
  }
}
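
For anyone reproducing this, a setting like the one above can be applied and read back through the index settings API. The following is a minimal shell sketch, not the exact commands used in this thread: the function names are hypothetical, the default host h3:9200 is the node name that appears in this thread, and the Content-Type header is an assumption that only newer Elasticsearch releases require.

```shell
#!/bin/sh
# Hypothetical helpers; ES_HOST defaults to a node name used in this thread.

# Print the JSON body for a refresh_interval update, e.g. 1s, 600s or -1.
refresh_body() {
  printf '{"index":{"refresh_interval":"%s"}}' "$1"
}

# Apply a refresh interval to an existing index:
#   set_refresh_interval INDEX VALUE
set_refresh_interval() {
  curl -s -XPUT "${ES_HOST:-h3:9200}/$1/_settings" \
       -H 'Content-Type: application/json' \
       -d "$(refresh_body "$2")"
}

# Read the settings back to confirm the value: show_settings INDEX
show_settings() {
  curl -s "${ES_HOST:-h3:9200}/$1/_settings?pretty"
}
```

For example, `set_refresh_interval smt_20140501_100000_20g_norefresh 600s` followed by `show_settings smt_20140501_100000_20g_norefresh` should report the new value.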

The index creation, the refresh_interval setting and the load are all done
from the same cluster node. Before each test the index is deleted and
recreated with the new refresh_interval value. All cluster nodes log that
the parameter has been changed, e.g.:
[2014-07-16 11:24:09,813][INFO ][index.shard.service ] [h6] [smt_20140501_100000_20g_norefresh][1] updating refresh_interval from [1s] to [-1]
or
[2014-07-16 11:32:32,928][INFO ][index.shard.service ] [h6] [smt_20140501_100000_20g_norefresh][1] updating refresh_interval from [1s] to [10m]

After the test starts, new data is available immediately, and the indexing
time is the same in all 3 cases. I don't know where the problem is. Does
anybody know what is going on?

Regards
Marek

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f7565c36-98c7-4e3e-8132-796f9edfb3fa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Michael McCandless) #2

Which ES version are you using? You should use the latest (soon to be
1.3): there have been a number of bulk-indexing improvements recently.

Are you using the bulk API with multiple/async client threads? Are you
saturating either CPU or IO in your cluster (so that the test is really a
full cluster capacity test)?

Also, the relationship between refresh_interval and indexing performance is
tricky: it turns out that -1 is often a poor choice, because it means your
bulk indexing threads are sometimes tied up flushing segments, whereas with
refreshing enabled a separate thread does that. So a refresh interval of 5s
may be a good choice.
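
To illustrate the question about multiple client threads: one way to drive the bulk API from several concurrent clients is to fan the chunk files out over parallel curl processes. This is a sketch under assumptions, not a tuned loader: the function names and the worker count are hypothetical, it relies on the -P option of xargs (available in GNU and BSD xargs, but not in POSIX), and it assumes chunk file names without spaces, quotes or newlines.

```shell
#!/bin/sh
# Print the bulk endpoint for an index and type: bulk_url INDEX TYPE
bulk_url() {
  printf '%s/%s/%s/_bulk' "${ES_HOST:-h3:9200}" "$1" "$2"
}

# Send chunk files with several concurrent curl processes:
#   parallel_bulk INDEX TYPE WORKERS FILE...
parallel_bulk() {
  url=$(bulk_url "$1" "$2")
  workers=$3
  shift 3
  # One file name per line; xargs keeps up to $workers curl processes running.
  printf '%s\n' "$@" |
    xargs -P "$workers" -I{} curl -s -o /dev/null -XPOST "$url" --data-binary @{}
}
```

A call such as `parallel_bulk my_index num 4 /tmp/SMT*` would keep four bulk requests in flight at a time; watching CPU and IO on the cluster while it runs shows whether the cluster, rather than the client, is the bottleneck.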

Mike McCandless

http://blog.mikemccandless.com



(Marek Dabrowski) #3

Hello Mike,

My ES version is 1.2.1.
I checked the utilization of the nodes in my cluster. Typical values for all nodes are:
Java process CPU utilization: < 6%
OS load: < 1
I/O: < 15 kB/s write

I tested the indexing process with 2 methods:
a) indexing native JSON data (13 GB split into 100 MB chunks)
time for i in /tmp/SMT* ; do
  echo "$i"
  curl -s -XPOST h3:9200/smt_20140501_bulk_json_refresh_600/num/_bulk --data-binary "@$i"
  rm -f "$i"
done

b) indexing CSV data with a Perl script

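use Search::Elasticsearch;

# $idx_name, $idx_type and $data_file are set elsewhere in the script.
my $e = Search::Elasticsearch->new(
    nodes => [ 'h3:9200' ],
);

# Buffer documents and send a bulk request every 10000 documents.
my $bulk = $e->bulk_helper(
    index     => $idx_name,
    type      => $idx_type,
    max_count => 10000,
);

open(my $DATA, '<', $data_file) or die $!;
while (<$DATA>) {
    chomp;

    # One CSV line becomes one document with fields p0..p11.
    my @data = split(',', $_);
    $bulk->index({ source => {
        p0  => $data[0],
        p1  => $data[1],
        p2  => $data[2],
        p3  => $data[3],
        p4  => $data[4],
        p5  => $data[5],
        p6  => $data[6],
        p7  => $data[7],
        p8  => $data[8],
        p9  => $data[9],
        p10 => $data[10],
        p11 => $data[11],
    }});
}
close($DATA);

# Send whatever is still buffered.
$bulk->flush;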

Setting refresh_interval to 600s has no effect in either case: the data is
available immediately. According to the ES documentation, I expected new
data to become available only after 10 minutes and the indexing process to
be faster as a result, but it is not.
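
One way to check when newly indexed documents actually become visible to search is to poll the document count while the load is running. This is a minimal sketch, assuming the index-level _count and _refresh endpoints; the function names and the polling loop are hypothetical and were not part of the original test.

```shell
#!/bin/sh
# Print the URL of an index-level endpoint: index_url INDEX ENDPOINT
index_url() {
  printf '%s/%s/%s' "${ES_HOST:-h3:9200}" "$1" "$2"
}

# Print a timestamp and the searchable document count once per second:
#   watch_count INDEX SECONDS
watch_count() {
  i=0
  while [ "$i" -lt "$2" ]; do
    printf '%s ' "$(date +%T)"
    curl -s "$(index_url "$1" _count)"
    echo
    sleep 1
    i=$((i + 1))
  done
}

# Force a refresh so that everything indexed so far becomes searchable.
force_refresh() {
  curl -s -XPOST "$(index_url "$1" _refresh)"
}
```

If the count reported by `watch_count` grows continuously during the load even though refresh_interval is 600s, something other than the periodic refresh is making the documents visible.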

Regards



(Marek Dabrowski) #4

Hello,

I found the cause of my problem.
With the Perl client, when the index is refreshed depends on the "max_count"
and "max_size" parameters of $e->bulk_helper. The values of these parameters
determine when the refresh is done on the index.

Thanks for the help.

Regards



(Michael McCandless) #5

OK, thanks for bringing closure.

Mike McCandless

http://blog.mikemccandless.com


