How does index refresh work? & asynchronous replication setting

Hello. I don't have any experience posting questions on the internet, and
I'm worried about my poor English.

I tried to find out how index refresh works internally, but I can't figure
it out.

The index refresh interval is very important for optimizing ES performance,
so I want to know how it works and why it is needed.

Could someone explain it to me? Why is index refresh needed? Is it related
to Lucene?

My guess is that by the time data can be searched, an index refresh has
already been done.

And I have one more question.

When we bulk index data into Elasticsearch, I want to use the
asynchronous replication policy.

With this policy, I think some data is indexed on the replicas later.

During that time, a node can ask a replica for that data.

In this case the replica will reply "there is no data".

After that, does the node ask another node again or not? I think it asks
another node, so it could cause a fail-over situation.

So to prevent this, is it better to set the Get operation preference
to _primary?

In that case we can't use multiple replicas to improve search performance.

If I am wrong, please let me know.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hiya

> The index refresh interval is very important for optimizing ES performance,
> so I want to know how it works and why it is needed.

Indexed data only becomes visible to search once it has been written to
a new segment file (although not yet Lucene flushed), and the
IndexReaders in Lucene have been reopened.

This happens automatically every second by default.
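
The interval discussed here is controlled by the index setting `index.refresh_interval`. As a minimal sketch, the snippet below only builds the settings-update request without sending it. The node address `http://localhost:9200`, the index name `logs`, and the helper name `refresh_interval_request` are assumptions made for illustration, not details from this thread.

```python
import json
from urllib.request import Request


def refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT request that sets index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical node address and index name.
req = refresh_interval_request("http://localhost:9200", "logs", "30s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running node; a longer interval trades search freshness for indexing throughput.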

> When we bulk index data into Elasticsearch, I want to use the
> asynchronous replication policy.
>
> With this policy, I think some data is indexed on the replicas later.

Yes - it will be indexed on the primary, but not yet on the replica
shards.

Are you sure you need this? Are you not over optimizing?

> After that, does the node ask another node again or not? I think it asks
> another node, so it could cause a fail-over situation.

No.

> So to prevent this, is it better to set the Get operation preference
> to _primary?

Yes

> In that case we can't use multiple replicas to improve search
> performance.

Well, search will only see the document after a refresh anyway, so this
is probably less important.

clint

--

Hi Clinton~ Thank you for your help.

> Indexed data only becomes visible to search once it has been written to
> a new segment file (although not yet Lucene flushed), and the
> IndexReaders in Lucene have been reopened.
>
> This happens automatically every second by default.

Yes, I know the default is 1s.
I have already changed it to 30s for high-performance indexing, because I
use ES as a log collector.
Our system will put a Kafka consumer in front of ES, so we cannot predict
how many logs will arrive.
In my tests, refreshing the index every second had a large impact on
performance.
My test environment is as follows: 4 nodes of the Amazon m1.xlarge type,
the bulk API, Apache Bench to generate load, and nginx for client load
balancing in front of ES.
Before changing the index refresh interval, ES could index 7000 TPS; after
the change it could index 12000 TPS.

> Are you sure you need this? Are you not over optimizing?

Our log system will be under very heavy load, so I want to find the best
performance for our system.
Indexing performance is very important for us, so I have already optimized
as much as I can.

I have already optimized the following:

Open file limits (server) => 65535
Memory allocation (half of RAM) => 7G (m1.xlarge)
Asynchronous replication (server)
Index refresh => 30s
bootstrap.mlockall: true (don't swap the JVM)
Bulk indexing (client)
Field cache type (soft, for stability) => performance went down a little.
I don't know why. Does anyone know?
Thread pool size => will be fixed; I still need to find the sweet spot
Routing (not needed yet) => it can sometimes be slow because shards don't
hold similar amounts of data; I think it causes unbalanced data volume
Storage optimization (not needed yet)
Timeout (not needed yet)

With these changes, not counting the open file limits change, I got more
than 3 times the indexing performance.
Before the optimizations ES could index 4000 TPS; after them, 14000 TPS.

The data is sent as bulk docs. One doc is 264 bytes. One bulk has 300,000
requests.

> > After that, does the node ask another node again or not? I think it
> > asks another node, so it could cause a fail-over situation.

> No.

Thank you.

I have one more question.
As far as I know, the default is to query the primary shard or one of its
replicas at random.
If that specific shard or replica doesn't reply, will ES ask another shard
or replica?
If not, I think it would be better to retry on the primary or another
replica.

> > So to prevent this, is it better to set the Get operation preference
> > to _primary?

> Yes

> > In that case we can't use multiple replicas to improve search
> > performance.

> Well, search will only see the document after a refresh anyway, so this
> is probably less important.

Either you didn't understand my question or I don't understand how ES works.
If we set the preference to _primary, all requests go to the primary shard
first, and the replicas are not asked for searching.
That means adding more replicas would not improve performance.

If I am wrong, please let me know.
And thank you very much for your kind reply.

--

Sorry for the mistake.
One bulk contains 1000 requests, and I send 300 bulks.
So the total is 300,000 requests.

--

> My test environment is as follows: 4 nodes of the Amazon m1.xlarge type,
> the bulk API, Apache Bench to generate load, and nginx for client load
> balancing in front of ES.
> Before changing the index refresh interval, ES could index 7000 TPS; after
> the change it could index 12000 TPS.

Consider using m2.4xlarge instances, which have provisioned IOPS. The
faster the disks, the better the performance.

> Field cache type (soft, for stability) => performance went down a little.
> I don't know why. Does anyone know?

You don't want this. This causes the field data cache to be reloaded
every time you run a query. That's a huge impact on search.

> Thread pool size => will be fixed; I still need to find the sweet spot
> Routing (not needed yet) => it can sometimes be slow because shards don't
> hold similar amounts of data; I think it causes unbalanced data volume

For log data, you probably want to go for index-per-day/week/month/hour
(whatever suits your loads) rather than using routing.
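
As a rough illustration of the index-per-day idea, a client can derive the target index name from each log event's timestamp. The prefix `logs`, the `YYYY.MM.DD` suffix, and the helper name `daily_index_name` are assumptions chosen for this sketch; any consistent naming scheme would do.

```python
from datetime import datetime, timezone


def daily_index_name(prefix, when):
    """Return an index name such as 'logs-2013.03.12' for the given timestamp."""
    return f"{prefix}-{when.strftime('%Y.%m.%d')}"


# Hypothetical prefix; the date is the day this thread was written.
event_time = datetime(2013, 3, 12, 10, 44, 31, tzinfo=timezone.utc)
print(daily_index_name("logs", event_time))  # logs-2013.03.12
```

Old data can then be dropped by deleting a whole index instead of deleting individual documents.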

> The data is sent as bulk docs. One doc is 264 bytes. One bulk has 300,000
> requests.

I recommend making smaller bulk requests. The entire bulk request has
to be held in memory, which takes away from the memory available for
(eg) searching.

In my tests, the sweet spot for bulk is 1000-5000 docs at a time. The
actual number will depend on your doc size etc, but I found that
performance fell off after a certain size of request.
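
A minimal sketch of batching on the client side, assuming documents arrive as Python dicts. The helper names `batches` and `bulk_body` are made up for illustration. The body uses the bulk API's newline-delimited format with an empty `index` action, which assumes the target index (and, on older versions, the type) is given in the request URL.

```python
import json


def batches(docs, size=1000):
    """Yield lists of at most `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(batch):
    """Serialize one batch as newline-delimited action/source pairs."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(doc))            # document source line
    return "\n".join(lines) + "\n"               # bulk bodies end with a newline


docs = ({"message": f"log line {i}"} for i in range(2500))
sizes = [len(b) for b in batches(docs, size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Varying `size` between runs of the same benchmark is one way to look for the point where throughput stops improving.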

> I have one more question.
> As far as I know, the default is to query the primary shard or one of its
> replicas at random.
> If that specific shard or replica doesn't reply, will ES ask another shard
> or replica?
> If not, I think it would be better to retry on the primary or another
> replica.

If a shard fails, it is removed and reallocated somewhere else. Whether
a search request will retry on another shard, I'm not sure about.

> > > So to prevent this, is it better to set the Get operation preference
> > > to _primary?

> > Yes

> > > In that case we can't use multiple replicas to improve search
> > > performance.

> > Well, search will only see the document after a refresh anyway, so
> > this is probably less important.

> Either you didn't understand my question or I don't understand how ES
> works.
> If we set the preference to _primary, all requests go to the primary
> shard first, and the replicas are not asked for searching.
> That means adding more replicas would not improve performance.

Using async means that a document is indexed on primary, but the request
returns before we confirm that a document has also been indexed on the
replicas.

So this affects GETing a document. However, search only refreshes once
every second anyway. So if you index a document, then immediately
search on the primary, you probably won't see the doc. So for search,
it really doesn't matter whether you're searching primaries or replicas.
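
To make the two settings concrete, here is a small sketch that only builds the request URLs. It assumes the `replication=async` and `preference=_primary` query parameters as they exist in the Elasticsearch versions current at the time of this thread; later releases may not accept them. The host, index, type, and id values and the helper name `doc_url` are hypothetical.

```python
from urllib.parse import urlencode


def doc_url(base_url, index, doc_type, doc_id, **params):
    """Build a document URL with optional query parameters."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    return f"{url}?{urlencode(params)}" if params else url


base = "http://localhost:9200"  # hypothetical node address

# Index request that returns without waiting for the replicas.
index_url = doc_url(base, "logs", "event", "1", replication="async")

# GET that is served by the primary, so it sees the document even if
# the replicas have not indexed it yet.
get_url = doc_url(base, "logs", "event", "1", preference="_primary")

print(index_url)
print(get_url)
```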

clint

--

On Tuesday, March 12, 2013 7:44:31 PM UTC+9, Clinton Gormley wrote:

> Consider using m2.4xlarge instances, which have provisioned IOPS. The
> faster the disks, the better the performance.

OK. I will check m2.4xlarge instances, but I already use provisioned
IOPS (100) on m1.xlarge. It can be set when creating the VM.

> You don't want this. This causes the field data cache to be reloaded
> every time you run a query. That's a huge impact on search.

Oh, really? I must have seen wrong information when I read about this
option in some article. It said the option avoids out-of-memory errors,
because the cached field data is garbage collected first.
But I didn't know it takes effect on every query. Thank you very much.

> For log data, you probably want to go for index-per-day/week/month/hour
> (whatever suits your loads) rather than using routing.

Yes, I agree with you. 

> I recommend making smaller bulk requests. The entire bulk request has
> to be held in memory, which takes away from the memory available for
> (eg) searching.
>
> In my tests, the sweet spot for bulk is 1000-5000 docs at a time. The
> actual number will depend on your doc size etc, but I found that

Sorry for my mistake about the bulk size. My bulk size is 1000 docs
in this test.
I already wrote that in my previous mail. I agree with you, and I will try
to find the sweet spot.

> Using async means that a document is indexed on primary, but the request
> returns before we confirm that a document has also been indexed on the
> replicas.
>
> So this affects GETing a document. However, search only refreshes once
> every second anyway. So if you index a document, then immediately
> search on the primary, you probably won't see the doc. So for search,
> it really doesn't matter whether you're searching primaries or replicas.

OK. You mean that until the index refresh has finished, I won't see the
docs, even though the data exists in the shard or replica.

Think about this situation.

First, I don't refresh every second, for indexing performance (the
interval could be 10s or 20s).
While an index refresh is in progress, the replicas are still indexing
(async) at the same time. How does index refresh work in that case?

Second, for example, suppose I create 30 shards and 60 replicas, and some
requests reach ES.
With the default option, ES searches the primary or one of the 2 replicas
of each shard at random.
So the search requests for a shard are load balanced across 3 nodes (if
we use 90 nodes).
If I add more replicas, the search requests will be load balanced across
more nodes.
So search performance will go up.
But if I set the preference to _primary, ES looks for the data only on the
primary shard and never on the replicas. That means more replicas bring no
benefit. Is that right?

Thank you for your help.

Regards, Kim hee sung. 

--

> Oh, really? I must have seen wrong information when I read about this
> option in some article. It said the option avoids out-of-memory errors,
> because the cached field data is garbage collected first.
> But I didn't know it takes effect on every query. Thank you very much.

Yes - it does free up memory, but with a big performance and garbage
collection cost. Best avoided.

> OK. You mean that until the index refresh has finished, I won't see the
> docs, even though the data exists in the shard or replica.

Yes.

> Think about this situation.
>
> First, I don't refresh every second, for indexing performance (the
> interval could be 10s or 20s).
> While an index refresh is in progress, the replicas are still indexing
> (async) at the same time. How does index refresh work in that case?

Setting refresh to eg 10s or 20s for log data seems fine to me. Do you
really care if you are only seeing logs from 20 seconds ago?

If you do, then you can force a manual refresh with the refresh API.
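
A minimal sketch of such a manual refresh, again only building the request. The node address `http://localhost:9200`, the index name, and the helper name `refresh_request` are assumptions for illustration.

```python
from urllib.request import Request


def refresh_request(base_url, index):
    """Build (but do not send) a POST to the refresh endpoint of one index."""
    return Request(url=f"{base_url}/{index}/_refresh", data=b"", method="POST")


# Hypothetical node address and daily index name.
req = refresh_request("http://localhost:9200", "logs-2013.03.12")
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` against a running node makes recently indexed documents searchable without waiting for the next scheduled refresh.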

> So search performance will go up.
> But if I set the preference to _primary, ES looks for the data only on the
> primary shard and never on the replicas. That means more replicas bring no
> benefit. Is that right?

Yes, but you really don't need to worry about this for search. As above:
who cares if you are seeing log data that is a few seconds old?

clint
