Bulk indexing and search with two different threads


(Mustafa Sener) #1

Hi,
I started two ES nodes and configured an index with 2 shards and 1 replica.
Then I started bulk indexing. While the indexing was still running, I
started another thread that searches the same index. After a while I saw
that the totalHits count was fluctuating. I expected the number of objects
to increase continuously, since the indexing thread only adds items.
However, totalHits sometimes returns fewer items than previous search
requests did.

I repeated the test with a single thread as follows. The thread first
executes a bulk indexing request synchronously and then searches the same
index 500 times. The same thing happens.

Do you have any ideas about what the problem is? Or is this an expected
result?

I think it may be because of replication. When I set the replica count to 0,
the problem goes away.

--
Mustafa Sener
www.ifountain.com


(Shay Banon) #2

Yes, you will see that, because a shard and its replica are not synchronized when it comes to refreshing. One might have refreshed a second ago, while the other refreshed half a second ago. A search round-robins between a shard and its replicas, so you might see different results depending on when each was last refreshed.
(In reply to Mustafa Sener's message of Friday, March 11, 2011 at 1:22 PM.)


(Mustafa Sener) #3

Is there any way to run search operations on the primary shards only?

(In reply to Shay Banon's message of Fri, Mar 11, 2011 at 3:13 PM.)


(Shay Banon) #4

No, there isn't. It could be added, I guess, though I'm not sure why you would want to do that.
(In reply to Mustafa Sener's message of Friday, March 11, 2011 at 3:43 PM.)


(Shay Banon) #5

And, by the way, it's not a matter of data not being synchronized between the shard and the replica, because it is (replication is synchronous). It's the refresh interval that makes changes visible to search (I thought I should make that explicit).
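
The effect described here can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Elasticsearch code: the `ShardCopy` class, the `simulate` function, the batch size, and the alternating refresh schedule are all invented for illustration. Both copies always hold the same documents (synchronous replication), but each copy only exposes what it had at its own last refresh, and searches alternate between the two copies.

```python
import itertools


class ShardCopy:
    """Toy model of one shard copy (primary or replica). Hypothetical, not an ES class."""

    def __init__(self, name):
        self.name = name
        self.indexed = 0   # documents written to this copy
        self.visible = 0   # documents visible to search as of the last refresh

    def index(self, count):
        self.indexed += count

    def refresh(self):
        # A refresh makes everything indexed so far visible to search.
        self.visible = self.indexed

    def search_total_hits(self):
        return self.visible


def simulate(steps, batch=1000):
    """Return the totalHits values seen by a client that searches twice per step."""
    primary, replica = ShardCopy("primary"), ShardCopy("replica")
    routing = itertools.cycle([primary, replica])  # round-robin search routing
    observed = []
    for step in range(steps):
        # Synchronous replication: every write reaches both copies.
        for copy in (primary, replica):
            copy.index(batch)
        # Unsynchronized refresh (assumed schedule): the copies refresh on alternate steps.
        (primary if step % 2 == 0 else replica).refresh()
        for _ in range(2):
            observed.append(next(routing).search_total_hits())
    return observed


if __name__ == "__main__":
    print(simulate(4))  # [1000, 0, 1000, 2000, 3000, 2000, 3000, 4000]
```

The count drops whenever a search lands on the copy that refreshed less recently, even though no document is ever missing from either copy.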
(In reply to Shay Banon's message of Friday, March 11, 2011 at 3:44 PM.)


(Mustafa Sener) #6

You asked why I would want to do that. ES is our main data store, and when
a user clicks one of the refresh buttons in the UI, the results fluctuate.
For example, the first time it shows 4000 items and the next time it shows
3000 items. This causes very big data-related problems.

(In reply to Shay Banon's message of Fri, Mar 11, 2011 at 3:45 PM.)


(Shay Banon) #7

Right, I see. Well, this mainly affects cases where you do bulk loading and data is being added "really fast", which is why the two copies fluctuate so much. Note that adding this capability (searching only on primary shards, which I am not terribly against) loses the ability to scale your searches, since they will always hit the primary shards.

One thing that might help here is the recent addition in master of the ability to change the refresh interval dynamically on a running system. More info here: https://github.com/elasticsearch/elasticsearch/issues/closed#issue/758.

Using the above feature, when you do heavy bulk loading you can disable refreshing, do the bulk loading, and then enable it again. If you want a user to see changes, you can call the refresh API explicitly. That will not guarantee the exact same "refresh" state across the shards, but the difference (since the refresh happens in parallel on all shards) should be small enough not to notice.
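
A minimal sketch of that sequence over the REST API follows. It assumes the update-settings endpoint (`PUT /<index>/_settings` with `index.refresh_interval`) and the refresh endpoint (`POST /<index>/_refresh`) as documented for later Elasticsearch releases; the exact request shape in the build discussed in this thread may differ. The value `"-1"` for disabling refresh, the `"1s"` interval restored afterwards, the host, the index name, and the helper names are all assumptions. The `send` argument is any callable that accepts a `urllib.request.Request`, for example `urllib.request.urlopen`.

```python
import json
import urllib.request


def settings_request(base_url, index, refresh_interval):
    """Build a request that changes index.refresh_interval on a running index."""
    body = json.dumps({"index": {"refresh_interval": refresh_interval}})
    return urllib.request.Request(
        f"{base_url}/{index}/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def refresh_request(base_url, index):
    """Build a request that asks every shard copy of the index to refresh now."""
    return urllib.request.Request(f"{base_url}/{index}/_refresh", data=b"", method="POST")


def bulk_load(send, base_url, index, load, restore_interval="1s"):
    """Disable refresh, run the caller's bulk load, restore refresh, then refresh once."""
    send(settings_request(base_url, index, "-1"))  # assumed value for "disabled"
    try:
        load()  # caller-supplied bulk indexing step
    finally:
        # Re-enable periodic refresh even if the load fails.
        send(settings_request(base_url, index, restore_interval))
    # Explicit refresh so users see the newly loaded data promptly.
    send(refresh_request(base_url, index))


# Example wiring (not executed here; host and index name are placeholders):
#   bulk_load(urllib.request.urlopen, "http://localhost:9200", "items", my_bulk_job)
```

As noted above, the explicit refresh narrows the window in which copies disagree but does not make the refresh atomic across them.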

The reason there is no synchronized refresh is that implementing it would require coordinating a pause between the primary and its replicas, plus some sort of two-phase "commit" process to make sure both refresh at the same time. That is very, very expensive to do and can lead to all sorts of other problems in a distributed environment.

Open an issue for the ability to search only on primary shards. I wanted to add a search feature to prefer locally allocated shards over remote ones anyhow, so this can be just another type of "preference".
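
To make the "preference" idea concrete, here is a toy shard-copy selector. It is a sketch only: the function name, the preference strings (`"round_robin"`, `"primary"`, `"local"`), and the dictionary shape used for shard copies are hypothetical and do not describe the actual Elasticsearch API or its internals.

```python
import itertools


def make_selector(copies, preference="round_robin", local_node=None):
    """Return a function that picks the shard copy to search next.

    copies: list of dicts with "node" and "primary" keys (hypothetical shape).
    """
    if preference == "primary":
        candidates = [c for c in copies if c["primary"]]
    elif preference == "local":
        local = [c for c in copies if c["node"] == local_node]
        candidates = local or copies  # fall back to every copy if none is local
    elif preference == "round_robin":
        candidates = list(copies)
    else:
        raise ValueError(f"unknown preference: {preference}")
    if not candidates:
        raise ValueError("no shard copy matches the preference")
    rotation = itertools.cycle(candidates)
    return lambda: next(rotation)


copies = [
    {"node": "node1", "primary": True},
    {"node": "node2", "primary": False},
]

pick = make_selector(copies, preference="primary")
print([pick()["node"] for _ in range(3)])  # ['node1', 'node1', 'node1']
```

With the `"primary"` preference every search sees the same refresh state, so counts no longer go backwards, at the cost noted earlier: replicas stop sharing the search load.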

-shay.banon
(In reply to Mustafa Sener's message of Friday, March 11, 2011 at 3:59 PM.)


(Mustafa Sener) #8

Thanks, I created an issue.

(In reply to Shay Banon's message of Fri, Mar 11, 2011 at 4:09 PM.)

