Using Elasticsearch as a MySQL cache: does frequent indexing burden ES when the document count is massive?

Hi,
I have built a MySQL cache using Elasticsearch. I am not sure if anybody
else is using ES this way, but it really works for me.
My motivation comes from two concerns with MySQL:

  1. The max connection number of MySQL limits concurrent access when
    massive numbers of users read the database.
  2. Search is hard, especially since I need to perform searches through an
    HTTP API.

So I built a framework which automatically indexes database records into
ES whenever a row gets created/updated/deleted. The trick is
Spring's TransactionSynchronization: each time a transaction commit
completes, the callback triggers an index operation.
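
Roughly, the hook looks like this (a minimal sketch of the idea; the
index/type names and the json building are simplified stand-ins, not my
real code):

    import org.elasticsearch.client.Client;
    import org.springframework.transaction.support.TransactionSynchronizationAdapter;
    import org.springframework.transaction.support.TransactionSynchronizationManager;

    // Call this from inside an active (@Transactional) MySQL transaction.
    public static void indexAfterCommit(final Client client, final String id, final String json) {
        TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronizationAdapter() {
            @Override
            public void afterCommit() {
                // Runs only if the transaction really committed, so ES
                // never sees rows that were rolled back.
                client.prepareIndex("cache", "row", id)   // hypothetical index/type names
                      .setSource(json)
                      .execute()
                      .actionGet();
            }
        });
    }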

It works quite well so far, at least in my unit tests. However, I still
have some concerns:

  1. Will frequent indexing hurt ES? When the database updates frequently,
    let's say 10000 users updating concurrently with a MySQL max connection
    count of 2000, I may perform ES index operations 2000 times per second.
    Will this burden ES too much? The document count will be at most 1.5
    million; I would not support any number bigger than this.

  2. How can I reduce refresh time? I use ES as a cache, which means all
    read operations go to ES instead of MySQL. When users update the
    database, they expect to see the update immediately in the next read.
    Right now I set the refresh interval to 500ms and add a 300ms delay to
    search operations. This works, but I am still afraid that sometimes a
    user will read stale data, since the delay is smaller than the refresh
    interval. Would setting the refresh interval to 100ms make things
    better?

  3. How can I make indexing faster? In my tests, indexing a 4K document
    takes ~120ms. Compared to writing to the database it is slow, but still
    acceptable. However, my unit tests only create about 20 documents; I am
    afraid indexing time will increase significantly once there are millions
    of documents. Is there any way to make it faster? For example, by
    creating a schema for each index?

I used to build my framework on Solr until I found ES; now I will never
look back. Hope you guys keep improving this awesome project!


Hello,

On Thu, May 9, 2013 at 9:15 AM, xin zhang xing5820@gmail.com wrote:


  1. Will frequent indexing hurt ES? When the database updates frequently,
    let's say 10000 users updating concurrently with a MySQL max connection
    count of 2000, I may perform ES index operations 2000 times per second.
    Will this burden ES too much? The document count will be at most 1.5
    million; I would not support any number bigger than this.

Too many index operations shouldn't break ES, but your indexing may fall
behind.

I'd try to use the Bulk API: 2000 single index operations per second is
quite a lot, but one bulk of 2000 items per second isn't that much.
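
Something like the following (a rough sketch against the 0.90-era Java
client API; the index/type names and the RowChange type are made up):

    import java.util.List;

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;

    // One network round-trip for many row changes instead of one per document.
    public static void indexBatch(Client client, List<RowChange> changes) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (RowChange change : changes) {
            bulk.add(client.prepareIndex("cache", "row", change.getId())
                           .setSource(change.toJson()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            // Individual items can fail even when the request as a whole succeeds.
            System.err.println(response.buildFailureMessage());
        }
    }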

2. How can I reduce refresh time? I use ES as a cache, which means all
read operations go to ES instead of MySQL. When users update the database,
they expect to see the update immediately in the next read. Right now I set
the refresh interval to 500ms and add a 300ms delay to search operations.
This works, but I am still afraid that sometimes a user will read stale
data, since the delay is smaller than the refresh interval. Would setting
the refresh interval to 100ms make things better?

It would make things more realtime, but it would also slow down indexing,
as refresh operations are expensive. I think it's a case of test and see.
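
The refresh interval can be changed on a live index, so it's easy to
experiment. A sketch (the index name is hypothetical):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    // Adjust how often ES makes newly indexed documents visible to search.
    client.admin().indices().prepareUpdateSettings("cache")
          .setSettings(ImmutableSettings.settingsBuilder()
                  .put("index.refresh_interval", "100ms")   // or e.g. "1s"; "-1" disables refresh
                  .build())
          .execute()
          .actionGet();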

3. How can I make indexing faster? In my tests, indexing a 4K document
takes ~120ms. Compared to writing to the database it is slow, but still
acceptable. However, my unit tests only create about 20 documents; I am
afraid indexing time will increase significantly once there are millions of
documents. Is there any way to make it faster? For example, by creating a
schema for each index?

Besides keeping the refresh interval as long as you can afford (and it
doesn't seem like you have much room there), I can think of the following:

  • use the bulk API wherever possible
  • use the "create" operation wherever possible
  • if you have replicas, you can use async replication when indexing
  • increase the flush settings for your translog
  • increase your number of shards, and/or the number of segments allowed
    by the merge policy, to decrease the impact of merging

Besides the bulk API, I'm not sure the others will significantly improve
your indexing speed, so it's best to test after each change and see.
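
For the "create" and async-replication points, this is roughly what they
look like on a single index request (again a sketch against the 0.90-era
Java API; note that async replication was removed in later versions):

    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.action.support.replication.ReplicationType;

    // "create" skips the existence check/update path when the document is new,
    // and async replication returns without waiting for replica shards.
    client.prepareIndex("cache", "row", id)
          .setSource(json)
          .setOpType(IndexRequest.OpType.CREATE)
          .setReplicationType(ReplicationType.ASYNC)
          .execute()
          .actionGet();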

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Thanks Radu

I'd try to use the Bulk API, because 2000 single index operations per
second is quite a lot, but a bulk of 2000 items in a second isn't that
much.

Actually I am using the Bulk API. However, it doesn't help much, because
the indexing thread cannot wait too long to fill a batch: we don't know
when the next DB update comes, and waiting too long hurts realtime reads. I
even thought about using an algorithm like the elevator the Linux kernel
uses for I/O scheduling.
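
(What I have in mind is a micro-batcher that flushes when either a size or
a time threshold is hit, so a quiet period never delays a single document
for long. A rough sketch; the RowChange type and the thresholds are made
up:)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    public class MicroBatchIndexer implements Runnable {
        private static final int MAX_ACTIONS = 500;   // cap on documents per bulk request
        private static final long POLL_MS = 100;      // max wait for the first document

        private final BlockingQueue<RowChange> queue = new LinkedBlockingQueue<RowChange>();
        private final Client client;

        public MicroBatchIndexer(Client client) { this.client = client; }

        public void enqueue(RowChange change) { queue.add(change); }

        @Override
        public void run() {
            List<RowChange> batch = new ArrayList<RowChange>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Block briefly for the first document; null means a quiet period.
                    RowChange first = queue.poll(POLL_MS, TimeUnit.MILLISECONDS);
                    if (first == null) continue;
                    batch.add(first);
                    // Grab whatever else queued up, e.g. while the previous bulk ran.
                    queue.drainTo(batch, MAX_ACTIONS - 1);

                    BulkRequestBuilder bulk = client.prepareBulk();
                    for (RowChange c : batch) {
                        bulk.add(client.prepareIndex("cache", "row", c.getId())
                                       .setSource(c.toJson()));
                    }
                    bulk.execute().actionGet();
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }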

  • increase the flush settings for your translog

Could you explain this a little more?

And I am also wondering if I could use a NoSQL database for the cache. My
main requirement is making all read operations go to the cache, in order to
support massive concurrent reads. In my software, MySQL is updated
frequently by user actions, but most DB accesses come from the UI, which
simply reads some data to display.

I don't have very strong search requirements; all I need is term queries
and IN/NOT_IN queries.
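
(Both of those map directly onto ES queries; a sketch with made-up index
and field names:)

    import org.elasticsearch.index.query.QueryBuilders;

    // term query: status == "active"
    client.prepareSearch("cache")
          .setQuery(QueryBuilders.termQuery("status", "active"))
          .execute().actionGet();

    // IN is a terms query; NOT_IN is the same query under bool/mustNot.
    client.prepareSearch("cache")
          .setQuery(QueryBuilders.boolQuery()
                  .must(QueryBuilders.termsQuery("category", "a", "b", "c"))
                  .mustNot(QueryBuilders.termsQuery("region", "x", "y")))
          .execute().actionGet();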


And I do need one-to-many relationships. This is where ES fascinates me
the most; Solr doesn't have them.
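
(Presumably via parent/child documents. A sketch of indexing a child and
querying through it, with hypothetical index, types, and fields; the child
type's mapping must declare "_parent": {"type": "post"}:)

    import org.elasticsearch.index.query.QueryBuilders;

    // Index a child document, routed to its parent's shard.
    client.prepareIndex("cache", "comment", commentId)
          .setParent(postId)
          .setSource(commentJson)
          .execute().actionGet();

    // Find parents that have at least one matching child.
    client.prepareSearch("cache")
          .setQuery(QueryBuilders.hasChildQuery("comment",
                  QueryBuilders.termQuery("author", "alice")))
          .execute().actionGet();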


  1. Why don't you use Apache Commons DBCP for MySQL connection pooling?
    You put a lot of burden on MySQL with 2000 connections; it does not
    scale.

  2. ES is fast, but for that you must not refresh that often. Frequent
    refreshes put a lot of unnecessary burden on ES; it does not scale. You
    cannot guarantee that ES can search an indexed document right after a
    MySQL insert, the way a transaction would. You can use near-realtime
    get, though. Do not set the refresh interval under 1s; it destroys
    performance. What do you mean by a delay of 300ms in search? There is
    no use in delaying searches, and a 100ms interval will make things even
    worse.

  3. To make indexing faster, do not refresh, use short documents, and use
    a fast I/O subsystem. A "schema" does not matter for ES performance (ES
    will autodetect all new field types, and since you use MySQL, the
    number of fields is limited). You can't test with only 20 documents.
    You must test with millions of concurrent writes to MySQL and ES, and
    you will see that your system design will bog down all the writing,
    with awfully slow read responses.

To sum it up: you cannot use ES within a DB transaction without getting
bad performance. Because of that, I recommend using ES asynchronously,
without transactions and without refresh, for best performance. Many MB/sec
are possible with asynchronous bulk indexing. That is where an ES river
comes into play, for example the JDBC river. You should fetch data on a
time schedule with bulk indexing and live with stale data for some
seconds/minutes. For near-realtime reads, use the get API.
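
The get API reads from the translog, so it does not depend on refresh at
all. A sketch (the index/type names are hypothetical):

    import org.elasticsearch.action.get.GetResponse;

    // Realtime get returns the latest version of a document by ID,
    // even before the next refresh makes it searchable.
    GetResponse response = client.prepareGet("cache", "row", id)
                                 .setRealtime(true)   // the default, shown for clarity
                                 .execute().actionGet();
    if (response.isExists()) {
        String json = response.getSourceAsString();
    }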

Jörg


On Thu, May 9, 2013 at 10:45 AM, xin zhang xing5820@gmail.com wrote:


  • increase the flush settings for your translog

Could you explain this a little more?

Yep, the transaction log is a file where ES logs your indexed documents;
they live on disk before being committed (flushed) to the Lucene index.
That flush is pretty expensive, so it's worth increasing the period between
flushes, as long as you don't end up with a translog that's too big. The
translog is also used for replication (i.e., it is used to index the same
data on the replica shards).
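
The flush thresholds are index settings you can raise. A sketch (assuming
the 0.90-era setting names, which may differ in other versions; the index
name is hypothetical):

    import org.elasticsearch.common.settings.ImmutableSettings;

    // Flush less often: after more operations, a bigger translog, or a longer period.
    client.admin().indices().prepareUpdateSettings("cache")
          .setSettings(ImmutableSettings.settingsBuilder()
                  .put("index.translog.flush_threshold_ops", 20000)
                  .put("index.translog.flush_threshold_size", "500mb")
                  .put("index.translog.flush_threshold_period", "60m")
                  .build())
          .execute().actionGet();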

Regarding Jörg's advice: if you need real-time get, it shouldn't care
about refreshing; it should get you the document as soon as it was indexed.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
