Using Elasticsearch as a MySQL cache: does frequent indexing burden ES when the document count is massive?

Hi,
I have built a MySQL cache using Elasticsearch. I am not sure if anybody
else is using ES this way, but it really works for me.
My motivation comes from two concerns with MySQL:

  1. The max connection number of MySQL limits concurrent access when
    massive numbers of users read the database.
  2. Search is hard, especially since I need to perform searches through an
    HTTP API.

So I built a framework which automatically indexes database records into
ES whenever a row gets created/updated/deleted. The trick is
Spring's TransactionSynchronization: each time a transaction commit
completes, the callback triggers an index operation.
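
Roughly, the hook looks like this (a minimal sketch of the idea; the
index/type names and the json building are simplified stand-ins, not my
real code):

    import org.elasticsearch.client.Client;
    import org.springframework.transaction.support.TransactionSynchronizationAdapter;
    import org.springframework.transaction.support.TransactionSynchronizationManager;

    // Call this from inside an active (@Transactional) MySQL transaction.
    public static void indexAfterCommit(final Client client, final String id, final String json) {
        TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronizationAdapter() {
            @Override
            public void afterCommit() {
                // Runs only if the transaction really committed, so ES
                // never sees rows that were rolled back.
                client.prepareIndex("cache", "row", id)   // hypothetical index/type names
                      .setSource(json)
                      .execute()
                      .actionGet();
            }
        });
    }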

It works quite well so far, at least in my unit tests. However, I still
have some concerns:

  1. Will frequent indexing hurt ES? When the database updates frequently,
    let's say 10000 users updating concurrently with a MySQL max connection
    count of 2000, I may perform ES index operations 2000 times per second.
    Will this burden ES too much? The document count will be at most 1.5
    million; I would not support any number bigger than this.

  2. How can I reduce refresh time? I use ES as a cache, which means all
    read operations go to ES instead of MySQL. When users update the
    database, they expect to see the update immediately in the next read.
    Right now I set the refresh interval to 500ms and add a 300ms delay to
    search operations. This works, but I am still afraid that sometimes a
    user will read stale data, since the delay is smaller than the refresh
    interval. Would setting the refresh interval to 100ms make things
    better?

  3. How can I make indexing faster? In my tests, indexing a 4K document
    takes ~120ms. Compared to writing to the database it is slow, but still
    acceptable. However, my unit tests only create about 20 documents; I am
    afraid indexing time will increase significantly once there are millions
    of documents. Is there any way to make it faster? For example, by
    creating a schema for each index?

I used to build my framework on Solr until I found ES; now I will never
look back. Hope you guys keep improving this awesome project!


Hello,

On Thu, May 9, 2013 at 9:15 AM, xin zhang xing5820@gmail.com wrote:


  1. Will frequent indexing hurt ES? When the database updates frequently,
    let's say 10000 users updating concurrently with a MySQL max connection
    count of 2000, I may perform ES index operations 2000 times per second.
    Will this burden ES too much? The document count will be at most 1.5
    million; I would not support any number bigger than this.

Too many index operations shouldn't break ES, but your indexing may fall
behind.

I'd try to use the Bulk API: 2000 single index operations per second is
quite a lot, but one bulk of 2000 items per second isn't that much.
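
Something like the following (a rough sketch against the 0.90-era Java
client API; the index/type names and the RowChange type are made up):

    import java.util.List;

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;

    // One network round-trip for many row changes instead of one per document.
    public static void indexBatch(Client client, List<RowChange> changes) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (RowChange change : changes) {
            bulk.add(client.prepareIndex("cache", "row", change.getId())
                           .setSource(change.toJson()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            // Individual items can fail even when the request as a whole succeeds.
            System.err.println(response.buildFailureMessage());
        }
    }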

2. How can I reduce refresh time? I use ES as a cache, which means all
read operations go to ES instead of MySQL. When users update the database,
they expect to see the update immediately in the next read. Right now I set
the refresh interval to 500ms and add a 300ms delay to search operations.
This works, but I am still afraid that sometimes a user will read stale
data, since the delay is smaller than the refresh interval. Would setting
the refresh interval to 100ms make things better?

It would make things more realtime, but it would also slow down indexing,
as refresh operations are expensive. I think it's a case of test and see.
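
The refresh interval can be changed on a live index, so it's easy to
experiment. A sketch (the index name is hypothetical):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    // Adjust how often ES makes newly indexed documents visible to search.
    client.admin().indices().prepareUpdateSettings("cache")
          .setSettings(ImmutableSettings.settingsBuilder()
                  .put("index.refresh_interval", "100ms")   // or e.g. "1s"; "-1" disables refresh
                  .build())
          .execute()
          .actionGet();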

3. How can I make indexing faster? In my tests, indexing a 4K document
takes ~120ms. Compared to writing to the database it is slow, but still
acceptable. However, my unit tests only create about 20 documents; I am
afraid indexing time will increase significantly once there are millions of
documents. Is there any way to make it faster? For example, by creating a
schema for each index?

Besides keeping the refresh interval as long as you can afford (and it
doesn't seem like you have much room there), I can think of the following:

  • use the bulk API wherever possible
  • use the "create" operation wherever possible
  • if you have replicas, you can use async replication when indexing
  • increase the flush settings for your translog
  • increase your number of shards, and/or the number of segments allowed
    by the merge policy, to decrease the impact of merging

Besides the bulk API, I'm not sure the others will significantly improve
your indexing speed, so it's best to test after each change and see.
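
For the "create" and async-replication points, this is roughly what they
look like on a single index request (again a sketch against the 0.90-era
Java API; note that async replication was removed in later versions):

    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.action.support.replication.ReplicationType;

    // "create" skips the existence check/update path when the document is new,
    // and async replication returns without waiting for replica shards.
    client.prepareIndex("cache", "row", id)
          .setSource(json)
          .setOpType(IndexRequest.OpType.CREATE)
          .setReplicationType(ReplicationType.ASYNC)
          .execute()
          .actionGet();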

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Thanks Radu

I'd try to use the Bulk API, because 2000 single index operations per
second is quite a lot, but a bulk of 2000 items in a second isn't that
much.

Actually I am using the Bulk API. However, it doesn't help much, because
the indexing thread cannot wait too long to fill a batch: we don't know
when the next DB update comes, and waiting too long hurts realtime reads. I
even thought about using an algorithm like the elevator the Linux kernel
uses for I/O scheduling.
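
(What I have in mind is a micro-batcher that flushes when either a size or
a time threshold is hit, so a quiet period never delays a single document
for long. A rough sketch; the RowChange type and the thresholds are made
up:)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.elasticsearch.action.bulk.BulkRequestBuilder;
    import org.elasticsearch.client.Client;

    public class MicroBatchIndexer implements Runnable {
        private static final int MAX_ACTIONS = 500;   // cap on documents per bulk request
        private static final long POLL_MS = 100;      // max wait for the first document

        private final BlockingQueue<RowChange> queue = new LinkedBlockingQueue<RowChange>();
        private final Client client;

        public MicroBatchIndexer(Client client) { this.client = client; }

        public void enqueue(RowChange change) { queue.add(change); }

        @Override
        public void run() {
            List<RowChange> batch = new ArrayList<RowChange>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Block briefly for the first document; null means a quiet period.
                    RowChange first = queue.poll(POLL_MS, TimeUnit.MILLISECONDS);
                    if (first == null) continue;
                    batch.add(first);
                    // Grab whatever else queued up, e.g. while the previous bulk ran.
                    queue.drainTo(batch, MAX_ACTIONS - 1);

                    BulkRequestBuilder bulk = client.prepareBulk();
                    for (RowChange c : batch) {
                        bulk.add(client.prepareIndex("cache", "row", c.getId())
                                       .setSource(c.toJson()));
                    }
                    bulk.execute().actionGet();
                    batch.clear();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }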

  • increase the flush settings for your translog

Could you explain this a little more?

And I am also wondering if I could use a NoSQL database for the cache. My
main requirement is making all read operations go to the cache, in order to
support massive concurrent reads. In my software, MySQL is updated
frequently by user actions, but most DB accesses come from the UI, which
simply reads some data to display.

I don't have very strong search requirements; all I need is term queries
and IN/NOT_IN queries.
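
(Both of those map directly onto ES queries; a sketch with made-up index
and field names:)

    import org.elasticsearch.index.query.QueryBuilders;

    // term query: status == "active"
    client.prepareSearch("cache")
          .setQuery(QueryBuilders.termQuery("status", "active"))
          .execute().actionGet();

    // IN is a terms query; NOT_IN is the same query under bool/mustNot.
    client.prepareSearch("cache")
          .setQuery(QueryBuilders.boolQuery()
                  .must(QueryBuilders.termsQuery("category", "a", "b", "c"))
                  .mustNot(QueryBuilders.termsQuery("region", "x", "y")))
          .execute().actionGet();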


And I do need one-to-many relationships. This is where ES fascinates me
the most; Solr doesn't have them.
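
(Presumably via parent/child documents. A sketch of indexing a child and
querying through it, with hypothetical index, types, and fields; the child
type's mapping must declare "_parent": {"type": "post"}:)

    import org.elasticsearch.index.query.QueryBuilders;

    // Index a child document, routed to its parent's shard.
    client.prepareIndex("cache", "comment", commentId)
          .setParent(postId)
          .setSource(commentJson)
          .execute().actionGet();

    // Find parents that have at least one matching child.
    client.prepareSearch("cache")
          .setQuery(QueryBuilders.hasChildQuery("comment",
                  QueryBuilders.termQuery("author", "alice")))
          .execute().actionGet();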


  1. Why don't you use Apache Commons DBCP for MySQL connection pooling?
    You put a lot of burden on MySQL with 2000 connections; it does not
    scale.

  2. ES is fast, but for that you must not refresh that often. Frequent
    refreshes put a lot of unnecessary burden on ES; it does not scale. You
    cannot guarantee that ES can search an indexed document right after a
    MySQL insert, the way a transaction would. You can use near-realtime
    get, though. Do not set the refresh interval under 1s; it destroys
    performance. What do you mean by a delay of 300ms in search? There is
    no use in delaying searches, and a 100ms interval will make things even
    worse.

  3. To make indexing faster, do not refresh, use short documents, and use
    a fast I/O subsystem. A "schema" does not matter for ES performance (ES
    will autodetect all new field types, and since you use MySQL, the
    number of fields is limited). You can't test with only 20 documents.
    You must test with millions of concurrent writes to MySQL and ES, and
    you will see that your system design will bog down all the writing,
    with awfully slow read responses.

To sum it up: you cannot use ES within a DB transaction without getting
bad performance. Because of that, I recommend using ES asynchronously,
without transactions and without refresh, for best performance. Many MB/sec
are possible with asynchronous bulk indexing. That is where an ES river
comes into play, for example the JDBC river. You should fetch data on a
time schedule with bulk indexing and live with stale data for some
seconds/minutes. For near-realtime reads, use the get API.
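
The get API reads from the translog, so it does not depend on refresh at
all. A sketch (the index/type names are hypothetical):

    import org.elasticsearch.action.get.GetResponse;

    // Realtime get returns the latest version of a document by ID,
    // even before the next refresh makes it searchable.
    GetResponse response = client.prepareGet("cache", "row", id)
                                 .setRealtime(true)   // the default, shown for clarity
                                 .execute().actionGet();
    if (response.isExists()) {
        String json = response.getSourceAsString();
    }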

Jörg


On Thu, May 9, 2013 at 10:45 AM, xin zhang xing5820@gmail.com wrote:


  • increase the flush settings for your translog

Could you explain this a little more?

Yep, the transaction log is a file where ES logs your indexed documents;
they live on disk before being committed (flushed) to the Lucene index.
That flush is pretty expensive, so it's worth increasing the period between
flushes, as long as you don't end up with a translog that's too big. The
translog is also used for replication (i.e., it is used to index the same
data on the replica shards).
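
The flush thresholds are index settings you can raise. A sketch (assuming
the 0.90-era setting names, which may differ in other versions; the index
name is hypothetical):

    import org.elasticsearch.common.settings.ImmutableSettings;

    // Flush less often: after more operations, a bigger translog, or a longer period.
    client.admin().indices().prepareUpdateSettings("cache")
          .setSettings(ImmutableSettings.settingsBuilder()
                  .put("index.translog.flush_threshold_ops", 20000)
                  .put("index.translog.flush_threshold_size", "500mb")
                  .put("index.translog.flush_threshold_period", "60m")
                  .build())
          .execute().actionGet();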

Regarding Jörg's advice: if you need real-time get, it shouldn't care
about refreshing; it should get you the document as soon as it was indexed.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
