Best way to offload indexing from reading node

Hello,

I have an ecommerce store with a relatively large index of products. About
10 000 000 products, 15 GB index size. This index gets updated very often,
maybe 100-1000 updates per second.

What I'm trying to do is set up two servers: one for rapidly updating the
index, and another just for querying. There is no problem if the data
on the "querying" server is relatively stale (up to 4 hours is OK).

However, I can't figure out a way to implement this. When I use replication,
the querying server also suffers from intense write I/O (even on XFS
with a large buffer, or ext4 with commit=60,data=writeback), and queries
sometimes get routed to the "indexing" server.

So far my best guesses are:

  1. Set up both nodes over the same shared FS or Hadoop gateway, write to
    the "indexing" node, read from the "querying" node (maybe even disable data
    alteration with the new 0.19.2 APIs), and don't let them join into a cluster (?).
    Periodically restart the "querying" node so it recovers from the gateway (maybe
    there is a better way to propagate changes?). Use a "memory" index on the
    "querying" node to prevent confusion over changed disk data (?).

  2. Set up simple cluster replication with 1 replica, and just use the
    "preference=_local" search parameter to query the local replica (roughly as
    sketched below).

Would gladly accept any advice on how to achieve this. Thanks!

With 2 servers and 1 replica, both servers will have the same write load,
regardless of which one you use for queries. In your use case, you can disable
auto refresh, refresh manually, and use the bulk API for writes. ES does not have
a write-vs-read server concept, so by far your best option to improve
performance is adding a 3rd server or more disk/CPU/memory, etc. I'd make
sure you actually have a performance problem with queries before diving into
attempting to force ES to work the way you're describing.
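
Roughly along these lines, as a sketch (Python elasticsearch client; the
endpoint, index name, and the fetch_pending_updates() helper are made up):

```python
# Turn off automatic refresh, index with the bulk API, then refresh
# manually on your own schedule.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://indexing-node:9200"])  # hypothetical endpoint

# Disable automatic refresh for the index.
es.indices.put_settings(
    index="products", body={"index": {"refresh_interval": "-1"}}
)

# Push updates in bulk instead of one request per document.
actions = (
    {"_index": "products", "_id": doc["id"], "_source": doc}
    for doc in fetch_pending_updates()  # hypothetical source of pending updates
)
helpers.bulk(es, actions)

# Make the new documents searchable whenever it suits you.
es.indices.refresh(index="products")
```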

If you do have to go that route, you may try handling writes from your
client code rather than relying on replication: write to the 1st server while
querying the 2nd for x minutes, then start reading from the 1st server
while bringing the 2nd up to date, and so on.
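
The switching itself can be as simple as a time-based toggle, something like
this (hypothetical endpoints and interval):

```python
# Alternate which server takes writes and which serves reads, swapping
# roles every SWITCH_MINUTES. Before a swap, the lagging server still has
# to be brought up to date (e.g. by replaying the writes it missed).
import time

SERVERS = ["http://es-1:9200", "http://es-2:9200"]  # hypothetical endpoints
SWITCH_MINUTES = 30                                 # made-up interval

def current_roles():
    """Return (write_url, read_url) for the current time window."""
    window = int(time.time() // (SWITCH_MINUTES * 60))
    return SERVERS[window % 2], SERVERS[(window + 1) % 2]
```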

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Hello Berkay,

Thanks for the advice. I already use bulk indexing with auto refresh disabled,
and refresh from time to time based on the 2nd server's load average. I mostly use
constant_score queries with a terms filter and simple two-field numeric
ordering. From time to time I build relatively large facets
(3000-4000) with regex filters. Xms = Xmx = 16 GB. The disk is two
SSDs in software RAID-1, XFS mounted with noatime. I tried other GC options
as well (settled on "-XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=2
-XX:+AggressiveOpts"). Queries take about 1.5-2 seconds and only then
become cached, but 2 seconds is a bit too much anyway, and I was able to
reproduce a dog-piling effect.
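
For context, the queries look roughly like this (simplified 0.19-style DSL
written as a Python dict; field names and values are made up):

```python
# Simplified shape of the queries: constant_score with a terms filter,
# two-field numeric sort, and a large terms facet restricted by regex.
query_body = {
    "query": {
        "constant_score": {
            "filter": {"terms": {"category_id": [12, 34, 56]}}
        }
    },
    "sort": [
        {"popularity": "desc"},
        {"price": "asc"},
    ],
    "facets": {
        "brands": {
            "terms": {"field": "brand", "size": 4000, "regex": "brand_.*"}
        }
    },
}
```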

What do you think about the 1st option: using 2 servers that are disconnected
in terms of replication over the same shared gateway location, with the 2nd
server disabled for writing?

--
Regards,
Mikhail Sayapin
["I recommend the art of slow reading."]

Hi Mikhail,

I don't think that would help. With a shared gateway, ES still writes to the
local file system, so both servers would have the same load plus the additional
overhead of writing to the gateway.
You may want to try the following:

  • Run a separate ES cluster on each server.
  • Send all writes to the 1st server, and timestamp every write.
  • Have code on the 2nd server that runs periodically, retrieves only the
    changes, and writes them to the second cluster. You can use the scan search
    type and the bulk API to make this process efficient.
  • Use the 2nd server for queries.

This way you can isolate most of the write traffic to the first server and
update the second server only when you want to; a rough sketch is below. Hope
this helps.
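
As a sketch of that sync step (Python client with the scan/bulk helpers;
endpoints, index name, and the timestamp field are made up):

```python
# Periodically pull recent changes from the "write" cluster and push them
# into the "read" cluster using scan + bulk.
from elasticsearch import Elasticsearch, helpers

write_es = Elasticsearch(["http://indexing-node:9200"])  # hypothetical endpoint
read_es = Elasticsearch(["http://querying-node:9200"])   # hypothetical endpoint

def sync_changes(since_timestamp):
    """Copy documents updated after `since_timestamp` to the read cluster."""
    changes = helpers.scan(
        write_es,
        index="products",  # made-up index name
        query={"query": {"range": {"updated_at": {"gte": since_timestamp}}}},
    )
    actions = (
        {"_index": "products", "_id": hit["_id"], "_source": hit["_source"]}
        for hit in changes
    )
    helpers.bulk(read_es, actions)
    read_es.indices.refresh(index="products")  # make the copied docs searchable
```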

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

There isn't an option in Elasticsearch to have a slave that does not "index"
documents. There is a concept of replicating index segment files to slaves, but
effectively this is usually more expensive than indexing itself (which you can
scale by adding more boxes). I talk about it a bit here:

.
