Realtime search structure

I want to vet my setup/plan with you to check that it is sane and optimal, and
see if you have any ideas or suggestions. I'm not currently using
elasticsearch, so this is my first attempt.

I'd like to index 300 million (100GB) documents a day. I'll keep this data
for 30 days. I mostly need filtering sorted by publish date (which could be
several hours behind indexing date). And I'd like to use percolate if it
works well at 3,500 docs/sec.
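(For context, 300 million documents per day averages out to roughly
300,000,000 / 86,400 ≈ 3,470 documents per second, which is where the
3,500/sec figure comes from.)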

The plan is to create 6 hour indexes, with 6 shards and 1 replica each.
Data coming in will be added to the appropriate index by its publish date.
Although some stragglers will come in up to a day after they are published,
most of the data will be inserted into the "newest" 2-3 indexes (the latest
12 hours).
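
Here's a rough sketch of what I have in mind for routing documents into the
right 6-hour index by publish date (the docs-YYYY.MM.DD-HH naming scheme,
the localhost endpoint, and the "doc" type are just placeholders, nothing
is settled yet):

    from datetime import datetime
    import requests

    ES = "http://localhost:9200"   # placeholder endpoint

    def index_name(publish_date: datetime) -> str:
        # Bucket the publish date into one of four 6-hour indexes per day.
        bucket_hour = (publish_date.hour // 6) * 6   # 0, 6, 12, or 18
        return publish_date.strftime("docs-%Y.%m.%d-") + f"{bucket_hour:02d}"

    def create_index(name: str) -> None:
        # 6 shards, 1 replica, per the plan above.
        settings = {"settings": {"number_of_shards": 6, "number_of_replicas": 1}}
        requests.put(f"{ES}/{name}", json=settings)

    def index_doc(doc: dict) -> None:
        # Route the document by its publish date, not its arrival time.
        # The "doc" type segment in the URL is an assumption about the ES version.
        publish_date = datetime.strptime(doc["publish_date"], "%Y-%m-%dT%H:%M:%S")
        requests.post(f"{ES}/{index_name(publish_date)}/doc", json=doc)

    create_index(index_name(datetime.utcnow()))
    index_doc({"publish_date": "2012-04-05T14:30:00", "title": "example document"})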

I was thinking that we'd keep the latest 2-3 indexes memory-based. Once we
roll onto a new 6 hour index, we would update the settings in realtime so
the oldest of the 3 memory-based indexes becomes file-based. I would think
the memory-based indices will help with the high insert rate, but I'm unsure
whether I can convert from a memory-based index to a file-based index in
realtime, and what performance implications this would have.

Every 6 hours, after we create a new index, we'll drop/delete the oldest
(121st) index, which keeps us at 30 days. We'll also batch inserts into
~1MB batches of docs. Some of the documents are larger and many are smaller,
so this should keep it more consistent than bulking by number of documents.
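
Something like this is what I'm picturing for the ~1MB batching and the
rollover delete (again, index names and the endpoint are placeholders, and
the bulk action line may need a _type depending on the ES version):

    import json
    import requests

    ES = "http://localhost:9200"   # placeholder endpoint
    MAX_BATCH_BYTES = 1_000_000    # ~1MB per bulk request

    def flush(lines):
        # The _bulk body is newline-delimited JSON and must end with a newline.
        body = "\n".join(lines) + "\n"
        requests.post(f"{ES}/_bulk", data=body,
                      headers={"Content-Type": "application/x-ndjson"})

    def bulk_index(index, docs):
        # Accumulate action/source pairs until the batch reaches ~1MB, then flush.
        buf, size = [], 0
        for doc in docs:
            action = json.dumps({"index": {"_index": index}})
            source = json.dumps(doc)
            pair_bytes = len(action) + len(source) + 2   # plus two newlines
            if buf and size + pair_bytes > MAX_BATCH_BYTES:
                flush(buf)
                buf, size = [], 0
            buf += [action, source]
            size += pair_bytes
        if buf:
            flush(buf)

    def drop_oldest(index_names):
        # On rollover, delete the oldest index once more than 120 (30 days) exist.
        if len(index_names) > 120:
            requests.delete(f"{ES}/{sorted(index_names)[0]}")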

I'm going to run a test cluster using 3-6 4XL high-memory instances (68GB
memory, 1700GB storage).

Do you think this sounds like a good way to tackle this?
Is there anything we should do when we bulk load the initial 30 days of
data (turn off replicas or commits or something)?
Can you convert memory-store to file-store realtime?
Does bulking by 1MB batches sound reasonable? Should that be more or less?
Any other problems or optimizations you can see?

I figured this is not a new problem and there are others who can explain
how this should be done. So thank you for your time!

Jacob

Hi,

  • I'd suggest not bothering with in-memory indices and converting to
    file-based, etc.
  • For the initial load, starting with 0 replicas and adding the replica
    after the data is indexed is faster (see the sketch after this list).
  • 1MB batch size is OK as a starting point. You can experiment with
    different sizes pretty easily to optimize.
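
A rough sketch of what the replica toggle could look like against the REST
API (the index name, endpoint, and the refresh interval tweak are just
placeholders/assumptions for illustration):

    import requests

    ES = "http://localhost:9200"       # placeholder endpoint
    INDEX = "docs-2012.04.05-12"       # hypothetical 6-hour index name

    # Create the index with no replicas (and a relaxed refresh interval)
    # for the initial backfill.
    requests.put(f"{ES}/{INDEX}", json={
        "settings": {
            "number_of_shards": 6,
            "number_of_replicas": 0,
            "refresh_interval": "30s",   # assumption: slower refresh while loading
        }
    })

    # ... run the bulk load here ...

    # Once the 30 days of data are in, add the replica back and restore the refresh.
    requests.put(f"{ES}/{INDEX}/_settings", json={
        "index": {"number_of_replicas": 1, "refresh_interval": "1s"}
    })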

My 3.14 cents ...

Regards,
Berkay Mollamustafaoglu

mberkay on yahoo, google and skype


Thank you.

One more thing I thought about: we will always sort by publish date. Is
there an optimization to be done here? Can I set the default sort on an
index to avoid pulling all the date fields into memory?


Hi.

When sorting, the values for the field have to be loaded into memory, so
there is no workaround there. What you can do is issue searches only against
the recent indexes, or pick which indexes to query based on the "from" date
of the search and the time range each index is responsible for.

You are going to create 6-shard indexes, which is fine. With a 6 hour time
range though, and 120 indexes, there will be a lot of shards, so you might
need a bigger cluster to handle all the data as time passes. That should be
simple, though, since you can just add more nodes.
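
For example, something like this (the 6-hour naming scheme, the field names,
and the endpoint are just assumptions to illustrate the idea):

    from datetime import datetime, timedelta
    import requests

    ES = "http://localhost:9200"   # placeholder endpoint

    def indexes_for_range(start, end):
        # Yield the 6-hour index names whose windows overlap [start, end].
        t = start.replace(hour=(start.hour // 6) * 6,
                          minute=0, second=0, microsecond=0)
        while t <= end:
            yield t.strftime("docs-%Y.%m.%d-") + f"{t.hour:02d}"
            t += timedelta(hours=6)

    def search_recent(query, start, end):
        # Search only the indexes covering the window, sorted by publish date.
        names = ",".join(indexes_for_range(start, end))
        body = {"query": query, "sort": [{"publish_date": {"order": "desc"}}]}
        return requests.post(f"{ES}/{names}/_search", json=body).json()

    # Example: a search over the last 12 hours only touches 2-3 indexes.
    now = datetime.utcnow()
    hits = search_recent({"match_all": {}}, now - timedelta(hours=12), now)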
