Scaling Elasticsearch for 40GB of data

Greetings,

I am migrating a 40GB volume of mixed metadata and unstructured text into
Elasticsearch (~500,000 documents with metadata + extracted text) and need
some recommendations on setting up a cluster for high availability and
high-performance queries.

I have tried a few basic changes to the default configuration but have not
had a lot of success avoiding:

1.) OOM/heap space errors during indexing (I can only index about 150k docs
during the ETL (Extract-Transform-Load) process before the system bails out,
after 3-4 hours)
2.) Poor full-text query performance (2+ seconds on an index of roughly
30GB, with only 1 concurrent user)

We have two environments set up for test:

Development
1x [ 8GB RAM 4x processor ]

Staging
2x [ 4GB RAM 2x processor ]

My shard and replica settings are as follows:
index.number_of_shards: 6
index.number_of_replicas: 1

In each environment I have ES_HEAP_SIZE set to maximum physical memory
minus 1 GB (so 7168 MB on the 8GB machine, 3072 MB on the 4GB machines).
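For reference, this is roughly how I start each node (the install path here is just a placeholder):

export ES_HEAP_SIZE=7168m    # 8GB machine; 3072m on the 4GB machines
/opt/elasticsearch/bin/elasticsearch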

So, my questions are:

First, for a volume of data like the one described, what size of cluster
(number of nodes, amount of RAM) should I expect to need to support
high-performance queries (<500ms per query, with an expected average of
25 concurrent users)?

Second, after our initial ETL we will only see incremental indexing (new
docs arriving or older docs changing, roughly 200/day). What sort of
shard/replica settings should I use to suit this read-heavy behavior after
launch? My rough understanding of the node/shard/replica relationship is
that more shards generally improve indexing throughput, while more replicas
distributed over the overall node count increase potential query throughput.
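As I understand it, the shard count is fixed when an index is created, but the replica count can be raised later without reindexing, with something along these lines ("myindex" is just a placeholder for our index name):

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '
{ "index": { "number_of_replicas": 2 } }'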

Please forgive me if I've missed something obvious in the docs or the group
list. I'm trying to plan and tune my setup without resorting to a lot of
guess-and-test. I appreciate any pointers you can offer.

Thanks,

-jason

--

Hi Jason,

Just a pair of initial suggestions:

1. You probably want to give ES less memory. Try half your RAM for a start
and see if you still see OOMs.
2. Use tools like SPM for Elasticsearch, or iostat, etc. to see whether there
is a lot of disk reading going on while querying, or whether the JVM GC is
stealing the CPU.
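For example, something along these lines while your queries are running (the process id is just a placeholder):

iostat -xm 5                  # disk utilization and MB/s read, every 5 seconds
jstat -gcutil <es-pid> 5000   # GC activity of the Elasticsearch JVM, every 5 seconds

Heavy reads in iostat point at the disk / not enough OS cache for the index; steadily climbing FGC/GCT columns in jstat point at the heap.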

Otis



--

Thanks for the tips. I'll give them a try and let you know how it goes.

-jason


--

Hi,

I am building an index using ES with 8GB of RAM, 1 node, and 4 shards, and
I am using the bulk API to index my data. I send 200 docs per batch, with
2,000 batches in total, but I am running into performance problems: I get a
Java heap space exception once about 20k docs have been indexed.
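In case it helps, each of my batches looks roughly like this (index, type, and field names are simplified placeholders):

curl -XPOST 'http://localhost:9200/myindex/_bulk' --data-binary '
{ "index": { "_type": "doc", "_id": "1" } }
{ "title": "first doc", "body": "..." }
{ "index": { "_type": "doc", "_id": "2" } }
{ "title": "second doc", "body": "..." }
'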

Can anyone suggest how I can solve this issue?

Thanks in advance