Increasing performance with idle resources

Hi all,

I'm running elasticsearch with a single node on a machine with 24 total
cores (4x6) and 48GB RAM. The searches I'll be talking about here are run
over a pair of indices, each with around 2 million documents (~60GB data).
Each index has 5 shards and 0 replicas. The indices are continuously
updated through the bulk API with a few hundred new documents every second.

When the node is idle (aside from the bulk indexing), Lucene query_string
queries over these indices can take >2s to complete. Using bigdesk, I can
see plenty of memory and CPU unused during these searches.
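For reference, a minimal sketch of the kind of search being described (the index names and query text here are invented, and the endpoint assumes a default local install):

```shell
# Illustrative only: a query_string search across two indices.
curl -XGET 'http://localhost:9200/index-a,index-b/_search' -d '{
  "query": {
    "query_string": { "query": "foo AND (bar OR baz)" }
  }
}'
```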

Are there particular elasticsearch settings I can look at to try and better
utilize the idle resources to complete these searches more quickly? Should
I consider running multiple nodes on this single machine?

Cheers,

Luke

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Luke,
That sounds like a pretty heavy stream of updates, basically re-indexing
all your content every 2 hours.

I am guessing that you are disk bound.

How much memory is allocated to the ES heap? You probably want half
reserved for ES and half reserved for the system.
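For example (a sketch, assuming the stock startup scripts of this era, which read the heap size from the ES_HEAP_SIZE environment variable):

```shell
# Give half of the 48GB machine to the JVM heap, leaving the other
# half for the OS filesystem cache.
export ES_HEAP_SIZE=24g
bin/elasticsearch
```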

Increasing the translog flush threshold parameters may help to reduce the
commits that happen to the segments.
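Something along these lines (a sketch; the threshold values are invented, and the setting names assume the index.translog.* dynamic index settings of this era):

```shell
# Flush the translog less often, so segment commits happen less
# frequently under the heavy bulk-indexing load.
# Index name and values are illustrative.
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "index.translog.flush_threshold_ops": 50000,
  "index.translog.flush_threshold_size": "500mb",
  "index.translog.flush_threshold_period": "30m"
}'
```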

You may also see a benefit keeping the shard count around the number of
cores of the system.

Using filters on the search side where possible will also speed things up,
since they are cached.
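For example (a sketch; the field name and value are made up), a filtered query keeps the cacheable part in a term filter:

```shell
# Term filters are cached by default, so repeated searches sharing the
# same filter only pay for the query part.
curl -XGET 'http://localhost:9200/my_index/_search' -d '{
  "query": {
    "filtered": {
      "query":  { "query_string": { "query": "user entered text" } },
      "filter": { "term": { "status": "published" } }
    }
  }
}'
```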

There are likely other approaches I'm not aware of.

Best Regards,
Paul

On Thursday, April 18, 2013 5:33:13 PM UTC-6, Luke McCarthy wrote:


On Thu, Apr 18, 2013 at 5:48 PM, ppearcy ppearcy@gmail.com wrote:

> Hi Luke,
> That sounds like a pretty heavy stream of updates, basically re-indexing
> all your content every 2 hours.
>
> I am guessing that you are disk bound.

Apparently not, according to bigdesk and iostat.

> How much memory is allocated to the ES heap? You probably want half
> reserved for ES and half reserved for the system.

Half (24GB) is allocated to the heap, but not all of it is being used most
of the time (again, according to bigdesk).

> Increasing the translog flush threshold parameters may help to reduce the
> commits that happen to the segments.

Assuming that this will reduce the availability of the data, I will not be
able to do this (I am expected to return things that were indexed a few
seconds ago…)

> You may also see a benefit keeping the shard count around the number of
> cores of the system.

Shard count? Or segment count? I was under the impression that increasing
either of these would actually reduce search performance.

> Using filters on the search side where possible will also speed things up,
> since they are cached.

Yes, filtered queries are fine. It's only query_string queries that I'd
like to improve performance for (note that wrapping the query_string query
in a filter doesn't seem to improve anything…)

Thanks for your input,

Luke


The translog flush threshold doesn't affect the availability of the data (a
translog flush is different from a commit); instead it causes fewer merges,
at the cost of slower recovery times during a restart. You should also look
at the various merge policy settings and try to tune them for a high index
load.
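As a sketch (the values are invented, and depending on the version these may need to be set at index creation rather than updated live):

```shell
# Let more segments accumulate per tier before merging, trading a little
# search overhead for less merge I/O under constant indexing.
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "index.merge.policy.segments_per_tier": 20,
  "index.merge.policy.max_merge_at_once": 10
}'
```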

I believe there is a search thread per shard, not per segment (I could be
wrong), so I do think increasing your shard count will help in this case,
given the high volume of updates; that's what I'd try.
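Note that the shard count is fixed when an index is created, so trying a higher count means creating a new index and reindexing into it, e.g. (names and numbers illustrative):

```shell
# Create a new index with more shards, so a single search can fan out
# across more of the machine's 24 cores.
curl -XPUT 'http://localhost:9200/my_index_v2' -d '{
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 0
  }
}'
```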

Best Regards,
Paul

On Thursday, April 18, 2013 6:54:44 PM UTC-6, Luke McCarthy wrote:
