General tips on performance tuning

Runar_Myklebust_2 · December 13, 2011, 9:41am

Hi.

I'm currently integrating Elasticsearch into our Java product, a CMS / portal
solution. Until now, we have used a relational database and Hibernate to
serve the index.

The main entity in our model is "content", and the data model is split into
3 index types: metadata, custom data and extracted binary data. This is done
to ensure performance when updating metadata for a large number of content
items, i.e. to be able to update just the metadata.

The index is used for both text search and datasources, that is, queries
that fetch content for the web portal. A web page in the portal typically
consists of several datasources, and will fire a couple of queries like:

{
  "from" : 0,
  "size" : 1,
  "query" : {
    "term" : {
      "contenttype" : "banner"
    }
  },
  "filter" : {
    "bool" : {
      "must" : {
        "terms" : {
          "contentlocations.menuitemkey_numeric" : [ "1954" ]
        }
      }
    }
  },
  "sort" : [ {
    "_score" : {
    }
  } ]
}

{
  "from" : 0,
  "size" : 3,
  "query" : {
    "term" : {
      "contenttype" : "vignette"
    }
  },
  "filter" : {
    "bool" : {
      "must" : {
        "terms" : {
          "contentlocations.menuitemkey_numeric" : [ "1954" ]
        }
      }
    }
  },
  "sort" : [ {
    "_score" : {
    }
  } ]
}

{
  "from" : 0,
  "size" : 10,
  "query" : {
    "term" : {
      "contenttype" : "casestudy"
    }
  },
  "filter" : {
    "bool" : {
      "must" : {
        "terms" : {
          "contentlocations.menuitemkey_numeric" : [ "1954" ]
        }
      }
    }
  },
  "sort" : [ {
    "_score" : {
    }
  } ]
}

{
  "from" : 0,
  "size" : 3,
  "query" : {
    "bool" : {
      "must" : [ {
        "range" : {
          "data_end-date" : {
            "from" : "2011-12-13t00:00:00.000+01:00",
            "to" : null,
            "include_lower" : true,
            "include_upper" : true
          }
        }
      }, {
        "term" : {
          "contenttype" : "event"
        }
      } ]
    }
  },
  "filter" : {
    "bool" : {
      "must" : {
        "terms" : {
          "contentlocations.menuitemkey_numeric" : [ "1939" ]
        }
      }
    }
  },
  "sort" : [ {
    "orderby_data_start-date" : {
      "order" : "asc"
    }
  }, {
    "orderby_title" : {
      "order" : "asc"
    }
  } ]
}

{
  "from" : 0,
  "size" : 10,
  "query" : {
    "match_all" : {
    }
  },
  "filter" : {
    "bool" : {
      "must" : {
        "terms" : {
          "key_numeric" : [ "103636", "103623", "103630", "103975",
            "103974", "105431", "105430", "105429", "105428", "105427" ]
        }
      }
    }
  },
  "sort" : [ {
    "_score" : {
    }
  } ]
}

What I can see so far is that typical text queries get a bit of a performance
boost from Elasticsearch compared to the old DB/Hibernate approach,
while portal queries like the ones above will soon seem slow compared to the
cached Hibernate queries.
I have not done any tuning whatsoever on the Elasticsearch setup; I just
integrated the engine and create a local client with no settings
other than the defaults. The index is also created with default values, and the
queries are built without any specific performance tuning.

For a typical production environment, the number of stored content items will
be from 50,000 to maybe a couple of hundred thousand, and a busy website
may have around a million page requests within the 8 busy hours of a day.

What I would like to know is where to start to get the best possible
performance gains: caching? data structure? queries? node configuration?
What are the typical main areas to watch for bottlenecks and easy gains?

best regards

Runar Myklebust

Karussell1 · December 13, 2011, 11:12am

Hard to see where your problem can be - please include your setup info
(index size, existing RAM, RAM usage, java type+version, ES, ...)

Could you try to remove the sort part and put the query part into the
terms filter (using a match all query) or try to directly put the
terms into the filter section without that boolean wrapper?
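As a rough illustration of the second suggestion, the sketch below builds the first "banner" request body with the sort section removed and the terms filter placed directly under "filter", without the bool wrapper. The class and method names (SimplifiedQuery, build) are hypothetical, and the JSON is assembled by plain string concatenation only to keep the example self-contained; whether this form is actually faster would still have to be measured.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: builds a simplified search body as a JSON string. */
class SimplifiedQuery {

    /** Quotes a value as a JSON string, escaping backslashes and double quotes. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Same request as in the first post, minus "sort" and minus the "bool" filter wrapper. */
    static String build(String contentType, List<String> menuItemKeys, int size) {
        String keys = menuItemKeys.stream()
                .map(SimplifiedQuery::quote)
                .collect(Collectors.joining(", "));
        return "{\n"
                + "  \"from\" : 0,\n"
                + "  \"size\" : " + size + ",\n"
                + "  \"query\" : { \"term\" : { \"contenttype\" : " + quote(contentType) + " } },\n"
                + "  \"filter\" : { \"terms\" : { \"contentlocations.menuitemkey_numeric\" : [ "
                + keys + " ] } }\n"
                + "}";
    }

    public static void main(String[] args) {
        // Prints the simplified body for the "banner" datasource.
        System.out.println(build("banner", Arrays.asList("1954"), 1));
    }
}
```

The other queries in the first post can be rewritten the same way by changing the content type, the key list and the size.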

Also, are the key_numeric values ids?
http://www.elasticsearch.org/guide/reference/query-dsl/ids-query.html

Peter.

Weiwei_Wang · December 16, 2011, 2:38am

Even with filters, under a load of 1500/s, ES will log many slow
queries after a while; please see
https://groups.google.com/group/elasticsearch/t/8841e480f7a3237d

Runar_Myklebust_2 · December 29, 2011, 10:17am

Hi.

At the moment, I am still testing Elasticsearch in a closed
test environment, using JUnit tests with a test instance.

For example, this test adds 3000 documents to the index, then optimizes the index.

This is how the test instance is set up, with no settings other than the defaults
applied:

final Settings settings = ImmutableSettings.settingsBuilder().build();

node = NodeBuilder.nodeBuilder().client( false ).local( true ).data( true ).settings( settings ).build();

The unit-test then executes this query:

{
  "from" : 0,
  "size" : 10,
  "query" : {
    "range" : {
      "key_numeric" : {
        "from" : null,
        "to" : 1295.0,
        "include_lower" : true,
        "include_upper" : false
      }
    }
  }
}

Here "to" is a random value between 1 and 3000.

The result is pretty much the same when running 10 or 1000 samples:

samples: 50
max: 1254
average: 1015.08
median: 1007

I've tried giving the JVM different amounts of memory when running the test,
from 256MB to 2G, but it does not seem to either improve or worsen the
sample times. Are these times expected?
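The test code itself is not shown in this thread, so the following is only a sketch of how figures like the ones above (samples, max, average, median) could be collected, with a few untimed warm-up runs before measuring so that one-off startup costs do not end up in the samples. The names QueryTimer, Stats, measure and summarize are hypothetical, and the empty lambda in main is a placeholder for the real search call.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.IntConsumer;

/** Hypothetical timing helper; not the test code discussed in the thread. */
class QueryTimer {

    /** Summary of the measured samples, in milliseconds. */
    static final class Stats {
        final int samples;
        final long max;
        final double average;
        final double median;

        Stats(int samples, long max, double average, double median) {
            this.samples = samples;
            this.max = max;
            this.average = average;
            this.median = median;
        }
    }

    /** Computes sample count, max, average and median of the given timings. */
    static Stats summarize(long[] millis) {
        if (millis.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        long[] sorted = millis.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        long sum = 0;
        for (long m : sorted) {
            sum += m;
        }
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        return new Stats(n, sorted[n - 1], (double) sum / n, median);
    }

    /** Runs the query untimed for the warm-up rounds, then times the sample runs. */
    static Stats measure(IntConsumer query, int warmups, int samples, long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < warmups; i++) {
            query.accept(1 + random.nextInt(3000));
        }
        long[] millis = new long[samples];
        for (int i = 0; i < samples; i++) {
            int to = 1 + random.nextInt(3000); // upper bound of the range query, 1..3000
            long start = System.nanoTime();
            query.accept(to);
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return summarize(millis);
    }

    public static void main(String[] args) {
        // Placeholder workload: replace the lambda body with the real search call.
        Stats s = measure(to -> { }, 20, 50, 42L);
        System.out.println("samples: " + s.samples);
        System.out.println("max: " + s.max);
        System.out.println("average: " + s.average);
        System.out.println("median: " + s.median);
    }
}
```

Timing only the search call itself, and not index setup or node startup, makes it easier to tell whether the query or the surrounding test is responsible for the measured time.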

Q1 : Which settings on the Node are useful to tune?
Q2 : I often have trouble finding out how to set different settings
when using the Java API. Is there some documentation covering this?

--
mvh
Runar Myklebust

kimchy · December 30, 2011, 12:45pm

Can you share your test code? Those times (if in millis) are strange.
Regarding configuring the node using the Java API: first, you can simply
configure it with a yml/json file, or you can provide the settings directly,
for example the cluster.name setting followed by its value.
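To make the "setting name, then value" idea concrete, here is a small self-contained sketch that collects settings as flat name/value pairs and prints them the way they would appear in a yml file. Only cluster.name comes from the reply above; the class name, the cluster name value and the two index.* values are illustrative assumptions for a single test node. To my knowledge, with the Java API used earlier in this thread each pair would be handed to the settings builder via put(name, value) before build(), but the sketch deliberately does not depend on the Elasticsearch library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of node settings as flat name/value pairs. */
class NodeSettingsSketch {

    /** Setting names mapped to values, as they would appear in a yml/json config file. */
    static Map<String, String> testNodeSettings(String clusterName) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("cluster.name", clusterName);     // the setting named in the reply
        settings.put("index.number_of_shards", "1");   // illustrative value for a single test node
        settings.put("index.number_of_replicas", "0"); // illustrative value, no replicas on one node
        return settings;
    }

    public static void main(String[] args) {
        // Prints one "name: value" line per setting, i.e. the yml form of the same settings.
        // Assumption: in the real test, each pair would instead be passed to the
        // settings builder with put(name, value).
        testNodeSettings("my-test-cluster")
                .forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```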
