Improve Query Speed


(Karussell) #1

Hi,

A simple question, though I know the answer depends on everything.

Here are the index properties from jetwick.com:

  • 10 million tweets in an index with 3 shards (I know I should use one
    index per day ... I'll do that later ;))
  • feeding only ~50-100 tweets per second (Twitter search is the
    limit now), but after every ~400 tweets I do a hard refresh. That is
    necessary so that I can always search (for retweets, duplicates etc.)
    before indexing.
  • ES is started via ** on a 64-bit JVM. The OS still has ~7GB

I would like to create warm-up queries, as is possible in Solr.
Can I implement this in ElasticSearch? Is there an internal event
which tells ElasticSearch that a new Lucene searcher is or should be
used?

What other optimizations regarding JVM, cache or index settings could
I try?

Kind Regards,
Peter.

**
JAVA_OPTS="-Xmx7200m -Xms7200m"
JAVA_OPTS="$JAVA_OPTS -XX:+UseCompressedOops"

JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
JAVA_OPTS="$JAVA_OPTS -XX:+CMSParallelRemarkEnabled"
JAVA_OPTS="$JAVA_OPTS -XX:SurvivorRatio=8"
JAVA_OPTS="$JAVA_OPTS -XX:MaxTenuringThreshold=1"
JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"


(K.B.) #2

Hi,

I've seen an increase in performance using an index refresh request
when many new documents have been added, e.g.:

client.admin().indices().refresh(new RefreshRequest(workingIndex));

(Is this the hard refresh you were mentioning?)

I also noticed that ES caches queries, so if you need certain
things very often you could just send those queries once and they
should be cached from then on.


(Karussell) #3

> I've seen an increase in performance using an index refresh request
> when many new documents have been added, e.g.:
>
> client.admin().indices().refresh(new RefreshRequest(workingIndex));
>
> (Is this the hard refresh you were mentioning?)

Yes, exactly. Normally you shouldn't do that because it decreases
indexing speed, which in my case is acceptable.
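
The "hard refresh after every ~400 tweets" pattern discussed here could be sketched roughly as follows. This is only an illustration in plain Java: `BatchRefresher` and its `refreshAction` hook are hypothetical names, and in a real feeder the hook would presumably wrap the `refresh(new RefreshRequest(...))` call quoted above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Counts indexed documents and runs a refresh action once per full batch. */
public class BatchRefresher {
    private final int batchSize;
    // Hypothetical hook: in practice this would trigger the index refresh.
    private final Runnable refreshAction;
    private int pending = 0;

    public BatchRefresher(int batchSize, Runnable refreshAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.refreshAction = refreshAction;
    }

    /** Call once per indexed document; returns true if a refresh was triggered. */
    public synchronized boolean documentIndexed() {
        pending++;
        if (pending >= batchSize) {
            pending = 0;
            refreshAction.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger refreshes = new AtomicInteger();
        // 400 is the batch size mentioned in the thread; the counter stands in for a real refresh.
        BatchRefresher refresher = new BatchRefresher(400, refreshes::incrementAndGet);
        for (int i = 0; i < 1000; i++) {
            refresher.documentIndexed();
        }
        System.out.println("refreshes: " + refreshes.get());
    }
}
```

A larger batch size means fewer refreshes and less indexing overhead, at the cost of new tweets becoming searchable later.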

> I also noticed that ES caches queries, so if you need certain
> things very often you could just send those queries once and they
> should be cached from then on.

Yes, that is one option. But my problem is the timing. When should I
send those queries?
Every second?

I would like to make sure that users see response times of less than a
second (although even this is a bit too much).
I must be doing something wrong. I'll investigate whether the heavy use
of facets is the problematic part of the query.

BTW: I'm already using a relatively high value (20s) for the
refresh_interval to relax the real-time requirement.

Regards,
Peter.


(K.B.) #4

Are those requests/searches similar somehow? You could try to find out
what the top N queries are, cache their results within the app,
refresh this cache every N seconds or minutes, and return only the
cached values.
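
An application-side cache like the one suggested here might look roughly like this. It is a minimal sketch in plain Java, not tied to any ES API: `TimedQueryCache`, the `search` function and the injected `clock` are all hypothetical names introduced for illustration, and the lambda in `main` merely stands in for a real search request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Caches search results per query string and reloads them after a time-to-live. */
public class TimedQueryCache<R> {
    private static final class Entry<V> {
        final V value;
        final long loadedAt;

        Entry(V value, long loadedAt) {
            this.value = value;
            this.loadedAt = loadedAt;
        }
    }

    private final Map<String, Entry<R>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<String, R> search; // hypothetical: executes the real search
    private final LongSupplier clock;         // injected so the expiry logic can be tested

    public TimedQueryCache(long ttlMillis, Function<String, R> search, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.search = search;
        this.clock = clock;
    }

    /** Returns the cached result, re-running the search if the entry is missing or expired. */
    public R get(String query) {
        long now = clock.getAsLong();
        Entry<R> entry = entries.get(query);
        if (entry == null || now - entry.loadedAt >= ttlMillis) {
            // Two threads may occasionally both reload the same query; harmless for a sketch.
            entry = new Entry<>(search.apply(query), now);
            entries.put(query, entry);
        }
        return entry.value;
    }

    public static void main(String[] args) {
        TimedQueryCache<String> cache = new TimedQueryCache<>(
                30_000L,
                q -> "results for '" + q + "'", // stand-in for a real search request
                System::currentTimeMillis);
        System.out.println(cache.get("elasticsearch")); // runs the search
        System.out.println(cache.get("elasticsearch")); // served from the cache
    }
}
```

Such a cache only helps when the same query strings recur; the time-to-live bounds how stale a cached answer can be.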

If you also use a lot of faceting, you might have a look at the
current trunk - I saw Shay made some improvements regarding facet
speed there.


(K.B.) #5

PS: also keep in mind that ES is still Lucene under the hood.

So I think these tips might work there, too:

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

I mean OS tuning (swapping can occur even with a fixed Java heap size)
and more use of filters where possible.


(Shay Banon) #6

There currently isn't an option in ES to run warm-up queries when a new reader is created. This one is tricky to get right (you want it to happen only when you really need it), but it's certainly planned.


(K.B.) #7

@Shay:

Does ES performance depend on the order in which filters are applied,
or is the order not important? I only know the behaviour from Solr, and
there even the order of the filters mattered.


(Karussell) #8

@K.B.: thanks for those tips, I'll have a look at the Lucene-level
tuning tips again :)

I'm already running the ES snapshot, but I'll git pull to a more
recent version.

> Are those requests/searches similar somehow?

Yes. They have similar facet fields, but the query terms are totally
different.

On 3 Apr., 07:56, Shay Banon shay.ba...@elasticsearch.com wrote:

> There isn't an option to have warm up queries when a new reader is created in ES currently. This one is tricky to get right (you want to do it only when you really want to), but it's certainly planned.

Shay, thanks for the information!

Do you know any other trick for improving query speed while
indexing at the same time?

Or should I go with polling some default queries every 30 sec or so?

And assuming my 10 million tweet index is split up by time - say into 3
indices - this won't really improve query speed without adding
hardware, right?
(Except when I know in advance which indices the query should go to ...
hmm, I'll think about that.)

Regards,
Peter


(Shay Banon) #9

Scheduling queries every X seconds is good for now.
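
Scheduling warm-up queries every X seconds, as suggested here, could be sketched along these lines. This is a plain-Java illustration under stated assumptions: `WarmupScheduler` and the `runQuery` callback are hypothetical names, the query strings are made up, and a real setup would issue actual search requests inside the callback.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Periodically replays a fixed list of queries to keep caches warm. */
public class WarmupScheduler implements AutoCloseable {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs every query immediately, then again after each delay period. */
    public void start(List<String> queries, Consumer<String> runQuery,
                      long period, TimeUnit unit) {
        executor.scheduleWithFixedDelay(() -> {
            for (String query : queries) {
                try {
                    runQuery.accept(query);
                } catch (RuntimeException e) {
                    // An uncaught exception would cancel all further scheduled runs.
                    System.err.println("warm-up query failed: " + query + " (" + e + ")");
                }
            }
        }, 0, period, unit);
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        try (WarmupScheduler warmup = new WarmupScheduler()) {
            // Hypothetical queries; the callback stands in for a real search request.
            warmup.start(Arrays.asList("java", "elasticsearch"),
                    q -> System.out.println("warming: " + q),
                    30, TimeUnit.SECONDS);
            Thread.sleep(200); // give the first round time to run before shutting down
        }
    }
}
```

A fixed delay (rather than a fixed rate) avoids piling up warm-up rounds if the queries themselves are slow while indexing is busy.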


(Karussell) #10

On 4 Apr., 19:08, Shay Banon shay.ba...@elasticsearch.com wrote:

> Scheduling queries every X seconds is good for now.

Thanks, ok.

What else would you suggest to tune query performance? Any cache
settings like there are in Solr?

BTW: I investigated the facets, and performance (after some hacking)
is now much better :)


(Karussell) #11

I'll investigate the index settings a bit more now:

http://elasticsearch.karmi.cz/blog/2011/03/23/update-settings.html

E.g. decreasing mergeFactor should increase query speed (and slow down
indexing).


(Shay Banon) #12

Yes, decreasing the merge factor will increase query speed, since fewer segments mean less work when searching a single shard, but it will make indexing slower and more expensive.

Another option to improve search speed is to reduce the term interval and play with the term divisor. Lowering the term interval will improve search performance but will cause more memory to be used (which you can control with the divisor). Both can also be set dynamically, though the term interval only applies to newly indexed docs.
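
The interval/divisor trade-off can be made concrete with some rough arithmetic. The sketch below assumes the Lucene 3.x behaviour as I understand it: every `interval`-th term is written to the term index, and a reader loads only every `divisor`-th of those entries into memory. The class name, the method names and the figure of 10 million unique terms are made up for illustration and are not taken from the thread.

```java
/** Back-of-the-envelope estimate for term index interval and divisor settings. */
public class TermIndexEstimate {

    /** Approximate number of term-index entries a reader keeps in memory. */
    public static long entriesInMemory(long uniqueTerms, int interval, int divisor) {
        long step = (long) interval * divisor;
        return (uniqueTerms + step - 1) / step; // ceiling division
    }

    /** Rough upper bound on terms scanned after seeking to the nearest loaded entry. */
    public static long maxTermsScanned(int interval, int divisor) {
        return (long) interval * divisor;
    }

    public static void main(String[] args) {
        long uniqueTerms = 10_000_000L; // hypothetical number of unique terms in a segment
        int[][] settings = { {128, 1}, {32, 1}, {32, 4} }; // {interval, divisor}
        for (int[] s : settings) {
            System.out.printf("interval=%d divisor=%d -> ~%d entries in memory, up to %d terms scanned%n",
                    s[0], s[1],
                    entriesInMemory(uniqueTerms, s[0], s[1]),
                    maxTermsScanned(s[0], s[1]));
        }
    }
}
```

Under these assumptions, a smaller interval shortens the scan per term lookup but multiplies the entries held in memory, while a larger divisor trades that memory back for longer scans.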

